Progressive Spatio-temporal Perception for Audio-Visual Question Answering
Guangyao Li, Wenxuan Hou, Di Hu

TL;DR
This paper introduces PSTP-Net, a progressive network that identifies key spatio-temporal regions in videos for improved audio-visual question answering by focusing on question-relevant content.
Contribution
The paper proposes a three-module network that progressively selects question-relevant video segments and spatial regions, improving AVQA performance by filtering out irrelevant content.
Findings
Outperforms existing AVQA models on the MUSIC-AVQA and AVQA datasets.
Demonstrates improved accuracy and efficiency in question answering.
Validates the effectiveness of progressive spatio-temporal perception modules.
Abstract
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering it. Conversely, focusing only on the question-aware audio-visual content removes this interference and lets the model answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the audio-visual segments most relevant to the given question. Then, a spatial region…
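The temporal segment selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each segment and the question are already encoded as fixed-length vectors, and it swaps in plain cosine similarity with top-k selection in place of whatever learned scoring the paper uses. The names `cosine` and `select_segments` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_segments(question_vec, segment_vecs, k):
    """Return the indices of the k segments most similar to the question.

    Indices are returned in temporal (ascending) order so that later
    stages still see the selected segments as a time-ordered sequence.
    """
    scores = [cosine(question_vec, seg) for seg in segment_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (assumed, not from the paper): one question vector, four segments.
question = [1.0, 0.0]
segments = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [0.9, 0.0]]
selected = select_segments(question, segments, k=2)
print(selected)
```

In this toy case the second and fourth segments point in nearly the same direction as the question vector, so they are the ones kept. A later spatial stage would then operate only on those segments.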
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Advanced Image and Video Retrieval Techniques
