Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu

TL;DR
This paper introduces a new large-scale dataset and a spatio-temporal grounded network for audio-visual question answering, demonstrating improved multimodal reasoning over videos.
Contribution
The paper presents the MUSIC-AVQA dataset and a novel network architecture for AVQA, advancing multimodal understanding and reasoning in dynamic scenes.
Findings
The proposed model outperforms recent audio (A-), visual (V-), and audio-visual (AV-) question answering approaches.
The dataset enables comprehensive evaluation of audio-visual reasoning.
Multisensory perception enhances question answering performance.
Abstract
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning over different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-, V-, and AVQA approaches. We believe that our built dataset has the potential to serve as testbed for evaluating and promoting…
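To make the dataset description more concrete, the following is a minimal sketch of how template-based question-answer pairs could be represented and scored per question modality. It is an illustration only, not the authors' data format or evaluation code: the `QAPair` class, its field names, the `<object>` slot syntax, the modality labels, and the toy questions and predictions are all assumptions invented for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One hypothetical question-answer record (all field names are assumptions)."""
    video_id: str
    modality: str    # assumed labels: "audio", "visual", "audio-visual"
    template: str    # question template containing an <object> slot
    slot_value: str  # word substituted into the slot
    answer: str      # ground-truth answer

    def question(self) -> str:
        """Fill the template slot to produce the final question text."""
        return self.template.replace("<object>", self.slot_value)


def accuracy_by_modality(pairs, predictions):
    """Return the fraction of exact-match answers for each question modality."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, predicted in zip(pairs, predictions):
        total[pair.modality] += 1
        if predicted == pair.answer:
            correct[pair.modality] += 1
    return {modality: correct[modality] / total[modality] for modality in total}


# Toy records and predictions, invented for illustration (not taken from MUSIC-AVQA).
toy_pairs = [
    QAPair("vid_a", "audio", "Is the <object> louder than the other instrument?", "guitar", "yes"),
    QAPair("vid_a", "visual", "How many <object>s are in the video?", "guitar", "two"),
    QAPair("vid_b", "audio-visual", "Is the <object> on the left making a sound?", "violin", "no"),
    QAPair("vid_b", "audio-visual", "Is the <object> on the left making a sound?", "flute", "yes"),
]
toy_predictions = ["yes", "one", "no", "no"]

scores = accuracy_by_modality(toy_pairs, toy_predictions)
for modality, score in scores.items():
    print(f"{modality}: {score:.2f}")
```

Reporting accuracy separately for each modality, rather than as a single number, is one simple way to check whether a model gains from combining audio and visual input, which is the kind of comparison the abstract describes.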
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Video Analysis and Summarization
