Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
Guangyao Li, Henghui Du, and Di Hu

TL;DR
This paper introduces TSPM, a model that enhances audio-visual question answering by perceiving key cues through temporal and spatial modules, improving understanding and answer accuracy in complex videos.
Contribution
The paper proposes a novel Temporal-Spatial Perception Model that aligns questions with relevant audio-visual cues using semantic prompts and cross-modal interaction.
Findings
TSPM outperforms existing methods on multiple AVQA benchmarks.
It identifies the critical temporal segments and spatial targets for complex questions.
It improves scene understanding and answer accuracy on multimodal videos.
Abstract
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in…
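The idea of perceiving question-relevant temporal cues can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the declarative prompt and each video segment have already been embedded into a shared space by some encoder, and it simply keeps the k segments most similar to the prompt. The function names `cosine` and `select_key_segments`, the top-k rule, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_key_segments(prompt_emb, segment_embs, k):
    """Return the indices of the k segments most similar to the prompt.

    Hypothetical helper: the indices are returned in temporal order so that
    downstream modules still see the segments as a time-ordered sequence.
    """
    scores = [cosine(prompt_emb, seg) for seg in segment_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Toy example: one 2-d prompt embedding and four 2-d segment embeddings.
prompt = [1.0, 0.0]
segments = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
key_indices, scores = select_key_segments(prompt, segments, k=2)
print(key_indices)
```

In this toy setup the first and third segments point in nearly the same direction as the prompt, so they are the ones retained. A real system would use learned embeddings and could replace the hard top-k choice with soft attention weights.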
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
