Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

Guangyao Li, Henghui Du, and Di Hu

arXiv:2407.20693 · cs.CV · July 31, 2024

TL;DR

This paper introduces TSPM, a model that enhances audio-visual question answering by perceiving key cues through temporal and spatial modules, improving understanding and answer accuracy in complex videos.

Contribution

The paper proposes a novel Temporal-Spatial Perception Model that aligns questions with relevant audio-visual cues using semantic prompts and cross-modal interaction.

Findings

01 Outperforms existing methods on multiple AVQA benchmarks.
02 Effectively identifies critical segments and targets for complex questions.
03 Improves scene understanding and answer accuracy in multimodal videos.

Abstract

The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in…
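The idea of keeping only the segments most relevant to a question can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each video segment and the declarative prompt have already been embedded into a shared space by some encoder, and it swaps in plain cosine similarity with top-k selection as a stand-in for the temporal perception module. The names `cosine` and `select_key_segments`, and the toy feature values, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_key_segments(segment_feats, prompt_feat, k):
    """Return indices of the k segments most similar to the prompt, in temporal order.

    Assumption: segment_feats[i] is the embedding of the i-th temporal segment and
    prompt_feat is the embedding of the declarative prompt, both in the same space.
    """
    scores = [cosine(f, prompt_feat) for f in segment_feats]
    # Rank segment indices by similarity, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top k, then restore temporal order for downstream processing.
    return sorted(ranked[:k])


# Toy 2-D features standing in for real encoder outputs (hypothetical values).
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
prompt = [1.0, 0.0]
key_segments = select_key_segments(segments, prompt, k=2)
```

With these toy features, segments 0 and 1 point almost in the same direction as the prompt, so they are the two selected. A real system would replace the toy vectors with features from a visual-language pretrained model.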

Code & Models

Repositories

gewu-lab/tspm (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques