Answering Diverse Questions via Text Attached with Key Audio-Visual Clues
Qilang Ye, Zitong Yu, Xin Liu

TL;DR
This paper introduces a mutual correlation distillation framework for audio-visual question answering that enhances multimodal fusion and reduces overfitting, leading to improved performance on AVQA datasets.
Contribution
The proposed MCD framework effectively aligns audio-visual-text features and decouples dependencies, improving AVQA accuracy and robustness over existing methods.
Findings
Outperforms state-of-the-art AVQA methods on the MUSIC-AVQA and AVQA datasets.
Removing deep audio-visual features during inference reduces overfitting.
Hierarchical local feature extraction enhances question relevance.
Abstract
Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer. Although mining deeper layers of audio-visual information to interact with questions facilitates the multimodal fusion process, the redundancy of audio-visual parameters tends to reduce the generalization of the inference engine to multiple question-answer pairs in a single video. Indeed, the naturally heterogeneous relationship between audio-visual content and text makes perfect fusion challenging. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework for performing mutual correlation distillation (MCD) to aid question inference. MCD is divided into three main steps: 1) first, the residual structure is utilized to enhance the…
Taxonomy
Topics: Speech and dialogue systems · Video Analysis and Summarization · Advanced Text Analysis Techniques
Methods: Knowledge Distillation · ALIGN
