Answering Diverse Questions via Text Attached with Key Audio-Visual Clues

Qilang Ye; Zitong Yu; Xin Liu

arXiv:2403.06679 · cs.CV · March 12, 2024 · 1 citation

TL;DR

This paper introduces a mutual correlation distillation framework for audio-visual question answering that enhances multimodal fusion and reduces overfitting, leading to improved performance on AVQA datasets.

Contribution

The proposed MCD framework effectively aligns audio-visual-text features and decouples dependencies, improving AVQA accuracy and robustness over existing methods.

Findings

01. Outperforms state-of-the-art AVQA methods on Music-AVQA and AVQA datasets.
02. Removing deep audio-visual features during inference reduces overfitting.
03. Hierarchical local feature extraction enhances question relevance.

Abstract

Audio-visual question answering (AVQA) requires reference to video content and auditory information, followed by correlating the question to predict the most precise answer. Although mining deeper layers of audio-visual information to interact with questions facilitates the multimodal fusion process, the redundancy of audio-visual parameters tends to reduce the generalization of the inference engine to multiple question-answer pairs in a single video. Indeed, the natural heterogeneous relationship between audiovisuals and text makes perfect fusion challenging. To prevent high-level audio-visual semantics from weakening the network's adaptability to diverse question types, we propose a framework for performing mutual correlation distillation (MCD) to aid question inference. MCD is divided into three main steps: 1) first, the residual structure is utilized to enhance the…

Code & Models

Repositories

rikeilong/mcd-foravqa
Official

Taxonomy

Topics: Speech and dialogue systems · Video Analysis and Summarization · Advanced Text Analysis Techniques

Methods: Knowledge Distillation · ALIGN