Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering

Jie Ma; Min Hu; Pinghui Wang; Wangchun Sun; Lingyun Song; Hongbin Pei; Jun Liu; Youtian Du

arXiv:2404.12020 · cs.CV · March 6, 2025 · 3 cites

TL;DR

This paper introduces a new dataset, MUSIC-AVQA-R, and a debiasing architecture for audio-visual question answering. Together they address dataset biases, improve robustness, and reach state-of-the-art performance.

Contribution

The paper presents a novel dataset with distribution shifts and a multifaceted debiasing strategy, advancing robustness in AVQA systems.

Findings

01. Achieved a 9.32% improvement on MUSIC-AVQA-R

02. Demonstrated robustness against dataset biases

03. Validated the plug-and-play capability of the debiasing strategy

Abstract

Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning task, demanding intelligent systems to accurately respond to natural language queries based on audio-video input pairs. Nevertheless, prevalent AVQA approaches are prone to overlearning dataset biases, resulting in poor robustness. Furthermore, current datasets may not provide a precise diagnostic for these methods. To tackle these challenges, firstly, we propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and subsequently introducing distribution shifts to split questions. The former leads to a large, diverse test space, while the latter results in a comprehensive robustness evaluation on rare, frequent, and overall questions. Secondly, we propose a robust architecture that utilizes a multifaceted cycle collaborative debiasing…
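
The abstract describes splitting questions so that robustness can be measured on rare, frequent, and overall questions. As a rough illustration of what such a split could look like, the sketch below groups samples by question type and labels an answer "frequent" when its count exceeds a multiple of the mean answer count in its group. The threshold rule, the `alpha` value, the function name, and the `type` and `answer` field names are all assumptions made for this example; this summary does not confirm how the authors construct their split.

```python
from collections import Counter, defaultdict


def split_by_answer_frequency(samples, alpha=1.2):
    """Split QA samples into (frequent, rare) lists per question type.

    Hypothetical rule: within each question type, an answer is "frequent"
    if its count is greater than alpha times the mean answer count.
    """
    # Group samples by their question type.
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["type"]].append(sample)

    frequent, rare = [], []
    for group in by_type.values():
        # Count how often each answer occurs in this question type.
        counts = Counter(sample["answer"] for sample in group)
        mean_count = sum(counts.values()) / len(counts)
        for sample in group:
            if counts[sample["answer"]] > alpha * mean_count:
                frequent.append(sample)
            else:
                rare.append(sample)
    return frequent, rare


# Toy data: "two" dominates the counting questions.
demo = (
    [{"type": "counting", "answer": "two"}] * 4
    + [{"type": "counting", "answer": "one"},
       {"type": "counting", "answer": "three"}]
)
frequent, rare = split_by_answer_frequency(demo)
print(len(frequent), len(rare))  # prints: 4 2
```

Reporting accuracy separately on the `frequent` and `rare` lists, as well as on their union, would make it visible when a model relies on answer priors rather than on the audio and video inputs.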

Code & Models

Repositories

reml-group/music-avqa-r
PyTorch · Official

Videos

Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Subtitles and Audiovisual Media