Multi-Domain Audio Question Answering Benchmark Toward Acoustic Content Reasoning

Chao-Han Huck Yang; Sreyan Ghosh; Qing Wang; Jaeyeon Kim; Hengyi Hong; Sonal Kumar; Guirui Zhong; Zhifeng Kong; S Sakshi; Vaibhavi Lokegaonkar; Oriol Nieto; Ramani Duraiswami; Dinesh Manocha; Gunhee Kim; Jun Du; Rafael Valle; Bryan Catanzaro

arXiv:2505.07365 · cs.SD · March 10, 2026

TL;DR

This paper introduces a comprehensive multi-domain audio question answering benchmark designed to evaluate and improve audio-language models' reasoning abilities across diverse acoustic scenes, from bioacoustics to complex soundscapes.

Contribution

The paper presents a new multi-domain AQA dataset with three subsets (Bioacoustics, Temporal Soundscapes, and Complex QA), an evaluation protocol, and baseline systems to advance acoustic content reasoning in audio-language models.

Findings

01. Baseline results on the development set vary strongly across models and subsets.

02. Preliminary results highlight challenges in complex acoustic reasoning.

03. The benchmark sets a foundation for future improvements in audio understanding.

Abstract

We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set are compared, showing strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity, which are crucial for enabling AI agents to perceive and interact about the…
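
The abstract names the metric (top-1 accuracy with answer-shuffling robustness) but this page does not spell out the procedure. The following is a minimal sketch of one plausible reading, assuming multiple-choice items and a model exposed as a plain callable. The item layout, the `predict` interface, the function name `shuffled_top1_accuracy`, and the number of shuffles are all hypothetical and are not taken from the challenge's official scoring code.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical item layout: (question, choices, index of the correct choice).
Item = Tuple[str, Sequence[str], int]
# Hypothetical model interface: given a question and the ordered choices,
# return the position of the chosen answer within that ordering.
Predictor = Callable[[str, Sequence[str]], int]


def shuffled_top1_accuracy(
    items: Sequence[Item],
    predict: Predictor,
    n_shuffles: int = 4,  # assumed value, not from the paper
    seed: int = 0,
) -> float:
    """Top-1 accuracy averaged over random reorderings of the answer choices."""
    rng = random.Random(seed)
    correct = 0
    total = 0
    for question, choices, answer_idx in items:
        for _ in range(n_shuffles):
            # order[k] is the original index of the choice shown at position k.
            order = list(range(len(choices)))
            rng.shuffle(order)
            shuffled = [choices[i] for i in order]
            picked = predict(question, shuffled)
            # Map the picked position back to the original choice index.
            correct += int(order[picked] == answer_idx)
            total += 1
    return correct / total if total else 0.0
```

Under this sketch, a model that keys on answer content scores the same regardless of ordering, while a model that favors a fixed position (for example, always the first option) is penalized, which is the kind of robustness the shuffling is presumably meant to probe.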

Code & Models

Datasets

PeacefulData/2025_DCASE_AudioQA_Official
dataset · 77 dl

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training