Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu

TL;DR
This paper introduces ADU-Bench, a comprehensive benchmark with over 20,000 audio dialogues to evaluate large audio-language models' open-ended understanding across multiple scenarios, skills, languages, and ambiguity types.
Contribution
It presents the first dedicated benchmark for open-ended audio dialogue understanding, including ambiguity handling and multilingual evaluation, filling a critical gap in LALM assessment.
Findings
Existing LALMs struggle with mathematical symbols and formulas.
LALMs have difficulty understanding human roleplay and behavior.
Challenges remain in multilingual comprehension and ambiguity resolution.
Abstract
Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. This potential broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark for evaluating LALMs on open-ended audio dialogue understanding is still absent. To address this gap, we propose the Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose the evaluation of ambiguity handling in audio dialogues that express different intentions beyond the same literal meaning of…
Taxonomy
Topics: Music and Audio Processing · Speech and dialogue systems · Speech Recognition and Synthesis
