Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Kuofeng Gao; Shu-Tao Xia; Ke Xu; Philip Torr; Jindong Gu

arXiv:2412.05167 · cs.AI · July 29, 2025

TL;DR

This paper introduces ADU-Bench, a comprehensive benchmark with over 20,000 audio dialogues to evaluate large audio-language models' open-ended understanding across multiple scenarios, skills, languages, and ambiguity types.

Contribution

It presents the first dedicated benchmark for open-ended audio dialogue understanding, including ambiguity handling and multilingual evaluation, filling a critical gap in LALM assessment.

Findings

01 Existing LALMs struggle with mathematical symbols and formulas.

02 LALMs have difficulty understanding human roleplay and behavior.

03 Challenges remain in multilingual comprehension and ambiguity resolution.

Abstract

Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. This potential broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark for evaluating LALMs on open-ended audio dialogue understanding is still lacking. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose the evaluation of ambiguity handling in audio dialogues that expresses different intentions beyond the same literal meaning of…

Code & Models

Datasets

KuofengGao/ADU-Bench
dataset · 1.6k downloads

Videos

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Taxonomy

Topics: Music and Audio Processing · Speech and dialogue systems · Speech Recognition and Synthesis