Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction

Advait Gosai; Tyler Vuong; Utkarsh Tyagi; Steven Li; Wenjia You; Miheer Bavare; Arda Uçar; Zhongwang Fang; Brian Jang; Bing Liu; Yunzhong He

arXiv:2512.14865·cs.SD·December 18, 2025

TL;DR

Audio MultiChallenge is an open-source benchmark for evaluating end-to-end spoken dialogue systems on natural multi-turn conversations. Frontier models still fall short on it: the best reaches a pass rate of only 54.65%.

Contribution

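It extends the text-based MultiChallenge framework (Inference Memory, Instruction Retention, Self Coherence) to the audio modality, adds a Voice Editing axis that tests robustness to mid-utterance speech repairs and backtracking, and introduces Audio-Cue challenges that require recalling ambient sounds, all on a large-scale, natural speech dataset.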

Findings

01

Frontier models perform poorly on the benchmark: the best achieves a pass rate of only 54.65%.

02

Models struggle most with the newly introduced challenges, namely the Voice Editing axis and the Audio-Cue challenges.

03

Self Coherence decreases as audio context length increases.
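
The findings above are stated as pass rates, overall and per axis. As a minimal sketch of that arithmetic, assuming each evaluated conversation yields one (axis, passed) outcome, the rates could be aggregated as follows. The function name and the toy outcomes are hypothetical and are not taken from the paper or its dataset; the benchmark's actual scoring procedure may differ.

```python
from collections import defaultdict


def pass_rate_by_axis(results):
    """Return {axis: percentage of passed outcomes} for (axis, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for axis, passed in results:
        totals[axis] += 1
        passes[axis] += int(passed)
    return {axis: 100.0 * passes[axis] / totals[axis] for axis in totals}


# Made-up outcomes for illustration only; these are NOT results from the paper.
toy = [
    ("Voice Editing", True),
    ("Voice Editing", False),
    ("Self Coherence", True),
    ("Self Coherence", True),
    ("Inference Memory", False),
    ("Inference Memory", True),
    ("Inference Memory", True),
    ("Inference Memory", False),
]

rates = pass_rate_by_axis(toy)
# Overall rate: passed outcomes over all outcomes, as a percentage.
overall = 100.0 * sum(passed for _, passed in toy) / len(toy)

for axis, rate in rates.items():
    print(f"{axis}: {rate:.2f}%")
print(f"Overall: {overall:.2f}%")
```

Reporting the per-axis rates next to the overall figure shows whether a low overall score is driven by one axis, which is the kind of comparison the second finding makes.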

Abstract

End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis, Voice Editing, that tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and…

Code & Models

Datasets

ScaleAI/audiomc (dataset · 251 downloads)

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis