Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction
Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, Yunzhong He

TL;DR
Audio MultiChallenge is an open-source benchmark for evaluating end-to-end spoken dialogue systems on natural multi-turn conversations. Its results expose where current frontier models fall short.
Contribution
It extends the text-based MultiChallenge framework to the audio modality with a large-scale natural speech dataset, adding a Voice Editing axis that tests robustness to mid-utterance speech repairs and Audio-Cue challenges that require recalling ambient sounds.
Findings
Frontier models perform poorly on the benchmark, with the best achieving only a 54.65% pass rate.
Models struggle most with the newly introduced Voice Editing axis and the Audio-Cue challenges.
Self Coherence decreases as audio context length increases.
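As a rough illustration of how a pass rate like the one above can be broken down by axis, the sketch below aggregates binary pass/fail outcomes per axis. It assumes each conversation is graded as a simple pass or fail; the benchmark's actual scoring procedure is not described in this summary. The function name `pass_rate_by_axis` and the toy outcomes are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def pass_rate_by_axis(results):
    """Return {axis: percentage of passed items} from (axis, passed) pairs.

    Hypothetical helper: assumes one binary pass/fail verdict per conversation.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for axis, passed in results:
        totals[axis] += 1
        passes[axis] += int(passed)
    return {axis: 100.0 * passes[axis] / totals[axis] for axis in totals}


# Toy, made-up outcomes for illustration only (not results from the paper).
toy_results = [
    ("Voice Editing", True),
    ("Voice Editing", False),
    ("Inference Memory", True),
    ("Inference Memory", True),
    ("Self Coherence", False),
    ("Self Coherence", True),
    ("Instruction Retention", True),
    ("Instruction Retention", False),
]

rates = pass_rate_by_axis(toy_results)

# Overall pass rate: passed conversations divided by all conversations.
overall = 100.0 * sum(passed for _, passed in toy_results) / len(toy_results)

for axis, rate in sorted(rates.items()):
    print(f"{axis}: {rate:.2f}%")
print(f"Overall: {overall:.2f}%")
```

Reporting per-axis rates alongside the overall figure makes it easier to see which capability drives a low aggregate score.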
Abstract
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis
