WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild
Linhao Zhang, Jian Zhang, Bokai Lei, Chuhan Wu, Aiwei Liu, Wei Jia, Xiao Zhou

TL;DR
WildSpeech-Bench introduces a comprehensive evaluation benchmark for end-to-end speech LLMs, addressing speech-specific challenges and enabling more accurate assessment of model performance in real-world spoken scenarios.
Contribution
It is the first benchmark specifically designed to evaluate end-to-end speech LLMs with speech-specific phenomena and a query-aware evaluation method.
Findings
Significant performance differences across speech models.
Enhanced evaluation accuracy with query-aware methods.
Diverse real-world speech data improves assessment.
Abstract
Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities for direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we introduce the first comprehensive benchmark designed to systematically evaluate end-to-end speech LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation…
Peer Reviews
Decision: Submitted to ICLR 2026
- The authors develop a pipeline to create test samples with life-like synthetic speech by combining TTS and noise.
- The authors complement the synthetic samples with real recorded speech.
- The authors show that SOTA SLMs perform significantly worse on noisy samples, which reveals their fragility despite strong reported results.
- The set of benchmarked systems is relatively small, missing notable models like Moshi.
- My main concern is the lack of evaluation of the realism of the synthetic noisy audio. All of the results of this paper depend on this factor. Since real speech is already used for a subset, why not synthesize another small subset with the real speech as the speaker prompt? That would give a direct comparison of the impact of real speech vs. synthetic noisy samples.
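The noisy samples discussed in this review come from combining TTS output with noise. As a rough illustration only, the Python sketch below mixes a noise clip into a speech signal at a chosen signal-to-noise ratio. The function names (`mean_power`, `mix_at_snr`), the looping of short noise clips, and the SNR-based scaling are assumptions made for illustration; the reviews do not say how the paper's pipeline sets noise levels.

```python
import math


def mean_power(signal):
    """Mean squared amplitude of a sequence of audio samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the result has the requested SNR in dB.

    Hypothetical helper: both inputs are plain sequences of float samples
    at the same sample rate. Short noise clips are looped to cover the
    utterance (an assumption, not a detail taken from the paper).
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Loop the noise if it is shorter than the speech, then trim to length.
    repeats = math.ceil(len(speech) / len(noise))
    noise = (list(noise) * repeats)[: len(speech)]

    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise clip is silent; cannot scale to a target SNR")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)

    return [s + gain * n for s, n in zip(speech, noise)]
```

Generating the same utterance at several `snr_db` values, once from a real recording and once from a synthetic voice, is one way the real-versus-synthetic comparison requested above could be set up.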
1. The paper proposes the evaluation of end-to-end SpeechLLMs, which are rapidly emerging but lack standardized benchmarks.
2. The benchmark construction is well-motivated and systematic, combining real user queries from WildChat, voice cloning, paralinguistic diversity, and noise realism.
3. The use of customized, query-specific rubrics improves correlation with human judgments (Pearson r = 0.86), a clear methodological improvement over prior automatic evaluations.
4. The comparison of multiple
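The agreement figure cited in this review (Pearson r = 0.86) is a correlation between automatic scores and human ratings of the same responses. A minimal sketch of that computation follows. The function name `pearson_r` and any scores passed to it are hypothetical, and this is the textbook formula rather than the authors' evaluation code.

```python
import math


def pearson_r(auto_scores, human_scores):
    """Pearson correlation between two equally long lists of scores."""
    n = len(auto_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")

    mean_auto = sum(auto_scores) / n
    mean_human = sum(human_scores) / n

    # Sums of centered products and centered squares.
    covariance = sum(
        (a - mean_auto) * (h - mean_human)
        for a, h in zip(auto_scores, human_scores)
    )
    var_auto = sum((a - mean_auto) ** 2 for a in auto_scores)
    var_human = sum((h - mean_human) ** 2 for h in human_scores)

    if var_auto == 0 or var_human == 0:
        raise ValueError("correlation is undefined when one list is constant")

    return covariance / math.sqrt(var_auto * var_human)
```

Values near 1 mean the automatic judge ranks and spaces responses much as human raters do; the statistic does not show whether the two use the same absolute scale.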
1. The dataset (1.1k samples) is small relative to modern benchmarks. While quality is prioritized, the paper lacks analysis of statistical coverage or how well this scale generalizes across domains or accents. Expansion beyond English is critical.
2. Despite inclusion of 100 human recordings, most data come from TTS (CosyVoice), which constrains prosodic and emotional realism. Results may overestimate model robustness to human variability.
3. Although the authors mitigate transcription bias vi
The benchmark is built upon a dataset of real user interactions (WildChat), ensuring the queries reflect genuine spoken language use cases and user behavior. The benchmark systematically incorporates paralinguistic phenomena and acoustic challenges that are largely ignored by text-centric benchmarks, providing a more comprehensive test of model capabilities.
1. The benchmark is relatively small (1100 queries, with only 100 for paralinguistic features) and relies heavily on synthesized speech from only two base speakers.
2. The benchmark focuses exclusively on single-turn interactions, omitting the critical aspect of multi-turn conversations, which is a fundamental element of conversational AI and a key use case for voice assistants.
3. Both the dataset curation and the core evaluation protocol depend heavily on another LLM (GPT-4o-mini). This re
