WildSpeech-Bench: Benchmarking End-to-End SpeechLLMs in the Wild

Linhao Zhang; Jian Zhang; Bokai Lei; Chuhan Wu; Aiwei Liu; Wei Jia; Xiao Zhou

arXiv:2506.21875 · cs.CL · September 29, 2025

TL;DR

WildSpeech-Bench introduces a comprehensive evaluation benchmark for end-to-end speech LLMs, addressing speech-specific challenges and enabling more accurate assessment of model performance in real-world spoken scenarios.

Contribution

It is the first benchmark specifically designed to evaluate end-to-end speech LLMs with speech-specific phenomena and a query-aware evaluation method.

Findings

01

Significant performance differences across speech models.

02

Enhanced evaluation accuracy with query-aware methods.

03

Diverse real-world speech data improves assessment.

Abstract

Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities in direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we introduce the first comprehensive benchmark designed to systematically evaluate end-to-end speech LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The authors develop a pipeline to create test samples with life-like synthetic speech by combining TTS and noise.
- The authors complement the synthetic samples with real recorded speech.
- The authors show that SOTA SLMs exhibit significantly degraded performance when processing samples with noise, showing their fragility despite strong reported results.
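The first strength mentions combining TTS output with noise. The page does not describe the authors' actual mixing procedure, so the following is only a minimal sketch of one common approach: scaling a noise track so the mixture reaches a chosen signal-to-noise ratio. The function names (`power`, `mix_at_snr`), the toy sine-wave "speech", and the 10 dB target are all illustrative assumptions, not details from the paper.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB.

    Hypothetical helper for illustration; not the benchmark's pipeline.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")
    # SNR_dB = 10 * log10(P_speech / (g^2 * P_noise)), solved for the gain g.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy stand-ins: a 220 Hz tone as "speech" and Gaussian samples as "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the added noise and check the SNR that was actually achieved.
residual = [m - s for m, s in zip(mixed, speech)]
achieved_snr_db = 10 * math.log10(power(speech) / power(residual))
```

Sweeping `snr_db` over several values would be one way to probe the degradation under noise that the reviewer describes, though whether synthetic mixes of this kind match real acoustic conditions is exactly the open question raised under Weaknesses.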

Weaknesses

- The set of benchmarked systems is relatively small, missing notable models like Moshi.
- My main concern is the lack of evaluation of the realism of the synthetic noisy audio. All of the results of this paper depend on this factor. Since real speech is already used for a subset, why not synthesize another small subset with the real speech as the speaker prompt? That would allow a direct comparison of the impact of real speech vs. synthetic noisy samples.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The paper proposes an evaluation of end-to-end SpeechLLMs, which are rapidly emerging but lack standardized benchmarks.
2. The benchmark construction is well-motivated and systematic, combining real user queries from WildChat, voice cloning, paralinguistic diversity, and noise realism.
3. The use of customized, query-specific rubrics improves correlation with human judgments (Pearson r = 0.86), a clear methodological improvement over prior automatic evaluations.
4. The comparison of multiple
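The Pearson r = 0.86 cited above measures how closely automatic scores track human ratings. As a rough illustration of that statistic (not the authors' evaluation code), the sketch below computes Pearson's r from its definition. The function name `pearson_r` and both score lists are made-up placeholders, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-query scores: an automatic judge vs. human annotators.
auto_scores = [7.5, 3.0, 9.0, 5.5, 6.0, 2.0]
human_scores = [8, 3, 9, 5, 7, 2]

r = pearson_r(auto_scores, human_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means the automatic judge ranks and spaces responses much as humans do; comparing r with and without query-specific rubrics is the kind of contrast the reviewer credits.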

Weaknesses

1. The dataset (1.1k samples) is small relative to modern benchmarks. While quality is prioritized, the paper lacks analysis of statistical coverage or how well this scale generalizes across domains or accents. Expansion beyond English is critical.
2. Despite inclusion of 100 human recordings, most data come from TTS (CosyVoice), which constrains prosodic and emotional realism. Results may overestimate model robustness to human variability.
3. Although the authors mitigate transcription bias vi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The benchmark is built upon a dataset of real user interactions (WildChat), ensuring the queries reflect genuine spoken language use cases and user behavior. The benchmark systematically incorporates paralinguistic phenomena and acoustic challenges that are largely ignored by text-centric benchmarks, providing a more comprehensive test of model capabilities.

Weaknesses

1. The benchmark is relatively small (1100 queries, with only 100 for paralinguistic features) and relies heavily on synthesized speech from only two base speakers.
2. The benchmark focuses exclusively on single-turn interactions, omitting the critical aspect of multi-turn conversations, which is a fundamental element of conversational AI and a key use case for voice assistants.
3. Both the dataset curation and the core evaluation protocol depend heavily on another LLM (GPT-4o-mini). This re
