SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu

TL;DR
SpeechJudge introduces a large-scale human feedback dataset, a benchmark, and a reward model to improve speech naturalness judgment, addressing the gap between current metrics and human perception in speech synthesis.
Contribution
The paper presents SpeechJudge, a comprehensive framework including a dataset, benchmark, and reward model that significantly enhances alignment with human judgments in speech naturalness.
Findings
SpeechJudge-GRM achieves 77.2% accuracy on the SpeechJudge-Eval benchmark.
Existing metrics and AudioLLMs perform poorly, reaching less than 70% agreement with human judgments.
The reward model can be used to improve speech synthesis models' alignment with human preferences.
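To make the accuracy and agreement figures above concrete, here is a minimal Python sketch of one way a naturalness scorer could be compared against human preference pairs. It is an illustration under stated assumptions, not the paper's evaluation code: the `PreferencePair` schema, the `pairwise_accuracy` helper, the file names, and the toy scores are all hypothetical. Ties are skipped, in line with a review's remark that tie annotations are excluded from the evaluation subset.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class PreferencePair:
    """One human comparison of two synthesized utterances (hypothetical schema)."""
    audio_a: str
    audio_b: str
    human_choice: str  # "a", "b", or "tie"


def pairwise_accuracy(
    pairs: Iterable[PreferencePair],
    score_fn: Callable[[str], float],
) -> Optional[float]:
    """Fraction of non-tie pairs where the higher-scored audio matches the human choice.

    Returns None when there are no non-tie pairs to evaluate.
    """
    correct = 0
    total = 0
    for pair in pairs:
        if pair.human_choice == "tie":
            continue  # assumption: ties are dropped before scoring
        score_a = score_fn(pair.audio_a)
        score_b = score_fn(pair.audio_b)
        # Equal scores fall through to "b"; a real evaluation would need an explicit rule.
        predicted = "a" if score_a > score_b else "b"
        correct += int(predicted == pair.human_choice)
        total += 1
    return correct / total if total else None


# Toy naturalness scores standing in for any metric or reward model (made-up values).
toy_scores = {"x.wav": 4.1, "y.wav": 3.2, "z.wav": 2.0}

pairs = [
    PreferencePair("x.wav", "y.wav", "a"),    # scorer agrees with the annotator
    PreferencePair("y.wav", "z.wav", "b"),    # scorer disagrees
    PreferencePair("x.wav", "z.wav", "tie"),  # skipped
    PreferencePair("z.wav", "x.wav", "b"),    # scorer agrees
]

accuracy = pairwise_accuracy(pairs, toy_scores.__getitem__)
print(f"{accuracy:.3f}")  # prints 0.667 (2 of 3 non-tie pairs)
```

Any scorer that maps an audio file to a single number can be plugged in as `score_fn`, which is why a plain callable is used here rather than a model-specific interface.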
Abstract
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness--one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness…
Peer Reviews
Decision: ICLR 2026 Poster
1. Provides the community with a much needed resource for building automatic TTS evaluators that can be incorporated into the TTS training. 2. The work is comprehensive in that it shows an entire pipeline with data collection, evaluation of existing models, building a new evaluation model, and then training a new TTS system on it. The latter stages lend credibility to the primary resource - the dataset.
1. I have some concerns about the ability of any annotator to assess fine-grained naturalness in a second language (L2) - this ability will vary widely across annotators. This is a significant challenge for anyone to collect this kind of data, but it does make me wonder more about evaluation over the Mandarin, English, and code-switched sets all being together, since we would expect increased differential agreement in L2 settings. The paper would be stronger if it addressed this directly i
Novel and Comprehensive Resource Creation: The construction of SpeechJudge-Data—with diverse TTS models, multilingual support, and dual annotations (intelligibility/naturalness)—fills a key void in large-scale naturalness-focused human feedback corpora for speech synthesis. Rigorous Benchmark Design: SpeechJudge-Eval provides a standardized, high-quality evaluation framework that exposes limitations of existing metrics and AudioLLMs, offering clear direction for future improvements. Effective Re
Limited Analysis of Cross-Lingual and Expressive Speech Performance: While the dataset includes cross-lingual and expressive samples, the paper lacks in-depth analysis of how SpeechJudge-GRM performs across these specific subsets, leaving uncertainty about its generalizability to diverse linguistic and stylistic scenarios. Insufficient Comparison with State-of-the-Art AudioLLM Judges: The evaluation of existing AudioLLMs focuses primarily on zero-shot performance with basic prompts; a more thoro
The collected large-scale human preference dataset can serve as a valuable resource for research on the automatic assessment of synthesized speech quality. The paper verifies its effectiveness both as a benchmark and as training data for building a model to automatically evaluate speech naturalness.
1. The corpus includes both regular and expressive samples, but the evaluation focuses solely on naturalness. Would it be possible to also consider expressiveness as an evaluation dimension, given that it is explicitly represented in the data? 2. It is not clear why “tie” annotations are excluded from both the evaluation subset and the GRM training data. It might also be valuable for the model to recognize when two samples are of similar naturalness (i.e., no perceptible difference), as this cou
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Language Development and Disorders
