TTSDS2: Resources and Benchmark for Evaluating Human-Quality Text to Speech Systems
Christoph Minixhofer, Ondrej Klejch, Peter Bell

TL;DR
This paper introduces TTSDS2, a new benchmark and resources for evaluating TTS systems, demonstrating improved correlation with subjective quality assessments across multiple languages and domains.
Contribution
The paper presents TTSDS2, an enhanced evaluation metric and comprehensive resources, including datasets and benchmarks, for assessing the quality of TTS systems more reliably.
Findings
TTSDS2 correlates highly with subjective scores across all tested domains.
It outperforms 15 other metrics in robustness and reliability.
Resources include a large dataset and a multilingual benchmark for ongoing evaluation.
Abstract
Evaluation of Text to Speech (TTS) systems is challenging and resource-intensive. Subjective metrics such as Mean Opinion Score (MOS) are not easily comparable between works. Objective metrics are frequently used, but rarely validated against subjective ones. Both kinds of metrics are challenged by recent TTS systems capable of producing synthetic speech indistinguishable from real speech. In this work, we introduce Text to Speech Distribution Score 2 (TTSDS2), a more robust and improved version of TTSDS. Across a range of domains and languages, it is the only one out of 16 compared metrics to correlate with a Spearman correlation above 0.50 for every domain and subjective score evaluated. We also release a range of resources for evaluating synthetic speech close to real speech: A dataset with over 11,000 subjective opinion score ratings; a pipeline for continually recreating a…
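The abstract's headline criterion is a Spearman correlation above 0.50 between a metric and subjective scores. As a minimal sketch of how such a rank correlation is computed, the following uses only the Python standard library. The function names and the per-system scores are hypothetical, made up for illustration, and not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five TTS systems (illustrative numbers only).
metric_scores = [91.0, 88.5, 84.0, 80.2, 75.9]  # objective metric
mos_scores = [4.4, 4.5, 4.0, 3.6, 3.1]          # subjective ratings

rho = spearman(metric_scores, mos_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because only ranks matter, the metric and the subjective score can live on different scales; in this toy data the two top systems swap places, which lowers the correlation from 1.0 to 0.9.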
Peer Reviews
Decision·ICLR 2026 Oral
- Comprehensive comparison against numerous evaluation metrics.
- Evaluation across 20 recent TTS systems and 14 languages.
- Diverse and well-structured test sets covering clean, noisy, wild, and children’s speech domains.
- Offers a promising, scalable solution for objective TTS evaluation.
- Shows clear and consistent alignment with subjective (human) evaluation results.
- The overall methodology closely resembles the original TTSDS, with limited conceptual novelty.
- The paper averages multiple perceptual factor scores into a single TTSDS2 score but provides no justification for using equal weighting.
- The selection criteria for the feature sets within each factor are not clearly explained or motivated.
* The problem is well-motivated: subjective evaluation is expensive and rapidly becoming insufficient as systems approach human quality. The factorized distributional approach is intuitive and gives interpretable sub-scores.
* The empirical evaluation is broad, covering 20 recent models and multiple domains (clean, noisy, wild, children).
* Demonstrating consistent correlation >0.5 with human ratings across all domains is compelling.
* Benchmark and pipeline release are valuable to the commun
* A more principled justification or ablation is warranted, because the choice of feature set appears somewhat tuned for correlation.
* Figure 3 is not referenced anywhere in the text, and its caption is vague.
* The multilingual evaluation lacks validation with human preference tests (e.g., CMOS / MUSHRA), making it difficult to verify whether TTSDS2 maintains reliability beyond English.
- **Comprehensive and Diverse Evaluation:** The paper introduces a test set with human annotations covering four diverse domains: read speech, noisy speech, YouTube speech, and children’s speech. Both the dataset and the evaluation are publicly released, enabling future reproducibility and comparison. The study also benchmarks a wide range of baseline TTS models and compares against multiple existing quality metrics, ensuring a thorough evaluation.
- **Strong Correlation with Human
- **Poor Writing Quality and Structure:** While the main contribution of the paper is clear, the overall writing is repetitive and poorly structured, particularly in Section 1.
- **Questionable Gaussian Assumption in Metric Design:** TTSDS2 assumes that the speech embeddings follow a multivariate Gaussian distribution when computing the Wasserstein distance. However, this assumption may not hold in practice, and the authors provide no empirical justification for it. A mor
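For context on the Gaussian assumption raised in the last review point: when two distributions are assumed to be multivariate Gaussians, the 2-Wasserstein distance has a well-known closed form that depends only on their means and covariances. The block below states that standard result. The symbols are introduced here for illustration, and the excerpt does not confirm that TTSDS2 computes exactly this expression.

```latex
W_2^2\bigl(\mathcal{N}(\mu_1, \Sigma_1),\, \mathcal{N}(\mu_2, \Sigma_2)\bigr)
  = \lVert \mu_1 - \mu_2 \rVert_2^2
  + \operatorname{Tr}\Bigl(\Sigma_1 + \Sigma_2
      - 2\bigl(\Sigma_1^{1/2}\, \Sigma_2\, \Sigma_1^{1/2}\bigr)^{1/2}\Bigr)
```

If the embeddings are far from Gaussian, their means and covariances no longer summarize the distributions, so this expression can misstate the true distance between them.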
Taxonomy
Topics: Speech and dialogue systems
