Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation
Shengfan Shen, Di Wu, Xingchen Song, Dinghao Zhou, Liumeng Xue, Meng Meng, Jian Luan, Shuai Wang

TL;DR
This paper introduces I2D, an iterative evaluation framework for zero-shot TTS. It improves discriminability and reliability by measuring how well each model withstands iterative synthesis, and its scores align more closely with human judgments.
Contribution
The paper proposes a novel iterative evaluation method that improves zero-shot TTS assessment by amplifying performance differences and correlating more closely with human preferences.
Findings
I2D increases system-level SRCC with human judgments from 0.118 to 0.464 for UTMOSv2.
I2D improves evaluation consistency across multiple languages and datasets.
I2D better distinguishes high-quality TTS models through iterative robustness analysis.
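For readers unfamiliar with the metric, SRCC is the Spearman rank correlation coefficient: the Pearson correlation computed on ranks rather than raw values. The sketch below is a minimal pure-Python version for illustration only; the function names `average_ranks` and `spearman` are hypothetical and are not taken from the paper.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

At the system level, `x` would hold one objective score per TTS system and `y` the corresponding human rating, so a value near 1 means the metric ranks systems in the same order as listeners do.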
Abstract
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that…
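The recursive procedure described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `synthesize` and `metric` are hypothetical callables standing in for a zero-shot TTS model and an objective metric, the number of iterations is arbitrary, and averaging across iterations is only one possible aggregation, since the summary does not specify which one the paper uses.

```python
from statistics import mean
from typing import Callable, List, TypeVar

Audio = TypeVar("Audio")


def iterate_scores(
    synthesize: Callable[[Audio, str], Audio],  # hypothetical: (reference, text) -> audio
    metric: Callable[[Audio], float],           # hypothetical: audio -> objective score
    prompt: Audio,
    text: str,
    num_iterations: int = 3,                    # assumption: iteration count is arbitrary
) -> List[float]:
    """Score each round of synthesis, feeding every output back as the next reference."""
    scores = []
    reference = prompt
    for _ in range(num_iterations):
        output = synthesize(reference, text)
        scores.append(metric(output))
        reference = output  # the model's own output becomes the next reference
    return scores


def aggregate(scores: List[float]) -> float:
    """Assumed aggregation: a plain mean over iterations."""
    return mean(scores)


if __name__ == "__main__":
    # Toy stand-ins: "audio" is a single quality number and each synthesis
    # round keeps a fixed fraction of it. These values are made up.
    def robust(ref: float, text: str) -> float:
        return ref * 0.9

    def fragile(ref: float, text: str) -> float:
        return ref * 0.5

    def identity(audio: float) -> float:
        return audio

    for name, model in [("robust", robust), ("fragile", fragile)]:
        per_round = iterate_scores(model, identity, 1.0, "hello", 3)
        print(name, per_round, aggregate(per_round))
```

In this toy setup the two stand-in models differ modestly after one round, and the difference widens with each further round, which is the kind of differential degradation the framework relies on.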
Taxonomy
Topics: Speech Recognition and Synthesis · Hate Speech and Cyberbullying Detection · Topic Modeling
