Iterate to Differentiate: Enhancing Discriminability and Reliability in Zero-Shot TTS Evaluation

Shengfan Shen; Di Wu; Xingchen Song; Dinghao Zhou; Liumeng Xue; Meng Meng; Jian Luan; Shuai Wang

arXiv:2603.24430 · cs.SD · March 26, 2026

TL;DR

This paper introduces I2D, an iterative evaluation framework for zero-shot TTS. It measures how well each model withstands repeated synthesis from its own outputs, which makes evaluation more discriminative and reliable and brings objective scores closer to human judgments.

Contribution

The paper proposes an iterative evaluation method for zero-shot TTS that amplifies performance differences between systems and yields scores that correlate more closely with human preferences.

Findings

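01. I2D increases the system-level Spearman rank correlation coefficient (SRCC) with human judgments from 0.118 to 0.464 for UTMOSv2.

02. I2D improves evaluation consistency across languages and datasets (Chinese, English, and emotion datasets).

03. I2D better distinguishes high-quality TTS models by analyzing their robustness to iterative synthesis.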

Abstract

Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an evaluation framework that recursively synthesizes speech using the model's own outputs as references. Higher-quality models exhibit greater resilience to the distributional shift induced by iterative synthesis, resulting in slower performance degradation. I2D exploits this differential degradation to amplify performance gaps and reveal robustness. By aggregating objective metrics across iterations, I2D improves discriminability and alignment with human judgments, increasing system-level SRCC from 0.118 to 0.464 for UTMOSv2. Experiments on 11 models across Chinese, English, and emotion datasets demonstrate that…
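The recursive procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `synthesize` and `metric` callables, the function names, the iteration count, and the use of a plain mean as the aggregation are all assumptions made for the example, and the toy "TTS models" simply multiply a numeric quality value by a retention factor.

```python
from statistics import mean
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   synthesize(text, reference) -> audio : one zero-shot TTS call
#   metric(audio) -> float               : an objective score, higher is better
Synthesize = Callable[[str, float], float]
Metric = Callable[[float], float]


def i2d_scores(synthesize: Synthesize, metric: Metric, text: str,
               reference: float, num_iters: int = 5) -> List[float]:
    """Score each round of recursive synthesis.

    Every output is fed back as the reference for the next round,
    so errors accumulate across iterations.
    """
    scores = []
    for _ in range(num_iters):
        audio = synthesize(text, reference)
        scores.append(metric(audio))
        reference = audio  # the model's own output becomes the next prompt
    return scores


def aggregate(scores: List[float]) -> float:
    """Assumed aggregation: a plain mean over iterations."""
    return mean(scores)


def make_toy_tts(retention: float) -> Synthesize:
    """Toy stand-in for a TTS model: 'audio' is just a quality number
    that shrinks by a fixed retention factor on every synthesis pass."""
    return lambda text, reference: reference * retention


identity_metric: Metric = lambda audio: audio  # toy metric: report quality as-is

strong = i2d_scores(make_toy_tts(0.98), identity_metric, "hello", 1.0)
weak = i2d_scores(make_toy_tts(0.90), identity_metric, "hello", 1.0)

single_pass_gap = strong[0] - weak[0]
aggregated_gap = aggregate(strong) - aggregate(weak)
print(f"single-pass gap: {single_pass_gap:.3f}")
print(f"aggregated gap:  {aggregated_gap:.3f}")
```

In this toy setting the gap between the two systems is larger after aggregating over iterations than after a single pass, which mirrors the differential degradation the abstract describes. With real models, `synthesize` would wrap an actual zero-shot TTS call and `metric` an objective score such as UTMOSv2.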

Taxonomy

Topics: Speech Recognition and Synthesis · Hate Speech and Cyberbullying Detection · Topic Modeling