Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator

Hualei Wang; Na Li; Chuke Wang; Shu Wu; Zhifeng Li; Dong Yu

arXiv:2510.20210 · cs.SD · October 24, 2025

TL;DR

Vox-Evaluator is a multi-level assessment tool that improves zero-shot TTS stability and fidelity by identifying errors, guiding corrections, and aligning preferences, leading to more natural and accurate speech synthesis.

Contribution

This paper introduces Vox-Evaluator, a novel multi-level evaluator that detects errors, guides speech correction, and improves preference alignment in zero-shot TTS systems.

Findings

1. Vox-Evaluator effectively identifies erroneous speech segments.
2. The correction mechanism improves speech stability and quality.
3. Preference alignment reduces synthesis errors.

Abstract

Recent advances in zero-shot text-to-speech (TTS), driven by language models, diffusion models and masked generation, have achieved impressive naturalness in speech synthesis. Nevertheless, stability and fidelity remain key challenges, manifesting as mispronunciations, audible noise, and quality degradation. To address these issues, we introduce Vox-Evaluator, a multi-level evaluator designed to guide the correction of erroneous speech segments and preference alignment for TTS systems. It is capable of identifying the temporal boundaries of erroneous segments and providing a holistic quality assessment of the generated speech. Specifically, to refine erroneous segments and enhance the robustness of the zero-shot TTS model, we propose to automatically identify acoustic errors with the evaluator, mask the erroneous segments, and finally regenerate speech conditioning on the correct…
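The identify-and-mask steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the evaluator is assumed to emit one error score per frame, and the names `find_error_segments` and `build_mask` and the 0.5 threshold are hypothetical. The regeneration step is omitted because this page does not describe the TTS model's interface.

```python
from typing import List, Tuple


def find_error_segments(
    frame_scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Group consecutive frames scoring above `threshold` into segments.

    Returns half-open (start_frame, end_frame) index pairs. The per-frame
    score format and the threshold are assumptions, not the paper's design.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current erroneous run began, if any
    for i, score in enumerate(frame_scores):
        if score > threshold and start is None:
            start = i  # a new erroneous run begins
        elif score <= threshold and start is not None:
            segments.append((start, i))  # the run ended at the previous frame
            start = None
    if start is not None:
        # The final run extends to the end of the utterance.
        segments.append((start, len(frame_scores)))
    return segments


def build_mask(num_frames: int, segments: List[Tuple[int, int]]) -> List[bool]:
    """Return a per-frame mask that is True where speech should be regenerated."""
    mask = [False] * num_frames
    for start, end in segments:
        for i in range(start, min(end, num_frames)):
            mask[i] = True
    return mask


# Hypothetical per-frame error scores for a five-frame utterance.
scores = [0.1, 0.9, 0.8, 0.2, 0.7]
segments = find_error_segments(scores)
mask = build_mask(len(scores), segments)
```

In a full pipeline, the masked frames would then be passed back to the TTS model for regeneration, with the unmasked frames kept as context. Frame indices can be converted to timestamps by multiplying by the frame shift of whatever acoustic front end is in use.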

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing