Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification

Yiyang Zhao; Shuai Wang; Guangzhi Sun; Zehua Chen; Chao Zhang; Mingxing Xu; Thomas Fang Zheng

arXiv:2506.14226·cs.SD·June 18, 2025

Open Access

TL;DR

This paper investigates zero-shot text-to-speech systems as a source of test-time data augmentation for short-utterance speaker verification, reporting 10%-16% relative EER reductions without retraining any existing systems.

Contribution

It is the first study to evaluate zero-shot TTS for test-time data augmentation in speaker verification, showing practical benefits and limitations.

Findings

01. 10%-16% relative EER reduction across durations
02. Synthetic speech benefits are more pronounced with shorter real speech
03. Longer synthetic speech does not always improve verification accuracy
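
For readers less familiar with the metric, the following is a minimal sketch of how an equal error rate and a relative reduction can be computed from trial scores. The function names and the example numbers are hypothetical illustrations, not taken from the paper, and the threshold sweep is a simple approximation rather than the authors' evaluation code.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER: the point where false-accept and false-reject rates meet."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejects: same-speaker trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False accepts: different-speaker trials scoring at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.16 means a 16% relative improvement."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical scores, for illustration only.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
# Hypothetical EERs (in percent): a drop from 5.0 to 4.2 is a 16% relative reduction.
rel = relative_reduction(5.0, 4.2)
```

A relative reduction is measured against the baseline EER, so the same absolute gain counts for more when the baseline error is already low.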

Abstract

Short-utterance speaker verification presents significant challenges due to the limited information in brief speech segments, which can undermine accuracy and reliability. Recently, zero-shot text-to-speech (ZS-TTS) systems have made considerable progress in preserving speaker identity. In this study, we explore, for the first time, the use of ZS-TTS systems for test-time data augmentation for speaker verification. We evaluate three state-of-the-art pre-trained ZS-TTS systems, NaturalSpeech 3, CosyVoice, and MaskGCT, on the VoxCeleb1 dataset. Our experimental results show that combining real and synthetic speech samples leads to 10%-16% relative equal error rate (EER) reductions across all durations, with particularly notable improvements for short utterances, all without retraining any existing systems. However, our analysis reveals that longer synthetic speech does not yield the same…
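
The abstract does not say how real and synthetic samples are combined. One plausible reading, sketched here purely as an assumption, is to average speaker embeddings extracted from the real utterance and from its ZS-TTS-generated counterparts before cosine scoring. The names `fuse_embeddings`, `cosine_score`, and the `real_weight` parameter are hypothetical, and the embeddings are assumed to come from some pre-trained speaker encoder that is not shown.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def fuse_embeddings(real_emb, synthetic_embs, real_weight=0.5):
    """Combine one real embedding with several synthetic ones.

    ASSUMPTION: weighted averaging of unit-length embeddings. The source
    does not confirm this is the fusion rule used in the paper.
    """
    real = l2_normalize(real_emb)
    # Average the synthetic embeddings (shape: [num_synthetic, dim]).
    synth = l2_normalize(synthetic_embs).mean(axis=0)
    fused = real_weight * real + (1.0 - real_weight) * synth
    return l2_normalize(fused)


def cosine_score(a, b):
    """Cosine similarity between two embeddings, used as the verification score."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))


# Toy 2-D vectors standing in for speaker embeddings.
fused = fuse_embeddings([1.0, 0.0], [[0.0, 1.0]], real_weight=0.5)
score = cosine_score(fused, [1.0, 1.0])
```

Because only the embeddings are combined, a scheme like this leaves both the speaker encoder and the TTS model untouched, which is consistent with the abstract's claim that no existing system is retrained.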

Taxonomy

Topics: Speech Recognition and Synthesis