Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation
Changjin Han, Seokgi Lee, Gyuhyeon Nam, Gyeongsu Chae

TL;DR
This paper introduces StableForm-TTS, a diffusion-based zero-shot speech synthesis framework that enhances pronunciation stability and naturalness by integrating source-filter theory, addressing mispronunciation issues in existing models.
Contribution
The paper pioneers the use of source-filter theory in diffusion TTS to improve pronunciation robustness and introduces a novel architecture for stable formant generation.
Findings
Outperforms the state-of-the-art method in pronunciation accuracy and naturalness
Maintains speaker similarity comparable to existing methods
Scales effectively with increased data and model sizes
Abstract
Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, often considered the primary drawback of diffusion models. However, we find a critical mispronunciation issue is being overlooked. Our preliminary study reveals the unstable pronunciation resulting from the diffusion process. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in terms of pronunciation accuracy and…
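For readers unfamiliar with the source-filter theory mentioned in the abstract, the following is a minimal sketch of the classical idea: a periodic excitation (the source) is shaped by vocal-tract resonances, or formants (the filter). It is not the StableForm-TTS architecture, which this summary does not describe. The function names, the impulse-train source, the two-pole resonator, and the formant frequencies and bandwidths are all illustrative assumptions.

```python
import math


def impulse_train(f0_hz, sample_rate, n_samples):
    """Source: a crude glottal excitation, one unit impulse per pitch period."""
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonator standing in for a single formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius from bandwidth
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle from center frequency
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        # Difference equation: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
        y = x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, sample_rate=16000, duration_s=0.1):
    """Pass the source through a cascade of formant resonators, then peak-normalize."""
    n_samples = int(sample_rate * duration_s)
    signal = impulse_train(f0_hz, sample_rate, n_samples)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]


# Assumed, approximate formants (center Hz, bandwidth Hz) for an /a/-like vowel.
samples = synthesize_vowel(120.0, [(730.0, 90.0), (1090.0, 110.0), (2440.0, 170.0)])
```

In this toy model the pitch (f0_hz) and the formants are controlled independently, which is the separation that source-filter theory provides. Changing only the formant list changes the vowel quality while the pitch stays fixed.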
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
