Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation

Changjin Han; Seokgi Lee; Gyuhyeon Nam; Gyeongsu Chae

arXiv:2409.09311·eess.AS·January 14, 2025

Open Access

TL;DR

This paper introduces StableForm-TTS, a diffusion-based zero-shot speech synthesis framework that enhances pronunciation stability and naturalness by integrating source-filter theory, addressing mispronunciation issues in existing models.

Contribution

The paper pioneers the use of source-filter theory in diffusion TTS to improve pronunciation robustness and introduces a novel architecture for stable formant generation.

Findings

1. Outperforms the state-of-the-art method in pronunciation accuracy and naturalness
2. Maintains speaker similarity comparable to existing methods
3. Scales effectively with increased data and model sizes

Abstract

Diffusion models have achieved remarkable success in text-to-speech (TTS), even in zero-shot scenarios. Recent efforts aim to address the trade-off between inference speed and sound quality, often considered the primary drawback of diffusion models. However, we find a critical mispronunciation issue is being overlooked. Our preliminary study reveals the unstable pronunciation resulting from the diffusion process. Based on this observation, we introduce StableForm-TTS, a novel zero-shot speech synthesis framework designed to produce robust pronunciation while maintaining the advantages of diffusion modeling. By pioneering the adoption of source-filter theory in diffusion TTS, we propose an elaborate architecture for stable formant generation. Experimental results on unseen speakers show that our model outperforms the state-of-the-art method in terms of pronunciation accuracy and…
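The abstract refers to source-filter theory without describing it. As background only, the classical idea can be sketched as follows: a periodic excitation (the source, which sets the pitch) is passed through a cascade of resonators (the filter, whose resonance peaks are the formants). This is not the StableForm-TTS architecture, which the text above does not specify. The function names, the Klatt-style two-pole resonator, and the formant values are all assumptions chosen for illustration.

```python
import math


def impulse_train(f0_hz, duration_s, sample_rate):
    """Source: one unit impulse per pitch period (a crude glottal excitation)."""
    n_samples = int(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter: a two-pole resonator modelling a single formant.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2],
    with coefficients chosen so that the gain at DC is 1.
    """
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, duration_s=0.5, sample_rate=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (assumed) formant frequencies and bandwidths in Hz.
EXAMPLE_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

# Same filter, different source: the pitch changes, the formants do not.
low_pitch = synthesize_vowel(110.0, EXAMPLE_FORMANTS)
high_pitch = synthesize_vowel(220.0, EXAMPLE_FORMANTS)
```

The point of the decomposition is that pitch and formant structure are controlled separately: changing `f0_hz` alters only the excitation, while the resonator coefficients, which determine the formants, stay fixed.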

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Diffusion