Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting

Wooseok Han; Minki Kang; Changhun Kim; Eunho Yang

arXiv:2412.20155·cs.SD·December 31, 2024

Open Access

TL;DR

Stable-TTS is a novel speaker-adaptive TTS framework that maintains prosody consistency and speaker identity using high-quality prior samples and a prior-preservation loss, effective even with limited or noisy target data.

Contribution

The paper combines prior samples with a prior-preservation loss to improve the stability and quality of speaker-adaptive TTS.

Findings

01 Effective with limited target speech samples

02 Maintains prosody and speaker identity

03 Robust to noisy target data

Abstract

Speaker-adaptive Text-to-Speech (TTS) synthesis has attracted considerable attention due to its broad range of applications, such as personalized voice assistant services. While several approaches have been proposed, they often exhibit high sensitivity to either the quantity or the quality of target speech samples. To address these limitations, we introduce Stable-TTS, a novel speaker-adaptive TTS framework that leverages a small subset of a high-quality pre-training dataset, referred to as prior samples. Specifically, Stable-TTS achieves prosody consistency by leveraging the high-quality prosody of prior samples, while effectively capturing the timbre of the target speaker. Additionally, it employs a prior-preservation loss during fine-tuning to maintain the synthesis ability for prior samples to prevent overfitting on target samples. Extensive experiments demonstrate the effectiveness…
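The abstract says that a prior-preservation loss is applied during fine-tuning so the model keeps its ability to synthesize prior samples, but the text available here does not give the formula. The sketch below shows one common way such an objective can be written, as a weighted sum of a loss on target samples and the same loss on prior samples. The weighted-sum form, the `prior_weight` parameter, and both function names are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Sequence


def mean_loss(loss_fn: Callable[[float], float], batch: Sequence[float]) -> float:
    """Average a per-sample loss over a non-empty batch."""
    return sum(loss_fn(sample) for sample in batch) / len(batch)


def fine_tuning_objective(
    loss_fn: Callable[[float], float],
    target_batch: Sequence[float],
    prior_batch: Sequence[float],
    prior_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Assumed form: target-sample loss plus a weighted prior-sample loss.

    The second term penalizes the model for losing quality on prior samples,
    which is the stated purpose of a prior-preservation loss.
    """
    target_term = mean_loss(loss_fn, target_batch)
    prior_term = mean_loss(loss_fn, prior_batch)
    return target_term + prior_weight * prior_term


# Toy usage: scalars stand in for per-sample reconstruction errors.
objective = fine_tuning_objective(lambda err: err * err, [1.0, 3.0], [2.0], prior_weight=0.5)
```

In a real system `loss_fn` would be the TTS training loss evaluated on an utterance, and the batches would hold target-speaker and prior utterances. Setting `prior_weight` to zero recovers plain fine-tuning on target samples alone.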

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need