VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Jiheum Yeom; Heeseung Kim; Jooyoung Choi; Che Hyun Lee; Nohil Park; Sungroh Yoon

arXiv:2409.15759·cs.SD·December 24, 2024

TL;DR

VoiceGuider is a novel parameter-efficient TTS system that uses autoguidance to significantly improve out-of-domain speaker adaptation, narrowing the performance gap with full-finetuned models.

Contribution

It introduces autoguidance into parameter-efficient speaker adaptation for TTS, achieving robust out-of-domain performance improvements.

Findings

01. Enhanced out-of-domain speaker adaptation performance

02. Autoguidance strategy outperforms other methods

03. Robust adaptation on extreme out-of-domain data

Abstract

When parameter-efficient finetuning via LoRA is applied to speaker-adaptive text-to-speech models, adaptation performance may decline compared to full-finetuned counterparts, especially for out-of-domain speakers. Here, we propose VoiceGuider, a parameter-efficient speaker-adaptive text-to-speech system reinforced with autoguidance to enhance speaker adaptation performance, reducing the gap with full-finetuned models. We carefully explore various ways of strengthening autoguidance, ultimately finding the optimal strategy. As a result, VoiceGuider shows robust adaptation performance, especially on extreme out-of-domain speech data. We provide audio samples on our demo page.
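For readers unfamiliar with autoguidance, the sketch below illustrates the general idea in its usual form: a main model's prediction is pushed further away from the prediction of a weaker version of the same model. This is only an illustration under stated assumptions, not the VoiceGuider implementation. The abstract does not say which model plays the weaker role or which guidance scale is used, so the function name `autoguide`, the `scale` parameter, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def autoguide(
    main_pred: Sequence[float],
    weak_pred: Sequence[float],
    scale: float,
) -> List[float]:
    """Extrapolate from the weak prediction through the main prediction.

    Hypothetical helper for illustration only.
    scale == 1.0 returns the main prediction unchanged;
    scale > 1.0 moves further away from the weak prediction.
    """
    if len(main_pred) != len(weak_pred):
        raise ValueError("predictions must have the same length")
    # Element-wise: weak + scale * (main - weak)
    return [w + scale * (m - w) for m, w in zip(main_pred, weak_pred)]


if __name__ == "__main__":
    # Toy values standing in for per-element model outputs (assumed, not from the paper).
    main = [1.0, 2.0, -0.5]
    weak = [0.5, 2.5, -0.5]
    print(autoguide(main, weak, 1.0))  # same as the main prediction
    print(autoguide(main, weak, 2.0))  # pushed away from the weak prediction
```

With a scale of 1.0 the guided output equals the main model's output, so the scale acts as a single knob for how strongly the weaker model's errors are extrapolated away.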

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing