VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Heeseung Kim; Sang-gil Lee; Jiheum Yeom; Che Hyun Lee; Sungwon Kim; Sungroh Yoon

arXiv:2408.14739·cs.SD·August 29, 2024

Open Access

TL;DR

VoiceTailor is a parameter-efficient, speaker-adaptive TTS system that plugs a personalized adapter into a pre-trained diffusion-based TTS model and adapts to an individual speaker by fine-tuning only 0.25% of the total parameters.

Contribution

The paper introduces an adapter-based approach to speaker adaptation in diffusion TTS models: it identifies the pivotal modules through a weight change ratio analysis and applies Low-Rank Adaptation (LoRA) to those modules for efficient personalization.
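To make the idea of a weight change ratio analysis concrete, here is a minimal sketch in plain Python. The definition used, the Frobenius norm of the weight difference divided by the Frobenius norm of the pre-trained weight, is an assumption for illustration; this page does not state the paper's exact formula. The module names and toy weights are hypothetical as well.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def weight_change_ratio(pretrained, finetuned):
    """ASSUMED definition: ||W_ft - W_pre||_F / ||W_pre||_F."""
    delta = [
        [f - p for p, f in zip(p_row, f_row)]
        for p_row, f_row in zip(pretrained, finetuned)
    ]
    return frobenius_norm(delta) / frobenius_norm(pretrained)


def rank_modules(pretrained_weights, finetuned_weights):
    """Sort module names by how much their weights moved, largest first."""
    ratios = {
        name: weight_change_ratio(pretrained_weights[name], finetuned_weights[name])
        for name in pretrained_weights
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy weights before and after full fine-tuning on one speaker.
pretrained = {
    "module_a": [[1.0, 0.0], [0.0, 1.0]],
    "module_b": [[2.0, 0.0], [0.0, 2.0]],
}
finetuned = {
    "module_a": [[1.5, 0.0], [0.0, 1.5]],
    "module_b": [[2.1, 0.0], [0.0, 2.1]],
}

ranking = rank_modules(pretrained, finetuned)
for name, ratio in ranking:
    print(f"{name}: {ratio:.3f}")
```

Under this assumed definition, modules near the top of the ranking would be the natural candidates for receiving an adapter, while the rest stay frozen.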

Findings

1. Achieves speaker adaptation performance comparable to existing adaptive TTS models while fine-tuning only 0.25% of the total parameters.

2. Adapts robustly to a wide range of real-world speakers.

3. Uses guidance techniques to strengthen the transfer of speaker information.

Abstract

We propose VoiceTailor, a parameter-efficient speaker-adaptive text-to-speech (TTS) system, by equipping a pre-trained diffusion-based TTS model with a personalized adapter. VoiceTailor identifies pivotal modules that benefit from the adapter based on a weight change ratio analysis. We utilize Low-Rank Adaptation (LoRA) as a parameter-efficient adaptation method and incorporate the adapter into pivotal modules of the pre-trained diffusion decoder. To achieve powerful adaptation performance with few parameters, we explore various guidance techniques for speaker adaptation and investigate the best strategies to strengthen speaker information. VoiceTailor demonstrates comparable speaker adaptation performance to existing adaptive TTS models by fine-tuning only 0.25% of the total parameters. VoiceTailor shows strong robustness when adapting to a wide range of real-world speakers, as shown…
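The LoRA mechanism named in the abstract can be illustrated with a small, dependency-free sketch. It shows the standard LoRA formulation, in which a frozen weight W is augmented by a scaled low-rank product, W + (alpha / r) * B * A, and only A and B are trained. The class name `LoRALinear`, the layer size, the rank, and the scaling value are all assumptions chosen for illustration; they are not VoiceTailor's actual configuration, and the toy layer does not reproduce the paper's 0.25% figure.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


class LoRALinear:
    """Frozen linear weight plus a trainable low-rank update (hypothetical helper)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        self.weight = weight  # frozen pre-trained weight, shape d_out x d_in
        self.rank = rank
        self.scale = alpha / rank
        d_out, d_in = len(weight), len(weight[0])
        # A gets a small random init and B starts at zero, so the adapted
        # layer initially behaves exactly like the pre-trained one.
        self.lora_a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        update = matmul(self.lora_b, self.lora_a)
        return [
            [w + self.scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(self.weight, update)
        ]

    def forward(self, x):
        """Apply the adapted weight to an input vector x."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_fraction(self):
        """Adapter parameters divided by all parameters in this layer."""
        d_out, d_in = len(self.weight), len(self.weight[0])
        adapter = self.rank * (d_in + d_out)
        return adapter / (d_out * d_in + adapter)


# Toy 4x4 identity layer with a rank-1 adapter (illustrative sizes only).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=1.0)
x = [1.0, 2.0, 3.0, 4.0]
y = layer.forward(x)
print(y, layer.trainable_fraction())
```

In a full model the trainable fraction falls much further than in this toy layer, because the adapter is attached to only a few modules while every other parameter stays frozen.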

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems