LoRP-TTS: Low-Rank Personalized Text-To-Speech

Łukasz Bondaruk; Jakub Kubiak

arXiv:2502.07562·cs.SD·February 12, 2025

Open Access

TL;DR

LoRP-TTS leverages Low-Rank Adaptation to improve zero-shot personalized speech synthesis, enabling high-quality imitation from minimal and noisy spontaneous speech samples, thus advancing diversity in speech corpora.

Contribution

This work introduces the use of Low-Rank Adaptation in TTS to effectively utilize single noisy recordings for speaker adaptation, enhancing realism and diversity.

Findings

01. Speaker similarity improved by up to 30 percentage points

02. Effective use of noisy, spontaneous speech samples as prompts

03. Advances in creating diverse speech datasets

Abstract

Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advances have led to zero-shot systems that generate realistic speech for a wide range of speakers, using recordings of their voices as additional prompts. However, these systems still struggle to imitate non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30 percentage points (pp) while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial to all speech-related tasks.
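The abstract names LoRA but does not give its mechanics, so the following is a minimal sketch of the general LoRA idea rather than the authors' implementation. A frozen weight matrix W is augmented with a trainable low-rank update, giving an effective weight W + (alpha / r) · B · A, where only the small factors A and B are trained. The function names, matrix sizes, rank, and alpha below are illustrative assumptions; the abstract does not state which layers are adapted or which hyperparameters are used.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_init(d_out, d_in, rank, seed=0):
    """Create LoRA factors (hypothetical helper).

    A (rank x d_in) gets small random values and B (d_out x rank) starts at
    zero, so the initial update B @ A is zero and the adapted model starts
    out identical to the frozen base model.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / rank) * B @ A without modifying the frozen W."""
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def trainable_params(d_out, d_in, rank):
    """Return (LoRA parameter count, full-matrix parameter count)."""
    return rank * (d_in + d_out), d_out * d_in


if __name__ == "__main__":
    # Illustrative sizes only; not taken from the paper.
    lora_count, full_count = trainable_params(d_out=1024, d_in=1024, rank=8)
    print(f"LoRA trains {lora_count} of {full_count} weights "
          f"({100 * lora_count / full_count:.2f}%)")
```

Because only A and B are updated, the number of trained parameters per adapted matrix grows with the rank rather than with the full matrix size, which is the usual argument for why LoRA suits adaptation from very little data, such as a single recording.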

Taxonomy

Topics: Speech Recognition and Synthesis