ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation

Ruibo Fu, Xin Qi, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang, Zhiyong Wang, Yi Lu, Xiaopeng Wang, Shuchen Shi, Yukun Liu, Xuefei Liu, Shuai Zhang

arXiv:2407.05421 · eess.AS · July 9, 2024

Open Access

TL;DR

This paper introduces ASRRL, a reinforcement learning-based method for improving speaker adaptation in text-to-speech systems, effectively enhancing speaker similarity and speech quality with limited reference data.

Contribution

ASRRL is the first work to apply reinforcement learning to optimize speaker embeddings for TTS speaker adaptation, and it introduces two action strategies tailored to different reference-speech scenarios.

Findings

01. Outperforms traditional fine-tuning in speaker similarity
02. Achieves higher speech quality and intelligibility
03. Demonstrates strong generalization on LibriTTS and VCTK datasets

Abstract

Speaker adaptation, which involves cloning voices from unseen speakers in the Text-to-Speech task, has garnered significant interest due to its numerous applications in multimedia fields. Despite recent advancements, existing methods often struggle with inadequate speaker representation accuracy and overfitting, particularly when reference speech is limited. To address these challenges, we propose an Agile Speaker Representation Reinforcement Learning (ASRRL) strategy to enhance speaker similarity in speaker adaptation tasks. ASRRL is the first work to apply reinforcement learning to improve the modeling accuracy of speaker embeddings in speaker adaptation, addressing the challenge of decoupling voice content and timbre. Our approach introduces two action strategies tailored to different reference-speech scenarios. In the single-sentence scenario, a knowledge-oriented optimal routine…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing