ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation
Ruibo Fu, Xin Qi, Zhengqi Wen, Jianhua Tao, Tao Wang, Chunyu Qiang,, Zhiyong Wang, Yi Lu, Xiaopeng Wang, Shuchen Shi, Yukun Liu, Xuefei Liu, Shuai, Zhang

TL;DR
This paper introduces ASRRL, a reinforcement learning-based method for improving speaker adaptation in text-to-speech systems, effectively enhancing speaker similarity and speech quality with limited reference data.
Contribution
It is the first to apply reinforcement learning to optimize speaker embeddings in TTS adaptation, introducing two action strategies for different reference speech scenarios.
Findings
Outperforms traditional fine-tuning in speaker similarity
Achieves higher speech quality and intelligibility
Demonstrates strong generalization on LibriTTS and VCTK datasets
Abstract
Speaker adaptation, which involves cloning voices from unseen speakers in the Text-to-Speech task, has garnered significant interest due to its numerous applications in multi-media fields. Despite recent advancements, existing methods often struggle with inadequate speaker representation accuracy and overfitting, particularly in limited reference speeches scenarios. To address these challenges, we propose an Agile Speaker Representation Reinforcement Learning strategy to enhance speaker similarity in speaker adaptation tasks. ASRRL is the first work to apply reinforcement learning to improve the modeling accuracy of speaker embeddings in speaker adaptation, addressing the challenge of decoupling voice content and timbre. Our approach introduces two action strategies tailored to different reference speeches scenarios. In the single-sentence scenario, a knowledge-oriented optimal routine…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
