Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning
Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Ye Ni, Yuheng Lu, Xiaobao Wang, Engsiong Chng, Xie Chen, Longbiao Wang, Jianwu Dang

TL;DR
This paper introduces CSP-FT, a partial fine-tuning method for LLM-based TTS that adapts a model to new emotions and speakers by tuning only two selected layers, improving training efficiency, cloning fidelity, and robustness to catastrophic forgetting.
Contribution
The paper proposes a characteristic-specific partial fine-tuning strategy that updates only the two layers capturing the most and the least emotion and speaker information, improving adaptation while training far fewer parameters.
Findings
CSP-FT matches or exceeds full fine-tuning in fidelity and intelligibility.
CSP-FT updates only about 8% of parameters, speeding up training.
CSP-FT significantly reduces catastrophic forgetting.
Abstract
While LLM-based TTS models exhibit zero-shot emotion and speaker cloning, their cloning fidelity and pronunciation clarity degrade on unseen domains. Fine-tuning is essential for adaptation, yet uniform approaches overlook specific parameter contributions. Uniform tuning on limited data causes slow training and catastrophic forgetting, leading to degraded pronunciation accuracy. To address this, we propose CSP-FT, a characteristic-specific partial fine-tuning strategy. By dynamically analyzing layer contributions via a weighted sum, we selectively fine-tune only the two layers capturing the most and least emotion and speaker information, maximizing the utility of the former while explicitly strengthening the capacity of the latter. Experiments on a combined corpus of 11 datasets show CSP-FT matches or exceeds the fidelity and intelligibility of full fine-tuning while updating only ~8%…
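The layer-selection idea in the abstract can be illustrated with a minimal sketch. It assumes that each layer has one learned scalar score in a weighted sum over layer outputs, and that the scores are normalized with a softmax. The abstract does not say how the weights are obtained or normalized, so the function names, the example scores, and the per-layer parameter counts below are all hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def select_layers(layer_scores):
    """Return (most, least, weights): indices of the layers with the
    highest and lowest weight in the weighted sum, plus the weights."""
    weights = softmax(layer_scores)
    indices = range(len(weights))
    most = max(indices, key=weights.__getitem__)
    least = min(indices, key=weights.__getitem__)
    return most, least, weights


def trainable_mask(num_layers, selected):
    """True for layers that would be fine-tuned, False for frozen layers."""
    return [i in selected for i in range(num_layers)]


def trainable_fraction(params_per_layer, mask, other_params=0):
    """Share of all parameters that the mask leaves trainable.
    other_params counts frozen non-layer parameters (e.g. embeddings)."""
    trainable = sum(p for p, keep in zip(params_per_layer, mask) if keep)
    return trainable / (sum(params_per_layer) + other_params)


# Hypothetical scores for a toy 4-layer model (not values from the paper).
scores = [0.2, 1.5, -0.7, 0.1]
most, least, weights = select_layers(scores)
mask = trainable_mask(len(scores), {most, least})
fraction = trainable_fraction([100, 100, 100, 100], mask)
```

In a real model the mask would decide which layers keep gradients enabled. The trainable fraction depends on how many layers the model has and how large they are, so the toy value here says nothing about the roughly 8% reported in the paper.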
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Multi-Head Attention
