Efficient Emotion and Speaker Adaptation in LLM-Based TTS via Characteristic-Specific Partial Fine-Tuning

Tianrui Wang; Meng Ge; Cheng Gong; Chunyu Qiang; Haoyu Wang; Zikang Huang; Yu Jiang; Ye Ni; Yuheng Lu; Xiaobao Wang; Engsiong Chng; Xie Chen; Longbiao Wang; Jianwu Dang

arXiv:2501.14273 · eess.AS · March 9, 2026

Open Access

TL;DR

This paper introduces CSP-FT, a selective fine-tuning method for LLM-based TTS that makes emotion and speaker adaptation more efficient, faithful, and robust by tuning only the two layers that carry the most and the least emotion and speaker information.

Contribution

The paper proposes a characteristic-specific partial fine-tuning strategy that updates only the two layers carrying the most and the least emotion and speaker information, improving adaptation while training far fewer parameters.

Findings

01. CSP-FT matches or exceeds full fine-tuning in fidelity and intelligibility.

02. CSP-FT updates only about 8% of parameters, speeding up training.

03. CSP-FT significantly reduces catastrophic forgetting.

Abstract

While LLM-based TTS models exhibit zero-shot emotion and speaker cloning, their cloning fidelity and pronunciation clarity degrade on unseen domains. Fine-tuning is essential for adaptation, yet uniform approaches overlook specific parameter contributions. Uniform tuning on limited data causes slow training and catastrophic forgetting, leading to degraded pronunciation accuracy. To address this, we propose CSP-FT, a characteristic-specific partial fine-tuning strategy. By dynamically analyzing layer contributions via a weighted sum, we selectively fine-tune only the two layers capturing the most and least emotion and speaker information, maximizing the utility of the former while explicitly strengthening the capacity of the latter. Experiments on a combined corpus of 11 datasets show CSP-FT matches or exceeds the fidelity and intelligibility of full fine-tuning while updating only ~8%…
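The layer-selection idea in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes each layer has one learnable scalar score, that the scores are normalized with a softmax to form the weighted sum, and that the resulting weight is read as the layer's contribution. All names (`softmax`, `weighted_sum`, `select_layers`, `trainable_mask`, `trainable_fraction`) and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(layer_outputs, weights):
    """Combine per-layer feature vectors into one vector using the layer weights."""
    dim = len(layer_outputs[0])
    return [
        sum(w * h[d] for w, h in zip(weights, layer_outputs))
        for d in range(dim)
    ]


def select_layers(weights):
    """Return the indices of the highest- and lowest-weighted layers."""
    indices = range(len(weights))
    most = max(indices, key=weights.__getitem__)
    least = min(indices, key=weights.__getitem__)
    return most, least


def trainable_mask(num_layers, selected):
    """True for layers that would be fine-tuned, False for frozen layers."""
    return [i in selected for i in range(num_layers)]


def trainable_fraction(param_counts, mask):
    """Share of parameters that sit in the layers left trainable."""
    trainable = sum(c for c, keep in zip(param_counts, mask) if keep)
    return trainable / sum(param_counts)


# Hypothetical learned scores for a 6-layer model (assumption, not from the paper).
scores = [0.1, 1.3, -0.7, 0.4, 2.0, 0.0]
weights = softmax(scores)
most, least = select_layers(weights)
mask = trainable_mask(len(weights), {most, least})

# Hypothetical 2-dimensional hidden states, one per layer, to show the weighted sum.
hidden = [[float(i), 1.0] for i in range(len(weights))]
pooled = weighted_sum(hidden, weights)
```

In a real model the mask would decide which layers receive gradient updates, and `trainable_fraction` would be computed from the actual per-layer parameter counts rather than illustrative ones.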

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Multi-Head Attention