When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS
Anupam Purwar, Aditya Choudhary

TL;DR
This paper investigates how data diversity and mixed training influence the effectiveness of fine-tuning large language models for neural text-to-speech, demonstrating that LoRA fine-tuning enhances voice quality and speaker fidelity.
Contribution
It shows that LoRA fine-tuning improves speech quality and speaker adaptation in LLM-based TTS systems, especially with diverse training data, outperforming the base model.
Findings
LoRA fine-tuning increases DNS-MOS scores by up to 0.42 points.
Speaker fidelity improves for all evaluated speakers, with consistent gains in voice similarity.
Signal-to-noise ratio improves by up to 34% with diverse data.
Abstract
Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments with fine-tuning the language model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity…
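For readers unfamiliar with LoRA, the following is a minimal sketch of the general low-rank update it applies to a frozen weight matrix. It is not the authors' training code: the matrix sizes, rank `r`, scaling `alpha`, and helper names are hypothetical toy choices, and the abstract does not state which layers or hyperparameters were used.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight after LoRA adaptation."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Hypothetical toy dimensions; real backbone layers are far larger.
d_out, d_in, r, alpha = 4, 6, 2, 4
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init

# With B initialised to zero, the adapted layer starts identical to the base layer.
W_init = lora_effective_weight(W, A, B, alpha, r)

# Only A and B are trained, so the trainable parameter count scales with r.
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
```

Because only `A` and `B` receive gradients, the base model's weights stay untouched, and the trainable parameter count grows with `r * (d_in + d_out)` rather than `d_in * d_out`; the saving becomes substantial when `r` is much smaller than either layer dimension.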
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research
