When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Anupam Purwar; Aditya Choudhary

arXiv:2603.10904·cs.SD·March 12, 2026

TL;DR

This paper investigates how data diversity and mixed training influence the effectiveness of fine-tuning large language models for neural text-to-speech, demonstrating that LoRA fine-tuning enhances voice quality and speaker fidelity.

Contribution

It shows that LoRA fine-tuning improves speech quality and speaker adaptation in LLM-based TTS systems, especially with diverse training data, outperforming the base model.

Findings

01 LoRA fine-tuning raises DNS-MOS by up to 0.42 points for speakers whose training data has sufficient acoustic variability.

02 Speaker fidelity improves for all evaluated speakers, with consistent gains in voice similarity.

03 Signal-to-noise ratio improves by up to 34% with diverse data.

Abstract

Large language models are increasingly adopted as semantic backbones for neural text-to-speech (TTS) systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments on fine-tuning the language model backbone of a TTS system show promise in improving voice consistency and signal-to-noise ratio (SNR) in the voice cloning task. Across multiple speakers, LoRA fine-tuning consistently outperforms the non-fine-tuned base Qwen-0.5B model on three complementary dimensions of speech quality. First, perceptual quality improves significantly, with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers, with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity…
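For readers unfamiliar with LoRA, the following is a minimal sketch of the general low-rank update it relies on, not the authors' implementation. The frozen weight W is left untouched and only two small matrices A and B are trained, giving an effective weight W + (alpha / r) · B·A. The dimensions, rank, scaling value, and helper names below are illustrative assumptions; this page does not state the paper's actual LoRA configuration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, r, seed=0):
    """Common LoRA initialisation: A is small random, B is all zeros."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)  # shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative sizes only (assumed, not taken from the paper).
d_out, d_in, r, alpha = 8, 8, 2, 4
rng = random.Random(1)
w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight

a, b = init_lora(d_out, d_in, r)
w_eff = lora_effective_weight(w, a, b, alpha)

trainable_params = r * (d_in + d_out)  # parameters in A and B
full_params = d_out * d_in             # parameters in W
```

Because B starts at zero, the adapted layer initially reproduces the frozen base model exactly, and training moves it away only through the low-rank term. In this toy setting the adapter holds half as many parameters as W; at realistic layer sizes with a small rank the fraction is far smaller.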

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research