Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

Shehzeen Hussain; Paarth Neekhara; Xuesong Yang; Edresson Casanova; Subhankar Ghosh; Mikyas T. Desta; Roy Fejgin; Rafael Valle; Jason Li

arXiv:2502.05236 · cs.SD · July 24, 2025

TL;DR

Koel-TTS is a novel speech synthesis model that improves controllability, speaker similarity, and naturalness by integrating preference alignment and classifier-free guidance, outperforming existing models on key metrics.

Contribution

The paper introduces Koel-TTS, a Transformer-based TTS model that incorporates preference alignment and classifier-free guidance to enhance speech quality and controllability.
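As a rough illustration of how preference data could be assembled when alignment is guided by automatic speech recognition and speaker verification models (as the abstract describes), the sketch below ranks several candidate generations for one prompt and keeps the best and worst as a chosen/rejected pair. The `Candidate` class, its field names, the lexicographic ranking rule, and the scores are all hypothetical; the page does not specify how the authors score or pair samples.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    audio_id: str
    cer: float          # character error rate from an ASR model (lower is better)
    speaker_sim: float  # similarity from a speaker verification model (higher is better)

def make_preference_pair(candidates):
    """Return (chosen, rejected) for one prompt.

    Assumed ranking rule: sort by CER ascending, break ties by
    speaker similarity descending. This is an illustrative choice only.
    """
    ranked = sorted(candidates, key=lambda c: (c.cer, -c.speaker_sim))
    return ranked[0], ranked[-1]

# Made-up scores for three generations of the same text and reference speaker.
samples = [
    Candidate("a", cer=0.05, speaker_sim=0.70),
    Candidate("b", cer=0.01, speaker_sim=0.82),
    Candidate("c", cer=0.30, speaker_sim=0.40),
]
chosen, rejected = make_preference_pair(samples)
```

Pairs built this way could then feed a preference-optimization objective; which objective is used is not stated on this page.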

Findings

01. Significantly improves speaker similarity and naturalness.

02. Outperforms state-of-the-art TTS models on key metrics.

03. Achieves high-quality synthesis with less training data.

Abstract

While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned…
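To make the classifier-free guidance idea concrete, here is a minimal sketch of one common formulation: the next-token logits computed with conditioning (transcript and reference audio) are pushed away from the logits computed without it. The function names, the toy logit values, and the scale of 2.0 are assumptions for illustration, not values taken from the paper.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Combine logits as uncond + scale * (cond - uncond).

    scale = 1.0 reproduces the conditional logits; larger values
    weight the conditioning more heavily.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 3-token vocabulary (hypothetical numbers).
cond = [2.0, 0.5, -1.0]    # with transcript and speaker context
uncond = [1.0, 1.0, 0.0]   # with the conditioning dropped
guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)    # distribution to sample the next token from
```

In this toy case the token favored by the conditional pass receives more probability mass after guidance than it would from the conditional logits alone, which is the intended effect of stronger adherence to the conditioning inputs.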

Code & Models

Models

🤗 nvidia/magpie_tts_multilingual_357m · model · 1.3k downloads · 111 likes

Videos

Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance· underline

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding