Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Mikyas T. Desta, Roy Fejgin, Rafael Valle, Jason Li

TL;DR
Koel-TTS is a novel speech synthesis model that improves controllability, speaker similarity, and naturalness by integrating preference alignment and classifier-free guidance, outperforming existing models on key metrics.
Contribution
The paper introduces Koel-TTS, a Transformer-based TTS model that incorporates preference alignment and classifier-free guidance to enhance speech quality and controllability.
Findings
Significantly improves speaker similarity and naturalness.
Outperforms state-of-the-art TTS models on key metrics.
Achieves high-quality synthesis with less training data.
Abstract
While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding
