KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

Kangxiang Xia; Xinfa Zhu; Jixun Yao; Wenjie Tian; Wenhao Li; Lei Xie

arXiv:2412.16846 · eess.AS · September 18, 2025

TL;DR

KALL-E is an autoregressive TTS model that predicts the distribution of the next continuous speech frame directly from text. It needs no diffusion-based components and can adapt to a target speaker from a single sample.

Contribution

KALL-E uses a Flow-VAE to extract a continuous latent speech representation from waveforms and trains a single AR Transformer to predict the distributions of these latents from text, removing the need for both discrete speech tokens and diffusion-based components.

Findings

01. Achieves superior speech synthesis quality.

02. Can adapt to a new speaker from a single sample.

03. Operates without diffusion-based components.

Abstract

We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we utilize a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback-Leibler divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can even adapt to a target speaker from just a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous…
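The KL divergence objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes that both the Flow-VAE posterior and the Transformer's predicted distribution for a frame are diagonal Gaussians parameterized by a mean and a log-variance; the abstract does not state the parameterization or the direction of the divergence, so the function name `gaussian_kl` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(p || q) between two diagonal Gaussians.

    Assumption (not from the paper): p is the target distribution of one
    speech frame and q is the model's prediction, each given as per-dimension
    means and log-variances. The result is summed over dimensions.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        var_p = math.exp(lp)
        var_q = math.exp(lq)
        # Per-dimension term: 0.5 * (log(var_q / var_p)
        #                            + (var_p + (mp - mq)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (var_p + (mp - mq) ** 2) / var_q - 1.0)
    return total


# Identical distributions give zero divergence.
same = gaussian_kl([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])

# Unit-variance Gaussians whose means differ by 1 in one dimension.
shifted = gaussian_kl([1.0], [0.0], [0.0], [0.0])
```

In a training loop, a term of this kind would be averaged over frames and batch items. Predicting a distribution rather than a point estimate is what lets the model sample each next frame at inference time without a diffusion stage.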

Code & Models

Models

🤗 kxxia/KALL-E (model)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Dilated Causal Convolution · Mixture of Logistic Distributions · WaveNet · Normalizing Flows · WaveVAE