KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction
Kangxiang Xia, Xinfa Zhu, Jixun Yao, Wenjie Tian, Wenhao Li, Lei Xie

TL;DR
KALL-E is an autoregressive TTS model that predicts distributions over continuous speech frames directly from text, with no diffusion components. The authors report superior synthesis quality and adaptation to a target speaker from a single sample.
Contribution
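The paper uses a Flow-VAE to extract a continuous latent speech representation from waveforms and trains a single AR Transformer, with a Kullback-Leibler divergence loss, to predict distributions over that representation from text. This removes the need for both discrete speech tokens and diffusion-based components.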
Findings
Achieves superior speech synthesis quality in the reported experiments.
Can adapt to a target speaker from a single sample.
Operates without diffusion-based components.
Abstract
We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we utilize a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback-Leibler divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can even adapt to a target speaker from just a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous…
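To make the Kullback-Leibler objective mentioned in the abstract concrete, here is a minimal sketch of a per-frame KL loss between two distributions over continuous speech frames. It rests on assumptions not stated in the text above: that both the target distribution (for example, a Flow-VAE posterior) and the predicted distribution are diagonal Gaussians parameterized by a mean and a log-variance, and that the loss is KL(target || predicted) averaged over frames. The function names `gaussian_kl` and `sequence_kl_loss` are hypothetical and used only for illustration; this is not the authors' implementation.

```python
import math


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions.

    Assumption: each distribution is given as a list of means and a list
    of log-variances of equal length.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        # Closed form per dimension:
        # 0.5 * (log(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return total


def sequence_kl_loss(target_frames, predicted_frames):
    """Mean per-frame KL(target || predicted) over a sequence.

    Assumption: each frame is a (mean, log-variance) pair; the direction
    of the divergence is chosen for illustration only.
    """
    kls = [
        gaussian_kl(t_mu, t_lv, p_mu, p_lv)
        for (t_mu, t_lv), (p_mu, p_lv) in zip(target_frames, predicted_frames)
    ]
    return sum(kls) / len(kls)


# Two 2-dimensional frames: the first is predicted exactly,
# the second has a mean that is off by 1.0 in one dimension.
target = [([0.0, 0.0], [0.0, 0.0]), ([1.0, 0.0], [0.0, 0.0])]
predicted = [([0.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [0.0, 0.0])]
loss = sequence_kl_loss(target, predicted)
```

With unit variances, a mean error of 1.0 in one dimension contributes a KL of 0.5 for that frame, so the two-frame average in this toy example is 0.25. In an autoregressive setup of the kind described, the predicted parameters for each frame would come from the Transformer conditioned on the text and the preceding frames.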
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Dilated Causal Convolution · Mixture of Logistic Distributions · WaveNet · Normalizing Flows · WaveVAE
