CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis
Chun Yat Wu, Jiajun Deng, Guinan Li, Qiuqiang Kong, Simon Lui

TL;DR
CLEAR introduces a continuous latent autoregressive framework for zero-shot speech synthesis. By modeling continuous audio representations directly, it achieves high-quality, low-latency, streaming-capable TTS.
Contribution
The paper proposes a unified zero-shot TTS framework that models continuous audio representations with an enhanced variational autoencoder and a lightweight flow head, reducing inference latency and improving synthesis quality.
Findings
CLEAR achieves state-of-the-art results on LibriSpeech, with a word error rate (WER) of 1.88%.
It has a real-time factor of 0.29, so synthesis runs faster than real time.
It supports streaming synthesis with a 96 ms delay.
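For readers unfamiliar with the two headline metrics, the sketch below shows how WER and real-time factor are conventionally computed. It is a minimal illustration under standard definitions, not the paper's evaluation code: the function names are hypothetical, the example durations and sentences are made up, and the paper's actual protocol (ASR model, text normalization, hardware) is not given in this summary.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time spent synthesizing divided by the duration of the audio produced.

    Values below 1.0 mean synthesis is faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative numbers only (not taken from the paper):
# 2.9 s of compute for 10 s of audio gives an RTF of about 0.29.
rtf = real_time_factor(2.9, 10.0)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In TTS evaluation the hypothesis text is usually obtained by transcribing the synthesized audio with an ASR system, so the reported WER depends on that recognizer as well as on the synthesizer.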
Abstract
Autoregressive (AR) language models have emerged as powerful solutions for zero-shot text-to-speech (TTS) synthesis, capable of generating natural speech from a few seconds of audio prompts. However, conventional AR-based TTS systems relying on discrete audio tokens face the challenge of lossy compression during tokenization, requiring longer discrete token sequences to capture the same information as continuous ones, which adds inference latency and complicates AR modeling. To address this challenge, this paper proposes the Continuous Latent Autoregressive model (CLEAR), a unified zero-shot TTS framework that directly models continuous audio representations. More specifically, CLEAR introduces an enhanced variational autoencoder with shortcut connections, which achieves a high compression ratio to map waveforms into compact continuous latents. A lightweight MLP-based rectified flow…
