CLEAR: Continuous Latent Autoregressive Modeling for High-quality and Low-latency Speech Synthesis

Chun Yat Wu; Jiajun Deng; Guinan Li; Qiuqiang Kong; Simon Lui

arXiv:2508.19098·eess.AS·August 27, 2025

TL;DR

CLEAR introduces a continuous latent autoregressive framework for zero-shot speech synthesis. By modeling continuous audio representations directly, it achieves high-quality, low-latency, streaming-capable TTS.

Contribution

The paper proposes a unified zero-shot TTS framework that models continuous audio representations with an enhanced variational autoencoder and a lightweight flow head, reducing inference latency and improving synthesis quality.

Findings

01

CLEAR achieves state-of-the-art results on LibriSpeech with 1.88% WER.

02

CLEAR has a low real-time factor (RTF) of 0.29, enabling fast synthesis.

03

It supports streaming synthesis with a 96 ms delay.
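
To make the two headline metrics above concrete, here is a minimal sketch of how word error rate (WER) and real-time factor (RTF) are commonly defined. It assumes the standard definitions (word-level edit distance divided by reference length, and synthesis time divided by audio duration); the function names are hypothetical and this is not the paper's evaluation code.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.

    Values below 1.0 mean synthesis runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative numbers only: 10 s of audio synthesized in 2.9 s.
rtf = real_time_factor(2.9, 10.0)
# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Under this definition, an RTF of 0.29 corresponds to roughly 2.9 seconds of compute per 10 seconds of generated speech. In TTS evaluation the hypothesis text typically comes from running a speech recognizer on the synthesized audio.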

Abstract

Autoregressive (AR) language models have emerged as powerful solutions for zero-shot text-to-speech (TTS) synthesis, capable of generating natural speech from a few seconds of audio prompts. However, conventional AR-based TTS systems relying on discrete audio tokens face the challenge of lossy compression during tokenization, requiring longer discrete token sequences to capture the same information as continuous ones, which adds inference latency and complicates AR modeling. To address this challenge, this paper proposes the Continuous Latent Autoregressive model (CLEAR), a unified zero-shot TTS framework that directly models continuous audio representations. More specifically, CLEAR introduces an enhanced variational autoencoder with shortcut connections, which achieves a high compression ratio to map waveforms into compact continuous latents. A lightweight MLP-based rectified flow…
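
The abstract is cut off before it describes the flow head, so the following is only a generic illustration of how rectified-flow sampling works in general, not CLEAR's actual head. It is a minimal sketch that integrates a velocity field from noise (t = 0) to a sample (t = 1) with Euler steps. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field stands in for a learned MLP, which in a real system would presumably also be conditioned on the autoregressive model's output (an assumption).

```python
import random


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy stand-in for a learned velocity network (assumption).

    Rectified flow uses straight paths x_t = (1 - t) * x0 + t * x1, so for a
    single known endpoint x1 = target the velocity at (x_t, t) is
    (target - x_t) / (1 - t).
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target = [0.5, -1.0, 2.0]                            # toy "latent" to recover
noise = [random.gauss(0.0, 1.0) for _ in target]     # starting point at t = 0
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
```

Because the toy paths are exactly straight, a handful of Euler steps is enough to land on the target here. Rectified flow is generally motivated by this property of near-straight trajectories, which allows sampling with few steps.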
