LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

Detai Xin; Shujie Hu; Chengzuo Yang; Chen Huang; Guoqiao Yu; Guanglu Wan; Xunliang Cai

arXiv:2603.29339 · cs.SD · April 1, 2026

TL;DR

LongCat-AudioDiT is a non-autoregressive diffusion TTS model that operates directly in the waveform latent space. It needs only a waveform VAE (Wav-VAE) and a diffusion backbone, and it reports state-of-the-art zero-shot voice cloning without a complex multi-stage pipeline.

Contribution

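The paper presents a diffusion TTS model that generates speech in the waveform latent space rather than through intermediate representations such as mel-spectrograms. It also changes inference in two ways: it corrects a training-inference mismatch, and it replaces classifier-free guidance with adaptive projection guidance.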

Findings

01. Achieves SOTA zero-shot voice cloning on the Seed benchmark.

02. Outperforms previous models in speaker similarity scores.

03. Shows that higher Wav-VAE reconstruction fidelity does not always improve TTS performance.

Abstract

We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage…
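
The abstract contrasts classifier-free guidance (CFG) with adaptive projection guidance (APG) without giving formulas. The sketch below illustrates the general idea of projection-style guidance as it is commonly described: split the guidance direction into a part parallel to the conditional prediction and a part orthogonal to it, then down-weight the parallel part. It is a minimal illustration under stated assumptions, not the paper's implementation. The function names `cfg` and `apg`, the parameter `eta`, and the use of flat Python lists are all hypothetical, and the norm rescaling and momentum terms that APG variants often include are omitted.

```python
def cfg(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg(cond, uncond, scale, eta=0.0):
    """Projection-style guidance (illustrative sketch, assumed form).

    The guidance direction (cond - uncond) is split into a component
    parallel to `cond` and a component orthogonal to it. The parallel
    component is weighted by `eta` (hypothetical parameter name).
    With eta = 1.0 this reduces exactly to `cfg`.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = sum(c * c for c in cond)
    # Projection coefficient of diff onto cond; zero if cond is the zero vector.
    coeff = 0.0 if norm_sq == 0.0 else sum(d * c for d, c in zip(diff, cond)) / norm_sq
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [
        c + (scale - 1.0) * (o + eta * p)
        for c, o, p in zip(cond, orthogonal, parallel)
    ]
```

In this sketch, setting `eta=0.0` keeps only the orthogonal part of the update, so the guided output differs from the conditional prediction only in directions perpendicular to it. A real model would apply this per sample to tensor-valued predictions at each denoising step.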
