Next Tokens Denoising for Speech Synthesis

Yanqing Liu; Ruiqing Xue; Chong Zhang; Yufei Liu; Gang Wang; Bohan Li; Yao Qian; Lei He; Shujie Liu; Sheng Zhao

arXiv:2507.22746 · cs.SD · August 4, 2025

TL;DR

The paper introduces Dragon-FM, a text-to-speech (TTS) model that combines autoregressive and flow-matching techniques for fast, coherent, high-quality speech synthesis, and is especially suited to long-form content such as podcasts.

Contribution

Dragon-FM unifies autoregressive and flow-matching methods for speech synthesis, enabling efficient chunk-wise processing with global coherence and bidirectional context utilization.

Findings

01. Processes 48 kHz audio codec tokens at a compact rate of 12.5 tokens per second.

02. Demonstrates high-quality zero-shot podcast generation.

03. Bridges continuous and discrete feature modeling.
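
To give a sense of scale for the 12.5 tokens-per-second figure, the arithmetic below works out how many 48 kHz samples one token spans and how many tokens a recording of a given length needs. It is only a back-of-the-envelope sketch: the function names are illustrative, the 10-minute duration is an arbitrary example, and the calculation assumes one token per frame, which the summary does not state.

```python
import math

SAMPLE_RATE_HZ = 48_000   # audio sample rate quoted in the summary
TOKEN_RATE_HZ = 12.5      # codec token rate quoted in the summary


def samples_per_token(sample_rate_hz=SAMPLE_RATE_HZ, token_rate_hz=TOKEN_RATE_HZ):
    """Number of raw audio samples covered by a single token."""
    return sample_rate_hz / token_rate_hz


def tokens_for_duration(seconds, token_rate_hz=TOKEN_RATE_HZ):
    """Number of tokens needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * token_rate_hz)


if __name__ == "__main__":
    print(samples_per_token())        # samples spanned by one token
    print(tokens_for_duration(600))   # tokens for a 10-minute recording
```

Under these assumptions each token covers 3840 samples (80 ms of audio), so a 10-minute recording corresponds to 7500 token positions.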

Abstract

While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Thus, the model leverages KV-cache across chunks and utilizes bidirectional context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR…
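
The chunk-wise scheme described in the abstract (autoregression across chunks, parallel flow-matching denoising within each chunk) can be illustrated with a toy loop. This is a minimal sketch under heavy assumptions, not the Dragon-FM implementation: `toy_velocity` is a hypothetical closed-form stand-in for a learned network, the `history` list only mimics the role of a KV-cache, and the chunk size, step count, and Euler solver are arbitrary choices.

```python
import random


def toy_velocity(chunk, t, history):
    """Hypothetical stand-in for a learned velocity network.

    It receives the whole noisy chunk at once (all positions are updated in
    parallel) plus the previously generated chunks. The toy target is the
    mean of the previous chunk plus 1.0, and the returned velocity is the one
    that moves each value along a straight path to that target.
    """
    prev_mean = sum(history[-1]) / len(history[-1]) if history else 0.0
    target = prev_mean + 1.0
    return [(target - x) / (1.0 - t) for x in chunk]


def denoise_chunk(history, chunk_size, num_steps, rng):
    """Start from Gaussian noise and integrate the velocity with Euler steps."""
    chunk = [rng.gauss(0.0, 1.0) for _ in range(chunk_size)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        velocity = toy_velocity(chunk, t, history)
        chunk = [x + dt * v for x, v in zip(chunk, velocity)]
    return chunk


def generate(num_chunks, chunk_size=4, num_steps=8, seed=0):
    """Generate chunks one after another, each conditioned on earlier ones."""
    rng = random.Random(seed)
    history = []  # finished chunks are stored once and reused, never recomputed
    for _ in range(num_chunks):
        history.append(denoise_chunk(history, chunk_size, num_steps, rng))
    return history


if __name__ == "__main__":
    for i, chunk in enumerate(generate(3)):
        print(i, [round(x, 6) for x in chunk])
```

The outer loop is strictly sequential, which is where cached context from earlier chunks pays off, while the inner loop takes only a handful of steps and updates every position in the chunk together.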

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Recommender Systems and Techniques