DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

Dongya Jia; Zhuo Chen; Jiawei Chen; Chenpeng Du; Jian Wu; Jian Cong; Xiaobin Zhuang; Chumin Li; Zhen Wei; Yuping Wang; Yuxuan Wang

arXiv:2502.03930 · eess.AS · December 9, 2025

TL;DR

DiTAR is a patch-based autoregressive framework that combines a language model with a diffusion transformer. It improves the quality and efficiency of speech generation and reports state-of-the-art results in robustness, speaker similarity, and naturalness.

Contribution

The paper presents DiTAR, a new diffusion transformer autoregressive model that enhances continuous speech generation by reducing computational load and improving scalability.

Findings

01 Achieves state-of-the-art zero-shot speech generation performance.

02 Demonstrates superior scalability in extensive analysis.

03 Balances diversity and determinism through temperature control during inference.

Abstract

Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of…
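
The divide-and-conquer loop described in the abstract can be illustrated with a toy sketch: frames are grouped into patches, each patch is aggregated into one embedding, a language model consumes the embeddings, and a diffusion-style head generates the next patch from the language model's output. Everything below is an assumption made for illustration and is not taken from the paper: the mean-pooling aggregator, the running-average stand-in for the language model, the linear denoising schedule, and all function names are hypothetical placeholders for the learned components.

```python
import random
from typing import List

Frame = List[float]   # one continuous speech token (e.g. a latent vector)
Patch = List[Frame]   # a fixed-size group of consecutive frames


def split_into_patches(frames: List[Frame], patch_size: int) -> List[Patch]:
    """Group consecutive frames into non-overlapping patches (ragged tail dropped)."""
    n = len(frames) // patch_size
    return [frames[i * patch_size:(i + 1) * patch_size] for i in range(n)]


def aggregate_patch(patch: Patch) -> Frame:
    """Hypothetical aggregator: mean-pool a patch into a single embedding."""
    dim = len(patch[0])
    return [sum(f[d] for f in patch) / len(patch) for d in range(dim)]


def language_model_step(history: List[Frame]) -> Frame:
    """Placeholder for the causal language model: average of past patch embeddings."""
    dim = len(history[0])
    return [sum(h[d] for h in history) / len(history) for d in range(dim)]


def generate_patch(cond: Frame, patch_size: int, steps: int,
                   rng: random.Random) -> Patch:
    """Placeholder for the diffusion head: start from noise, step toward `cond`."""
    patch = [[rng.gauss(0.0, 1.0) for _ in cond] for _ in range(patch_size)]
    for step in range(steps):
        rate = 1.0 / (steps - step)  # rate reaches 1.0 on the final step
        patch = [[x + rate * (c - x) for x, c in zip(frame, cond)]
                 for frame in patch]
    return patch


def generate(prompt: List[Frame], patch_size: int, num_new_patches: int,
             steps: int = 4, seed: int = 0) -> List[Frame]:
    """Autoregressively extend `prompt` one patch at a time."""
    rng = random.Random(seed)
    patches = split_into_patches(prompt, patch_size)  # needs at least one full patch
    history = [aggregate_patch(p) for p in patches]
    out = [frame for p in patches for frame in p]
    for _ in range(num_new_patches):
        cond = language_model_step(history)            # LM sees patch embeddings only
        new_patch = generate_patch(cond, patch_size, steps, rng)
        history.append(aggregate_patch(new_patch))     # feed the new patch back in
        out.extend(new_patch)
    return out


if __name__ == "__main__":
    prompt = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
    frames = generate(prompt, patch_size=2, num_new_patches=2)
    print(len(frames), frames[-1])
```

The point of the structure is that the sequence model runs once per patch rather than once per frame, while the per-frame detail inside a patch is left to the generation head. In this toy version the head simply converges to the conditioning vector, so it shows the data flow only, not the behavior of a trained model.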

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Attention Is All You Need · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Linear Layer · Multi-Head Attention · Diffusion · Position-Wise Feed-Forward Layer