ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Jinyi Hu; Shengding Hu; Yuxuan Song; Yufei Huang; Mingxuan Wang; Hao Zhou; Zhiyuan Liu; Wei-Ying Ma; Maosong Sun

arXiv:2412.07720·cs.CV·January 30, 2026

TL;DR

ACDiT combines autoregressive and diffusion modeling for continuous visual data: it generates block by block, with each block produced by a diffusion process conditioned on the preceding blocks, and the pretrained model can be transferred to visual understanding tasks.

Contribution

The paper proposes ACDiT, an autoregressive blockwise conditional diffusion transformer whose block-wise autoregressive unit interpolates between token-wise autoregression and full-sequence diffusion for visual generation.
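The interpolation can be pictured through the block size alone. The following is a minimal sketch, not the authors' code: `split_into_blocks` is a hypothetical helper name, and the sequence length of 8 is chosen only for illustration. With a block size of 1, every token is its own autoregressive step (token-wise autoregression); with a block size equal to the sequence length, there is a single block (full-sequence diffusion); anything in between is block-wise.

```python
from typing import List


def split_into_blocks(seq_len: int, block_size: int) -> List[range]:
    """Partition token positions 0..seq_len-1 into consecutive blocks.

    Hypothetical helper for illustration; the last block may be shorter
    when block_size does not divide seq_len.
    """
    if seq_len <= 0 or block_size <= 0:
        raise ValueError("seq_len and block_size must be positive")
    return [
        range(start, min(start + block_size, seq_len))
        for start in range(0, seq_len, block_size)
    ]


seq_len = 8  # illustrative sequence length, not a value from the paper
for block_size in (1, 4, 8):
    blocks = split_into_blocks(seq_len, block_size)
    # One autoregressive step per block; each step would run its own
    # conditional diffusion process over the tokens in that block.
    print(f"block_size={block_size}: {len(blocks)} autoregressive step(s)",
          [list(b) for b in blocks])
```

The number of blocks is the number of autoregressive steps, so the block size trades sequential decoding steps against the number of tokens each diffusion process has to denoise jointly.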

Findings

01. ACDiT outperforms autoregressive baselines in visual generation tasks.

02. Pretrained ACDiT can be transferred to visual understanding tasks.

03. The model effectively balances autoregressive and diffusion processes for long-horizon generation.

Abstract

Autoregressive and diffusion models have achieved remarkable progress in language models and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: during training, it only requires applying a specially designed Skip-Causal Attention Mask to the standard diffusion transformer. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of…
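The abstract names the Skip-Causal Attention Mask without defining it, so the following is only a sketch of one plausible construction, not the authors' implementation (the official code is in the thunlp/acdit repository). It assumes a training sequence laid out as all clean blocks followed by their noised copies, and it assumes three rules: a clean block attends to clean blocks up to and including itself; a noised block attends to strictly earlier clean blocks and to itself; nothing else is visible. The function name `skip_causal_mask` and the layout are hypothetical.

```python
from typing import List


def skip_causal_mask(num_blocks: int, block_size: int) -> List[List[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if q may attend to k.

    ASSUMED layout (not confirmed by the source):
        [clean block 0 .. clean block N-1, noised block 0 .. noised block N-1]
    ASSUMED rules (not confirmed by the source):
        clean block i  -> clean blocks j <= i
        noised block i -> clean blocks j < i, plus noised block i itself
    """
    n = num_blocks * block_size   # tokens in the clean half
    total = 2 * n                 # clean half + noised half
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_noised = q >= n
        q_block = (q % n) // block_size
        for k in range(total):
            k_noised = k >= n
            k_block = (k % n) // block_size
            if not q_noised:
                # Clean query: block-causal over the clean half only.
                mask[q][k] = (not k_noised) and k_block <= q_block
            elif k_noised:
                # Noised query, noised key: only its own block.
                mask[q][k] = k_block == q_block
            else:
                # Noised query, clean key: strictly earlier blocks, so the
                # clean copy of its own block is skipped.
                mask[q][k] = k_block < q_block
    return mask


if __name__ == "__main__":
    # Tiny illustrative case: 3 blocks of 2 tokens each.
    for row in skip_causal_mask(num_blocks=3, block_size=2):
        print("".join("1" if allowed else "." for allowed in row))
```

Under these assumptions, the "skip" is that a noised block never sees the clean version of itself, which is the content it is being trained to denoise, while still conditioning on every earlier clean block. In a PyTorch setting, a mask like this could be converted to a boolean tensor and passed as the attention mask of a standard transformer block.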

Code & Models

Repositories

thunlp/acdit
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing