DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
Jiatao Gu, Yuyang Wang, Yizhe Zhang, Qihang Zhang, Dinghuai Zhang, Navdeep Jaitly, Josh Susskind, Shuangfei Zhai

TL;DR
DART introduces a transformer-based, non-Markovian model that unifies autoregressive and diffusion methods for scalable, efficient, and high-quality text-to-image generation without relying on image quantization.
Contribution
It proposes a novel non-Markovian framework that combines autoregressive and diffusion models, enabling more effective image modeling and training with both text and image data.
Findings
Competitive performance on class-conditioned and text-to-image tasks
Sets new benchmarks for scalable image synthesis
Avoids image quantization for better image quality
Abstract
Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process which gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model that has the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and…
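One way to picture the difference between a Markovian and a non-Markovian denoiser is through the attention mask of an autoregressive transformer. The sketch below is illustrative only and is not taken from the paper: it assumes the denoising trajectory is laid out as a sequence of steps, each holding the same number of patch tokens, and the function name trajectory_attention_mask and its parameters are hypothetical. In the Markovian case a step sees only its own noisy input; in the non-Markovian case it can also see every earlier step of the trajectory.

```python
import numpy as np


def trajectory_attention_mask(num_steps, tokens_per_step, markovian=False):
    """Boolean mask; entry [q, k] is True when query token q may attend to key token k.

    Hypothetical helper for illustration, not the paper's implementation.
    """
    # Label every token with the index of the denoising step it belongs to.
    step_of_token = np.repeat(np.arange(num_steps), tokens_per_step)
    query_step = step_of_token[:, None]
    key_step = step_of_token[None, :]
    if markovian:
        # Markovian view: a step sees only the tokens of its own noisy input.
        return query_step == key_step
    # Non-Markovian view: a step also sees the tokens of every earlier step.
    return query_step >= key_step


mask = trajectory_attention_mask(num_steps=3, tokens_per_step=2)
print(mask.astype(int))
```

With three steps of two tokens each, the non-Markovian mask is block lower-triangular, so tokens in the last step can attend to the whole trajectory, while the Markovian mask keeps only the diagonal blocks.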
Taxonomy
Topics: Image Retrieval and Classification Techniques · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
Methods: Difficulty-Aware Rejection Tuning · Diffusion
