DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Jiatao Gu; Yuyang Wang; Yizhe Zhang; Qihang Zhang; Dinghuai Zhang; Navdeep Jaitly; Josh Susskind; Shuangfei Zhai

arXiv:2410.08159 · cs.CV · January 24, 2025

Open Access

TL;DR

DART introduces a transformer-based, non-Markovian model that unifies autoregressive and diffusion methods for scalable, efficient, and high-quality text-to-image generation without relying on image quantization.

Contribution

It proposes a novel non-Markovian framework that combines autoregressive and diffusion models, enabling more effective image modeling and training with both text and image data.

Findings

01. Competitive performance on class-conditioned and text-to-image tasks

02. Sets new benchmarks for scalable image synthesis

03. Avoids image quantization for better image quality

Abstract

Diffusion models have become the dominant approach for visual generation. They are trained by denoising a Markovian process which gradually adds noise to the input. We argue that the Markovian property limits the model's ability to fully utilize the generation trajectory, leading to inefficiencies during training and inference. In this paper, we propose DART, a transformer-based model that unifies autoregressive (AR) and diffusion within a non-Markovian framework. DART iteratively denoises image patches spatially and spectrally using an AR model that has the same architecture as standard language models. DART does not rely on image quantization, which enables more effective image modeling while maintaining flexibility. Furthermore, DART seamlessly trains with both text and image data in a unified model. Our approach demonstrates competitive performance on class-conditioned and…
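The contrast the abstract draws between a Markovian and a non-Markovian denoiser can be illustrated with a small sketch. This is not DART's implementation: the variance-preserving noising step, the constant `betas` schedule, and the helper names `forward_trajectory`, `markov_context`, and `non_markov_context` are all assumptions chosen for illustration, since the abstract does not specify them. The sketch only shows which parts of the noising trajectory each kind of model is conditioned on.

```python
import math
import random


def forward_trajectory(x0, betas, rng):
    """Markovian forward process (assumed DDPM-style step, not taken from the paper).

    Each state x_t is computed from x_{t-1} alone:
        x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,  eps ~ N(0, 1)
    Returns the full trajectory [x_0, x_1, ..., x_T].
    """
    traj = [list(x0)]
    for beta in betas:
        prev = traj[-1]
        nxt = [
            math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for v in prev
        ]
        traj.append(nxt)
    return traj


def markov_context(traj, t):
    """Input of a Markovian denoiser predicting x_{t-1}: only the current state x_t."""
    return [traj[t]]


def non_markov_context(traj, t):
    """Input of a non-Markovian (autoregressive) denoiser predicting x_{t-1}:
    every noisier state generated so far, ordered x_T, ..., x_t."""
    return list(reversed(traj[t:]))


rng = random.Random(0)                      # fixed seed for reproducibility
betas = [0.1, 0.1, 0.1, 0.1]                # hypothetical constant schedule, T = 4
traj = forward_trajectory([1.0, -1.0], betas, rng)

print(len(markov_context(traj, 2)))         # number of states a Markovian model sees
print(len(non_markov_context(traj, 2)))     # number of states an AR model sees
```

With `T = 4` and `t = 2`, the Markovian context holds one state while the autoregressive context holds three (`x_4`, `x_3`, `x_2`), which is the sense in which a non-Markovian model can use more of the generation trajectory.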

Taxonomy

Topics: Image Retrieval and Classification Techniques · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Difficulty-Aware Rejection Tuning · Diffusion