D-AR: Diffusion via Autoregressive Models

Ziteng Gao; Mike Zheng Shou

arXiv:2505.23660 · cs.CV · May 30, 2025

TL;DR

D-AR introduces a novel approach to image diffusion by modeling it as an autoregressive process on discrete tokens, enabling efficient, sequential image generation with properties like consistent previews and zero-shot layout control.

Contribution

The paper proposes recasting image diffusion as an autoregressive token prediction task, leveraging standard next-token prediction without modifications, and demonstrates its effectiveness on ImageNet.

Findings

01 Achieves 2.09 FID on class-conditional ImageNet 256×256 with a 775M Llama backbone.

02 Supports consistent previews from partially generated token sequences.

03 Enables zero-shot layout-controlled image synthesis.
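Reviewer 01 below describes the layout control as fixing prefix tokens without finetuning. A minimal sketch of that idea, assuming a generic next-token sampler: the names `generate` and `toy_next_token` are hypothetical, and the toy predictor is only a deterministic stand-in for a trained autoregressive model, not the paper's implementation.

```python
from typing import Callable, List, Sequence


def generate(next_token: Callable[[List[int]], int],
             seq_len: int,
             prefix: Sequence[int] = ()) -> List[int]:
    """Autoregressively extend a fixed prefix up to seq_len tokens.

    The prefix is copied verbatim and never resampled, so whatever
    coarse content it encodes is shared by every completion.
    """
    tokens = list(prefix)
    if len(tokens) > seq_len:
        raise ValueError("prefix is longer than the target sequence")
    while len(tokens) < seq_len:
        tokens.append(next_token(tokens))
    return tokens


def toy_next_token(context: List[int], vocab_size: int = 16) -> int:
    """Hypothetical stand-in for a model: a deterministic function of the context."""
    return (sum(context) * 31 + len(context)) % vocab_size


# Keep the first three tokens fixed and let the "model" fill in the rest.
sample = generate(toy_next_token, seq_len=8, prefix=[3, 1, 4])
```

Because early tokens correspond to coarse content in a coarse-to-fine ordering, reusing the same prefix would be expected to hold the coarse layout fixed while later tokens vary the detail. The same loop, stopped early, gives the partial sequences from which previews are decoded.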

Abstract

This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens,…
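The correspondence between token positions and denoising steps can be sketched as a simple grouping. The sequence length of 256 comes from Reviewer 03's comments below; the number of groups (8) and the function names `group_tokens` and `steps_ready` are assumptions made for illustration, not values or APIs confirmed by the paper.

```python
from typing import List, Sequence


def group_tokens(tokens: Sequence[int], num_groups: int) -> List[List[int]]:
    """Split a token sequence into contiguous, equal-size groups.

    Under the coarse-to-fine reading, group k would condition the
    k-th denoising step in pixel space.
    """
    if num_groups <= 0 or len(tokens) % num_groups != 0:
        raise ValueError("sequence length must be a positive multiple of num_groups")
    size = len(tokens) // num_groups
    return [list(tokens[i * size:(i + 1) * size]) for i in range(num_groups)]


def steps_ready(num_generated: int, seq_len: int, num_groups: int) -> int:
    """Number of denoising steps whose token group is fully generated."""
    size = seq_len // num_groups
    return min(num_generated // size, num_groups)


# 256 tokens split into 8 groups of 32 (the group count is an assumption).
groups = group_tokens(list(range(256)), num_groups=8)
ready = steps_ready(num_generated=100, seq_len=256, num_groups=8)
```

With these assumed numbers, 100 generated tokens complete three groups, so three denoising steps could already run while the autoregressive model keeps producing tokens. This is the sense in which token generation mirrors the diffusion procedure.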

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

(1) Its sequential diffusion tokenizer naturally imposes a coarse-to-fine token ordering aligned with diffusion denoising steps, which is highly suitable for autoregressive modeling. (2) The framework supports streaming, consistent previews during generation at no extra cost by leveraging diffusion’s ability to jump-estimate final images from partial token sequences. (3) It enables zero-shot layout-controlled synthesis by fixing prefix tokens, all without finetuning. (4) D-AR benefits from t

Weaknesses

(1) I find D-AR to be a very interesting framework that combines two well-established paradigms. However, the motivation for introducing this new hybrid paradigm is unclear—current diffusion-based or purely autoregressive (AR) approaches already perform very well in image generation. Why is this fused paradigm necessary? What specific problem does it solve? (2) D-AR cannot be easily integrated with state-of-the-art step distillation methods, such as one-step image generation. If step distillatio

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The idea of projecting the diffusion process onto a vanilla autoregressive procedure is interesting. 2. The paper is well written and easy to follow.

Weaknesses

1. Although the idea is interesting, I am not entirely convinced that this approach truly demonstrates a synergistic effect between autoregressive modeling and diffusion. The autoregressive component still relies on discrete tokens, which are then used as conditions for the diffusion model. Consequently, the information loss inherent in the discrete tokens is propagated into the diffusion process. This is evidenced by the higher rFID of the diffusion tokenizer compared to its continuous VAE coun

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Novel framework: Integrates diffusion and autoregressive modeling, providing a unified approach that preserves the strengths of both paradigms. 2. Sequential diffusion tokenizer: Generates coarse-to-fine token sequences corresponding to diffusion steps, naturally suited for AR modeling. 3. Preserves vanilla AR architecture: Works with standard decoder-only Transformers without modifications to causal masks, attention, or training schemes. Supports streaming pixel decoding, consistent previews

Weaknesses

1. Dataset and task limitation: Experiments are only on ImageNet 256×256 class-conditional generation. High-resolution images, other datasets (COCO, LSUN, FFHQ), or tasks (image repair, image edit) are not tested. 2. Model complexity and inference speed: The sequential diffusion tokenizer adds 300M parameters, and AR generation requires predicting 256 tokens sequentially. Although KV caching and token grouping mitigate some overhead, more tokens still increase generation latency compared to smal

Code & Models

Repositories

showlab/d-ar · PyTorch · Official

Taxonomy

Methods: Diffusion · LLaMA