TL;DR
D-AR recasts image diffusion as standard autoregressive next-token prediction over discrete tokens, so images are generated sequentially, with consistent previews from partial token sequences and zero-shot layout control.
Contribution
The paper proposes recasting image diffusion as an autoregressive token prediction task, leveraging standard next-token prediction without modifications, and demonstrates its effectiveness on ImageNet.
Findings
Achieves 2.09 FID on class-conditional ImageNet 256×256 with a 775M Llama backbone.
Supports consistent previews with partial token generation.
Enables zero-shot layout-controlled image synthesis.
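The layout-control finding can be illustrated with a toy sketch. According to the reviews, control comes from fixing prefix tokens and letting the model generate the rest without finetuning. The code below only shows that control flow: `toy_sampler`, the vocabulary size of 1024, and the prefix length of 32 are hypothetical stand-ins, not values or APIs from the paper.

```python
import random

TOTAL_TOKENS = 256   # sequence length mentioned in the reviews
VOCAB_SIZE = 1024    # assumption: hypothetical codebook size
PREFIX_LEN = 32      # assumption: hypothetical number of fixed "layout" tokens


def toy_sampler(prefix, rng):
    """Hypothetical stand-in for the autoregressive model's next-token sampling."""
    return rng.randrange(VOCAB_SIZE)


def continue_from_prefix(prefix, total_len, sample_next, rng):
    """Keep the given prefix tokens fixed and sample the remaining tokens."""
    tokens = list(prefix)
    while len(tokens) < total_len:
        tokens.append(sample_next(tokens, rng))
    return tokens


# Generate a reference sequence from scratch, then reuse its coarse prefix.
reference = continue_from_prefix([], TOTAL_TOKENS, toy_sampler, random.Random(0))
layout_prefix = reference[:PREFIX_LEN]
variant = continue_from_prefix(layout_prefix, TOTAL_TOKENS, toy_sampler, random.Random(1))
```

Because the early tokens are the coarse ones in a coarse-to-fine ordering, `variant` would share the coarse structure of `reference` in a real model while differing in later, finer tokens. Here the sampler is random, so only the shared prefix is meaningful.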
Abstract
This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens,…
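A minimal sketch of the generation loop described in the abstract, under stated assumptions: tokens are sampled one at a time, and each completed increment of tokens drives one denoising step in pixel space. The group size of 32, the codebook size, and the `next_token` and `denoise_step` functions are hypothetical placeholders. The real tokenizer and model are learned networks, not the toy arithmetic used here.

```python
import random

NUM_TOKENS = 256   # sequence length mentioned in the reviews
GROUP_SIZE = 32    # assumption: hypothetical tokens per denoising step
VOCAB_SIZE = 1024  # assumption: hypothetical codebook size


def next_token(prefix, rng):
    """Hypothetical stand-in for the autoregressive model (plain causal sampling)."""
    return rng.randrange(VOCAB_SIZE)


def denoise_step(x, cond_tokens, step, num_steps):
    """Hypothetical stand-in for one denoising step conditioned on a token group.

    It moves every "pixel" toward a target derived from the tokens; the final
    step (alpha == 1) lands exactly on that target.
    """
    target = sum(cond_tokens) / (len(cond_tokens) * VOCAB_SIZE)
    alpha = 1.0 / (num_steps - step)
    return [xi + alpha * (target - xi) for xi in x]


def generate(seed=0, pixels=4):
    rng = random.Random(seed)
    num_steps = NUM_TOKENS // GROUP_SIZE
    x = [rng.gauss(0.0, 1.0) for _ in range(pixels)]  # start from noise
    tokens, steps_run = [], 0
    for _ in range(NUM_TOKENS):
        tokens.append(next_token(tokens, rng))
        # Each completed group of tokens triggers one denoising step.
        if len(tokens) % GROUP_SIZE == 0:
            group = tokens[-GROUP_SIZE:]
            x = denoise_step(x, group, steps_run, num_steps)
            steps_run += 1
    return tokens, x, steps_run
```

With these assumed sizes, 256 tokens yield 8 denoising steps. Stopping the loop early and decoding from the tokens generated so far corresponds to the preview behavior the summary describes.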
Peer Reviews
Decision: ICLR 2026 Poster
(1) Its sequential diffusion tokenizer naturally imposes a coarse-to-fine token ordering aligned with diffusion denoising steps, which is highly suitable for autoregressive modeling. (2) The framework supports streaming, consistent previews during generation at no extra cost by leveraging diffusion’s ability to jump-estimate final images from partial token sequences. (3) It enables zero-shot layout-controlled synthesis by fixing prefix tokens, all without finetuning. (4) D-AR benefits from t
(1) I find D-AR to be a very interesting framework that combines two well-established paradigms. However, the motivation for introducing this new hybrid paradigm is unclear—current diffusion-based or purely autoregressive (AR) approaches already perform very well in image generation. Why is this fused paradigm necessary? What specific problem does it solve? (2) D-AR cannot be easily integrated with state-of-the-art step distillation methods, such as one-step image generation. If step distillatio
1. The idea to project diffusion process into a vanilla autoregressive procedure is interesting. 2. The paper is well-written and easy to follow.
1. Although the idea is interesting, I am not entirely convinced that this approach truly demonstrates a synergistic effect between autoregressive modeling and diffusion. The autoregressive component still relies on discrete tokens, which are then used as conditions for the diffusion model. Consequently, the information loss inherent in the discrete tokens is propagated into the diffusion process. This is evidenced by the higher rFID of the diffusion tokenizer compared to its continuous VAE coun
1. Novel framework: Integrates diffusion and autoregressive modeling, providing a unified approach that preserves the strengths of both paradigms. 2. Sequential diffusion tokenizer: Generates coarse-to-fine token sequences corresponding to diffusion steps, naturally suited for AR modeling. 3. Preserves vanilla AR architecture: Works with standard decoder-only Transformers without modifications to causal masks, attention, or training schemes. Supports streaming pixel decoding, consistent previews
1. Dataset and task limitation: Experiments are only on ImageNet 256×256 class-conditional generation. High-resolution images, other datasets (COCO, LSUN, FFHQ), or tasks (image repair, image edit) are not tested. 2. Model complexity and inference speed: The sequential diffusion tokenizer adds 300M parameters, and AR generation requires predicting 256 tokens sequentially. Although KV caching and token grouping mitigate some overhead, more tokens still increase generation latency compared to smal
Taxonomy
Methods: Diffusion · LLaMA
