Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin

TL;DR
This paper introduces Distilled Decoding, a flow matching-based method that enables one- or two-step generation from pre-trained image autoregressive models, substantially speeding up image generation at the cost of a moderate FID increase.
Contribution
It presents the first method to achieve one-step generation in image autoregressive models using flow matching and distillation, without needing original training data.
Findings
Enables 6.3× speed-up for VAR with acceptable FID increase
Achieves 217.8× speed-up for LlamaGen with minimal quality loss
Reduces text-to-image generation from 256 to 2 steps with minimal FID increase
Abstract
Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step…
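The deterministic noise-to-output mapping described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation, and it assumes several things the abstract does not state: a linear interpolation path x_t = (1 - t) * noise + t * embedding, a small random codebook of token embeddings, and a fixed transition matrix standing in for the pre-trained AR model. All function names are hypothetical. For a discrete target over codebook entries, the flow-matching velocity under this path has a closed form, so each Gaussian noise vector can be pushed through an ODE to exactly one token, and chaining this token by token turns a noise sequence into a token sequence.

```python
import numpy as np

def posterior_weights(x, t, codebook, probs):
    # Assumed path: x_t = (1 - t) * noise + t * c_i, so x_t | token i ~ N(t * c_i, (1 - t)^2 I).
    sq_dist = np.sum((x - t * codebook) ** 2, axis=1)
    logits = np.log(np.clip(probs, 1e-12, None)) - sq_dist / (2.0 * (1.0 - t) ** 2)
    logits -= logits.max()  # stabilise the softmax
    w = np.exp(logits)
    return w / w.sum()

def velocity(x, t, codebook, probs):
    # Closed-form marginal velocity E[c - noise | x_t = x] = (E[c | x_t = x] - x) / (1 - t).
    w = posterior_weights(x, t, codebook, probs)
    return (w @ codebook - x) / (1.0 - t)

def noise_to_token(noise, codebook, probs, n_steps=100):
    # Euler integration from t = 0 up to the last step before t = 1, then snap to the nearest entry.
    x = np.array(noise, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt, codebook, probs)
    return int(np.argmin(np.sum((codebook - x) ** 2, axis=1)))

def toy_ar_probs(prefix, transition, start):
    # Hypothetical stand-in for a pre-trained AR model: next-token probabilities given the prefix.
    return start if not prefix else transition[prefix[-1]]

def noise_to_sequence(noises, codebook, transition, start):
    # One noise vector per position; each token is decoded conditioned on the tokens before it.
    tokens = []
    for eps in noises:
        probs = toy_ar_probs(tokens, transition, start)
        tokens.append(noise_to_token(eps, codebook, probs))
    return tokens

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(4, 2))            # 4 toy tokens with 2-D embeddings
    start = np.array([0.4, 0.3, 0.2, 0.1])
    transition = rng.dirichlet(np.ones(4), size=4)
    noises = rng.normal(size=(5, 2))
    print(noise_to_sequence(noises, codebook, transition, start))
```

Pairs of (noise sequence, token sequence) produced this way are the kind of data a one-step student network could be trained to reproduce. The toy version still calls the AR stand-in once per token; the distilled network is what would remove that sequential loop.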
Peer Reviews
Decision·ICLR 2025 Poster
The work presents a method that leverages deterministic flow matching to create training data (from an AR model) for a one-step image generation model. When trained on this data, this model is a distilled version of the original AR model. The idea of using deterministic flow matching to create the data is novel and seems like a good and innovative candidate idea to achieve this. The paper evaluates the claims on class-to-image generation on ImageNet and compares to simple baselines, achieving acce
A. It seems like the FID increases, although seemingly acceptable in numerical terms, give rise to blurry and artifact-ridden images, many of which don't even preserve the structure of the class they are trying to generate (monkeys without eyes etc.). Also, another thing that undermines my confidence is that no images are shown in the main paper and instead shown in the appendix. At least some examples are shown. B. One big problem from (A) is that, since the paper's main premise is to distill
- The authors propose a new framework that makes few-step AR distillation possible.
- The conversion from next-image-token prediction to next (set of) image token denoising is a very smart design. The method naturally combines the best of both worlds: diffusion models / flow matching and autoregressive image modeling.
- The performance gain over baseline few-step samplers is huge.
- The authors propose a novel framework enabling few-step autoregressive (AR) distillation.
- Converting next-image-token prediction to next-image-token denoising is an ingenious design choice, seamlessly integrating the strengths of both diffusion models and autoregressive image modeling.
- The performance improvement over baseline few-step samplers is substantial.
The paper is concisely written with a clear line of thought. The approach of constructing a continuous embedding space in AR and matching it to a Gaussian distribution is particularly interesting. This construction effectively combines the discrete cross-entropy model from AR-based methods with the probability distribution strategies that have proven successful in diffusion models, allowing the concept of consistency distillation from diffusion to be successfully applied within the AR field. The
The paper successfully applies the consistency distillation (CD) technique from diffusion models to the AR field. However, the current results still fall short compared to the pre-trained model.
