Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching

Enshu Liu; Xuefei Ning; Yu Wang; Zinan Lin

arXiv:2412.17153 · cs.CV · October 27, 2025

TL;DR

This paper introduces Distilled Decoding, a flow-matching-based method that enables one- or two-step generation from pre-trained autoregressive image models, significantly speeding up image generation (class-conditional and text-to-image) with minimal quality loss.

Contribution

It presents the first method to achieve one-step generation in image autoregressive models using flow matching and distillation, without needing original training data.
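
As a rough illustration of how flow matching can tie Gaussian noise to a discrete token deterministically, the sketch below works through a single token in one dimension. Everything specific in it is an assumption made for illustration and is not taken from the paper: the three-entry codebook `CODES`, the token probabilities `PROBS` (standing in for one conditional distribution of an AR model), the linear path x_t = (1 - t) x_0 + t x_1 with x_0 ~ N(0, 1), and the Euler solver inside the hypothetical helper `noise_to_token`. For this path the velocity has the closed form (E[x_1 | x_t] - x_t) / (1 - t), so the trajectory can be traced without any training.

```python
import math

# Hypothetical 1-D codebook and token probabilities (illustrative values only).
CODES = [-1.0, 0.0, 1.0]
PROBS = [0.25, 0.5, 0.25]


def posterior_mean(x, t, codes, probs):
    """E[x_1 | x_t = x] when x_t = (1 - t) * x_0 + t * x_1 and x_0 ~ N(0, 1).

    Given x_1 = c, x_t is Gaussian with mean t * c and std (1 - t), so the
    posterior over codes is a softmax of log-prior plus Gaussian log-likelihood.
    """
    sigma = 1.0 - t
    logits = [math.log(p) - (x - t * c) ** 2 / (2.0 * sigma ** 2)
              for c, p in zip(codes, probs)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, codes)) / total


def noise_to_token(x0, codes, probs, steps=200):
    """Follow the flow-matching ODE from noise x0 and return a token index."""
    x = x0
    for k in range(steps):
        t = k / steps  # t stays strictly below 1, so (1 - t) never hits zero
        velocity = (posterior_mean(x, t, codes, probs) - x) / (1.0 - t)
        x += velocity / steps  # one Euler step of size 1 / steps
    # Snap the end point of the trajectory to the nearest codebook entry.
    return min(range(len(codes)), key=lambda j: abs(codes[j] - x))


tokens = [noise_to_token(z, CODES, PROBS) for z in (-2.0, 0.0, 2.0)]
```

Because the ODE is deterministic, a given noise value always lands on the same token. Under these assumptions, (noise, token) pairs of this kind are the sort of regression target a one-step student network could be trained on.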

Findings

01 Enables a 6.3× speed-up for VAR with an acceptable FID increase

02 Achieves a 217.8× speed-up for LlamaGen with minimal quality loss

03 Reduces text-to-image generation from 256 steps to 2 with a minimal FID increase

Abstract

Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step…
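
The point about conditional dependencies can be checked on a two-token toy case. The sketch below is an illustration under assumed numbers: the `joint` table is hypothetical and not from the paper. It describes a model whose two binary tokens always agree, then builds the best a sampler could do if it drew both tokens at once and independently, namely the product of the per-position marginals.

```python
from itertools import product

# Hypothetical joint distribution over two binary tokens that always agree.
joint = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}


def marginal(dist, position):
    """Marginal distribution of the token at the given position."""
    probs = {}
    for tokens, p in dist.items():
        probs[tokens[position]] = probs.get(tokens[position], 0.0) + p
    return probs


def independent_approximation(dist):
    """Distribution obtained by sampling each position independently."""
    m0, m1 = marginal(dist, 0), marginal(dist, 1)
    return {(a, b): m0[a] * m1[b] for a, b in product(m0, m1)}


def total_variation(p, q):
    """Total variation distance between two distributions on the same keys."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


parallel = independent_approximation(joint)
tv = total_variation(joint, parallel)
```

In this toy case the independent sampler puts probability 0.25 on each of the four sequences, including the two that the joint distribution never produces, and the total variation distance between the two distributions is 0.5. Matching the marginals exactly is therefore not enough to reproduce the joint distribution.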

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The work presents a method that leverages deterministic flow matching to create training data (from an AR model) for a one-step image generation model. When trained on this data, the model is a distilled version of the original AR model. The idea of using deterministic flow matching to create the data is novel and seems like a good, innovative candidate for achieving this. The paper evaluates the claims on class-to-image generation on ImageNet and compares to simple baselines, achieving acce

Weaknesses

A. The FID increases, although seemingly acceptable in numerical terms, appear to give rise to blurry, artifact-ridden images, many of which do not even preserve the structure of the class they are trying to generate (monkeys without eyes, etc.). My confidence is further undermined because no images are shown in the main paper; they appear only in the appendix, although at least some examples are shown.
B. One big problem from (A) is that, since the paper's main premise is to distill

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The authors propose a new framework that makes few-step AR distillation possible.
- The conversion from next image token prediction to next (set of) image token denoising is a very smart design. The method naturally combines the best of both worlds: diffusion models / flow matching and autoregressive image modeling.
- The performance gain over baseline few-step samplers is huge.

Weaknesses

- The authors propose a novel framework enabling few-step autoregressive (AR) distillation.
- Converting next-image-token prediction to next-image-token denoising is an ingenious design choice, seamlessly integrating the strengths of both diffusion models and autoregressive image modeling.
- The performance improvement over baseline few-step samplers is substantial.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper is concisely written with a clear line of thought. The approach of constructing a continuous embedding space in AR and matching it to a Gaussian distribution is particularly interesting. This construction effectively combines the discrete cross-entropy model from AR-based methods with the probability distribution strategies that have proven successful in diffusion models, allowing the concept of consistency distillation from diffusion to be successfully applied within the AR field. The

Weaknesses

The paper successfully applies the consistency distillation (CD) technique from diffusion models to the AR field. However, the current results still fall short compared to the pre-trained model.

Code & Models

Repositories

imagination-research/distilled-decoding
Official

Models

🤗
microsoft/distilled_decoding
model · ♡ 5
