Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
Feihong Yan, Peiru Wang, Yao Zhu, Kaiyu Pang, Qingyan Wei, Huiqi Li, Linfeng Zhang

TL;DR
This paper introduces a two-stage sampling strategy called Generation then Reconstruction (GtR) for masked autoregressive models, significantly accelerating image generation while maintaining quality by decomposing generation into structure and detail stages.
Contribution
The paper proposes GtR, a novel hierarchical sampling method that accelerates masked autoregressive models without retraining, and introduces Frequency-Weighted Token Selection to focus computation on semantically rich image details.
Findings
Achieves a 3.72x speedup on MAR-H with comparable quality.
Maintains low FID and high IS scores during acceleration.
Outperforms existing acceleration methods across models and tasks.
Abstract
Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models because they can generate tokens in parallel, yet their acceleration potential remains constrained by the complexity of modeling spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation, which establishes global semantic scaffolding, followed by detail reconstruction, which efficiently completes the remaining tokens. Assuming that it is more difficult to create an image from scratch than to complete an image from a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining generation quality by computing the generation stage slowly.…
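The two-stage split described in the abstract can be sketched as a token-unmasking schedule. This is a minimal illustration, not the authors' implementation: it assumes the structure stage covers a checkerboard subset of the token grid (the reviews below mention a checkerboard-like pattern), and the names `checkerboard_mask` and `two_stage_schedule`, the step counts, the random order within each stage, and the even split across steps are all hypothetical choices. Frequency-Weighted Token Selection is not modeled here.

```python
import numpy as np


def checkerboard_mask(h, w):
    """Boolean grid that is True where (row + col) is even."""
    rows, cols = np.indices((h, w))
    return (rows + cols) % 2 == 0


def two_stage_schedule(h, w, structure_steps, detail_steps, seed=0):
    """Return a list of flat token-index arrays, one per sampling step.

    Stage 1 (structure): checkerboard positions, spread over many small steps.
    Stage 2 (detail): remaining positions, filled in a few large steps.
    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    mask = checkerboard_mask(h, w).ravel()
    structure = rng.permutation(np.flatnonzero(mask))
    detail = rng.permutation(np.flatnonzero(~mask))
    steps = np.array_split(structure, structure_steps)
    steps += np.array_split(detail, detail_steps)
    return steps


if __name__ == "__main__":
    schedule = two_stage_schedule(16, 16, structure_steps=16, detail_steps=2)
    for i, idx in enumerate(schedule):
        stage = "structure" if i < 16 else "detail"
        print(f"step {i:2d} ({stage}): {len(idx)} tokens")
```

With a 16x16 grid, 16 structure steps, and 2 detail steps, each structure step fills 8 tokens and each detail step fills 64, so the slow stage handles half of the tokens while the fast stage completes the other half in two passes.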
Peer Reviews
Decision: ICLR 2026 Poster
- The authors present a comprehensive review of the literature on visual generation, identify its shortcomings, and propose a sound method to address these limitations.
- The two-stage generation pipeline is a fairly novel contribution that draws on ideas from previous work on token ordering.
- The inference-time speedups, measured as real-time latency, are promising.
- The authors perform a detailed set of ablations that explain their design choices.
- The key hypothesis of this method is that the checkerboard-like pattern would lead to good global context and therefore more coherent generation. However, it is possible for images to have structures that extend to larger regions of the image, thus still potentially producing inconsistencies in generation. A deeper discussion on this point would be helpful.
- Adding on to the previous point, since the checkerboard pattern followed by reconstruction is meant to produce more coherent generations,
1. The proposed sampling strategy is effective and intuitively easy to understand.
2. The paper is clearly written.
3. The performance on ImageNet-256x256 is impressive.
1. Limited Generalizability: The method's generalizability is a significant concern. While it performs well on ImageNet, its performance drops considerably on the T2I task (as shown in Table 2). This suggests that the proposed sampling strategy might be an overfitted, heavily-tuned solution for the ImageNet dataset and may not generalize well to other datasets or tasks. Moreover, the proposed strategy's applicability seems limited, as it is designed specifically for MAR.
2. The novelty of t
- **Simple, practical idea with strong engineering value.** Splitting generation into “creation” (structure) and “reconstruction” (detail) is intuitive and requires no model retraining, which makes it immediately usable in deployed MAR systems.
- **Clear empirical speed/quality benefit.** The reported ≈3.72× acceleration on MAR-H while maintaining generation quality is compelling evidence of practical value.
- **Ablations and comparisons.** The paper contains ablations across sampling orders and
- **Limited analysis of failure modes and generality.** The paper demonstrates gains on ImageNet and a text→image generator, but it is unclear how GtR behaves with different tokenizers (VQ variants, continuous latents), non-square layouts, or with very detail-dense images where global structure is weak. The paper makes the empirical claim that reconstruction is “much easier than creation” but does not deeply analyze where this breaks down.
- **Limited theoretical justification for checkerboard s
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
