StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
Senmao Li, Kai Wang, Salman Khan, Fahad Shahbaz Khan, Jian Yang, Yaxing Wang

TL;DR
StageVAR is a stage-aware acceleration framework for visual autoregressive models that significantly speeds up image generation by selectively pruning less critical later stages without retraining.
Contribution
It introduces a stage-aware acceleration method that exploits the differing importance of early and late stages in VAR models, enabling efficient image generation without additional training.
Findings
Achieves up to 3.4x speedup in image generation.
Maintains high-quality outputs with minimal performance drop.
Outperforms existing acceleration methods across benchmarks.
Abstract
Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, they rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR…
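The abstract's observation that cost concentrates in the large-scale steps can be illustrated with a rough token count. The sketch below is only an illustration: the scale schedule `SCALE_SIDES` is a hypothetical example, not the schedule of any model named on this page, and token count is used as a crude proxy for compute (attention cost grows faster than token count).

```python
# Hypothetical side lengths of the token map at each scale (an assumption,
# not taken from the paper or from any specific VAR model).
SCALE_SIDES = [1, 2, 4, 6, 8, 12, 16, 20, 24, 32, 40, 48, 64]


def tokens_per_scale(sides):
    """Number of tokens predicted at each scale (a side x side token map)."""
    return [s * s for s in sides]


def late_scale_share(sides, num_late):
    """Fraction of all predicted tokens that belong to the last `num_late` scales."""
    counts = tokens_per_scale(sides)
    return sum(counts[-num_late:]) / sum(counts)


if __name__ == "__main__":
    for k in (1, 3, 5):
        share = late_scale_share(SCALE_SIDES, k)
        print(f"last {k} scale(s): {share:.1%} of all tokens")
```

Under this assumed schedule the last few scales account for most of the predicted tokens, which is why pruning or approximating late steps is where an acceleration method has the most to gain.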
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper presents new findings and observations regarding the generation structure of VAR.
- The proposed methods are intuitive, simple to implement, and plug-and-play applicable to existing VAR models.
- Experimental results show StageVAR can accelerate the VAR generation process by ~3x without significant degradation in image quality.
- **Novelty**: (i) The claim that VAR's image generation is divided into three phases is not highly interesting, given the sequential frequency generation nature of scale-wise methods. In fact, this was already reported in papers like FastVAR. (ii) At least 50% of StageVAR's acceleration effect comes from CFG elimination in the fidelity refinement phase. However, this is closer to a simple heuristic than an academic discovery.
- **Paper Structure**: The paper's structure is confusing. For ex
1. The paper is easy to read and the figure is easy to follow.
2. The idea of using random projection for low-rank features is interesting.
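The reviews mention random projection for low-rank features and a predetermined rank r, but this page does not describe how StageVAR applies them. As a generic illustration only, the sketch below shows the standard randomized range-finder idea: project a feature matrix onto a few random directions, orthonormalize, and reconstruct. The function name `random_projection_low_rank`, the matrix sizes, and the synthetic data are all assumptions made for this example and are not the authors' implementation.

```python
import numpy as np


def random_projection_low_rank(x, rank, seed=0):
    """Approximate `x` (n x d) by a rank-`rank` factorization using random projection.

    Returns (q, b) with q of shape (n, rank) and b of shape (rank, d),
    such that x is approximately q @ b.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((x.shape[1], rank))  # random Gaussian test matrix
    y = x @ omega                                    # sketch of x's column space
    q, _ = np.linalg.qr(y)                           # orthonormal basis, shape (n, rank)
    b = q.T @ x                                      # coordinates of x in that basis
    return q, b


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-in for a feature map: 256 tokens x 64 channels, exactly rank 8.
    features = rng.standard_normal((256, 8)) @ rng.standard_normal((8, 64))
    q, b = random_projection_low_rank(features, rank=8)
    rel_err = np.linalg.norm(features - q @ b) / np.linalg.norm(features)
    print(f"relative reconstruction error: {rel_err:.2e}")
```

When the chosen rank matches the true rank of the features, the reconstruction is essentially exact; when real features are only approximately low-rank, the error depends on how quickly their singular values decay. That dependence is the reason a reviewer below asks whether a rank r fixed from one benchmark generalizes.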
1. The paper’s motivation is weak because the semantic irrelevance and low-rank observations appear decoupled, lacking a unified empirical justification.
2. The novelty is incremental, as the three-stage decomposition mainly refines FastVAR’s two-stage insight, with the “semantic stage” contribution limited to disabling text prompts after a certain scale.
3. The predetermined rank r is derived from statistics on a specific benchmark and may not generalize well to other data distributions or pro
1. The paper provides a comprehensive and detailed analysis of the generation process of the VAR model. It uses the observation to identify computational redundancy, proposing a new acceleration method that makes sense.
2. The paper's presentation is very clear and well-organized.
3. Comprehensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method in accelerating two large-scale VAR-based text-to-image models.
4. The ablation study in the paper analy
1. Does the proposed method still demonstrate a superior efficiency-quality trade-off on the larger Infinity-8B model?
2. Why does StageVAR achieve a 3.4x speedup on Infinity-2B but only a 1.7x speedup on STAR? What causes this significant difference?
3. The observations in the paper seem to have some similarities to those in CoDe [1].

[1] "Collaborative decoding makes visual auto-regressive modeling efficient." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Multimodal Machine Learning Applications
