StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
Keli Liu, Zhendong Wang, Wengang Zhou, Houqiang Li

TL;DR
StepVAR is a training-free token pruning method for visual autoregressive (VAR) models. It accelerates inference by scoring tokens on both structural and textural importance, reducing computational cost while maintaining generation quality.
Contribution
We introduce a training-free token pruning framework that combines high-pass filtering and PCA to preserve both local textures and global structure in VAR models.
Findings
Achieves significant inference speedup in VAR models.
Maintains high-quality visual generation comparable to full models.
Outperforms existing acceleration methods across multiple datasets.
Abstract
Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve…
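The idea of scoring tokens by both texture and structure can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 4-neighbour Laplacian used as the high-pass filter, the weighted-sum combination rule, and the `keep_ratio` and `alpha` parameters are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def texture_score(feat):
    """High-pass response per token (assumed filter: 4-neighbour Laplacian).

    feat: array of shape (H, W, C), one feature vector per token on the grid.
    """
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    lap = (4.0 * feat
           - padded[:-2, 1:-1] - padded[2:, 1:-1]   # up, down neighbours
           - padded[1:-1, :-2] - padded[1:-1, 2:])  # left, right neighbours
    return np.linalg.norm(lap, axis=-1)  # (H, W)

def structure_score(feat, n_components=1):
    """Magnitude of each token's projection onto the top principal components."""
    h, w, c = feat.shape
    x = feat.reshape(-1, c)
    x = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centred token features.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:n_components].T
    return np.linalg.norm(proj, axis=-1).reshape(h, w)

def normalize(score):
    """Rescale scores to [0, 1]; an all-equal map becomes all zeros."""
    span = score.max() - score.min()
    return (score - score.min()) / span if span > 0 else np.zeros_like(score)

def select_tokens(feat, keep_ratio=0.5, alpha=0.5):
    """Return a boolean (H, W) mask of tokens to keep.

    The weighted sum controlled by alpha is a hypothetical combination rule.
    """
    score = (alpha * normalize(texture_score(feat))
             + (1.0 - alpha) * normalize(structure_score(feat)))
    flat = score.ravel()
    n_keep = max(1, int(round(keep_ratio * flat.size)))
    keep = np.argpartition(flat, -n_keep)[-n_keep:]  # indices of top scores
    mask = np.zeros(flat.shape, dtype=bool)
    mask[keep] = True
    return mask.reshape(score.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 16, 8))
    mask = select_tokens(tokens, keep_ratio=0.25)
    print(mask.sum(), "of", mask.size, "tokens kept")
```

In a pruned forward pass, only the tokens where the mask is true would be processed at a late scale. How the skipped positions are filled in afterwards is left open here because the source does not describe it.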
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
