SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping
Jiajun Li, Yue Ma, Xinyu Zhang, Qingyan Wei, Songhua Liu, Linfeng Zhang

TL;DR
SkipVAR introduces a frequency-aware, adaptive acceleration framework for visual autoregressive models, significantly reducing inference time while maintaining high image quality by selectively skipping steps and bypassing redundant branches.
Contribution
The paper proposes a novel, sample-adaptive acceleration method for VAR models that leverages frequency information to dynamically optimize inference efficiency.
Findings
Achieves up to a 2.62x speedup on the GenEval benchmark.
Maintains image quality, with an average SSIM above 0.88.
Reduces inference latency by choosing acceleration strategies adaptively per sample.
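For readers unfamiliar with the SSIM figure quoted above, the following is a minimal sketch of how a structural similarity score between a reference image and an accelerated output could be computed. It is a simplification under stated assumptions: it uses whole-image statistics in a single window, whereas standard SSIM averages the same quantity over local windows, and the function name `global_ssim` and the synthetic images are hypothetical, not taken from the paper.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM from whole-image statistics.

    Simplification: standard SSIM averages this quantity over local windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilising constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


# Synthetic stand-ins for a baseline image and an accelerated output.
rng = np.random.default_rng(0)
reference = rng.random((64, 64))
accelerated = np.clip(reference + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)

score = global_ssim(reference, accelerated)
print(f"SSIM vs. reference: {score:.3f}")
```

A score of 1.0 means the two images are identical; values closer to 1.0 indicate that the accelerated output stays closer to the reference.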
Abstract
Recent studies on Visual Autoregressive (VAR) models have highlighted that high-frequency components, or later steps, in the generation process contribute disproportionately to inference latency. However, the underlying computational redundancy involved in these steps has yet to be thoroughly investigated. In this paper, we conduct an in-depth analysis of the VAR inference process and identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy. To address step redundancy, we propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency. For unconditional branch redundancy, we observe that the information gap between the conditional and unconditional branches is minimal. Leveraging this insight, we introduce unconditional branch replacement, a technique that bypasses the unconditional branch to…
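The intuition behind bypassing the unconditional branch can be illustrated with standard classifier-free guidance (CFG). The sketch below is not the paper's implementation: it assumes the unconditional logits are simply replaced by the conditional ones, and the helper names (`cfg_logits`, `l1_gap`) and the synthetic logits are hypothetical. It shows that the error introduced by the replacement is exactly `(scale - 1)` times the L1 gap between the two branches, so a small gap implies a small change in the guided output.

```python
import numpy as np


def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance combination of two branches."""
    return uncond + scale * (cond - uncond)


def l1_gap(cond, uncond):
    """Mean absolute difference between the two branches."""
    return float(np.mean(np.abs(cond - uncond)))


rng = np.random.default_rng(0)
cond = rng.standard_normal(8)  # synthetic conditional logits
scale = 3.0                    # assumed guidance scale

for label, noise in [("large gap", 1.0), ("small gap", 0.01)]:
    # Synthetic unconditional logits at a chosen distance from `cond`.
    uncond = cond + noise * rng.standard_normal(8)
    full = cfg_logits(cond, uncond, scale)    # two forward passes
    replaced = cfg_logits(cond, cond, scale)  # one pass; equals `cond`
    err = float(np.mean(np.abs(full - replaced)))
    print(f"{label}: L1 gap = {l1_gap(cond, uncond):.4f}, replacement error = {err:.4f}")
```

Skipping the unconditional forward pass roughly halves the model evaluations for that step, which is why a small branch gap translates directly into a cheap acceleration opportunity.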
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper presents novel and significant observations on the important problem of VAR inference latency, identifying sample-dependent redundancy.
* Based on these observations, the paper proposes a natural and reasonable method that adaptively selects acceleration strategies.
* The various design choices, such as the choice of decision models, are backed by a persuasive rationale.
1. **Limited Generalizability**
   * The paper claims to accelerate "Visual Autoregressive Modeling," yet its entire experimental validation rests exclusively on a single model family: the Infinity-2B/8B family. This is a significant limitation. The core assumptions driving the method, such as the specific patterns of high-frequency redundancy or the convergence of L1 loss between conditional and unconditional branches, may be unique to the Infinity architecture rather than fundamental properties
- Discovers and addresses (with minimal training) the issue of late-stage high-frequency redundancy and redundant CFG passes in VAR generation.
- Two intuitive signals (Sobel and FFT) enable per-sample decisions and preserve high-frequency detail better than token pruning/merging at similar speedups.
- Reported speedups exclude VAE and post-processing, so end-to-end latency improvements in production are likely smaller; end-to-end measurements are needed.
- Heavy reliance on classifier-free guidance, since a major gain comes from dropping the unconditional branch; applicability to non-CFG or single-branch decoders is unclear.
- In Table 3, the strongest ~2.6× result does not use the decision model, which obscures the efficacy of the decision model.
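The Sobel and FFT signals mentioned in this review can be sketched as two simple image statistics. The code below is an illustration only: the paper's exact feature definitions are not reproduced on this page, so `edge_diff` and `high_freq_ratio`, the `cutoff` value, and the synthetic images are all hypothetical stand-ins. One function measures how much Sobel edge magnitude changes between two images; the other measures the fraction of spectral energy above a radial frequency cutoff.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation built from shifted slices."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out


def sobel_magnitude(img):
    """Gradient magnitude from horizontal and vertical Sobel responses."""
    return np.hypot(filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y))


def edge_diff(img_a, img_b):
    """Mean absolute change in Sobel edge magnitude between two images."""
    return float(np.mean(np.abs(sobel_magnitude(img_a) - sobel_magnitude(img_b))))


def high_freq_ratio(img, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` (in cycles per pixel)."""
    power = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    return float(power[radius > cutoff].sum() / power.sum())


# Synthetic test images: a smooth sinusoid and the same image with added noise.
rng = np.random.default_rng(0)
x = np.arange(64)
smooth = np.tile(np.sin(2 * np.pi * 2 * x / 64), (64, 1))  # 2 cycles across the width
noisy = smooth + 0.5 * rng.standard_normal((64, 64))

print(f"high-frequency ratio (smooth): {high_freq_ratio(smooth):.4f}")
print(f"high-frequency ratio (noisy):  {high_freq_ratio(noisy):.4f}")
print(f"edge difference (smooth vs. noisy): {edge_diff(smooth, noisy):.4f}")
```

A per-sample decision rule could compare statistics like these against thresholds, for example skipping later steps only when both values are low; any such rule here would be an assumption rather than the paper's decision model.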
1. Sample-adaptive decisions: Unlike prior acceleration methods that use a fixed global ratio or policy (e.g., FastVAR), this work chooses between step-skipping and branch replacement per-sample and per-scale, which makes the approach much more practical. The results on frequency-sensitive vs. frequency-robust subsets clearly demonstrate the benefit of this adaptive decision process.
2. Simple and interpretable features: The combination of HF_Diff (local edge stability) and HF_Ratio (global hig
I'm not an expert in this area, but based on my understanding, I have the following concerns and questions.
1. Sensitivity to decision step and threshold: According to the paper (including the appendix), the default setup uses N = 10 with SSIM thresholds {0.88, 0.86, 0.84}, but the paper doesn’t really explore how performance changes with different step counts (which may vary by model or resolution) or different thresholds. I notice that there’s a brief comparison between SSIM-based and LPIPS-based criteria
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
