TL;DR
DepthVAR is a training-free framework for visual autoregressive (VAR) modeling that allocates computational depth per token instead of applying a fixed depth at every position, accelerating inference by 2.3× to 3.1× with minimal quality loss.
Contribution
A training-free method that replaces binary hard pruning of tokens with adaptive per-token computational depth, yielding a better compute-quality trade-off than existing hard-pruning techniques.
Findings
Achieves 2.3× to 3.1× acceleration with minimal quality loss.
Outperforms existing hard-pruning methods in compute-performance trade-offs.
Demonstrates effectiveness across high-resolution image generation tasks.
Abstract
Visual Autoregressive (VAR) modeling inefficiently applies a fixed computational depth to each position when generating high-resolution images. While existing methods accelerate inference by pruning tokens using frequency maps, their binary hard-pruning approach is fundamentally limited and fails to improve quality even with better frequency estimation. Observing that VAR models possess significant depth redundancy, we propose a paradigm shift from pruning entire tokens to adaptively allocating per-token computational depth. To this end, we introduce DepthVAR, a training-free framework that dynamically allocates computation. It integrates an adaptive depth scheduler, which assigns computational depth via a cyclic rotated schedule for balanced, non-static refinement, with a dynamic inference process that translates these depths into layer-major masks, selectively applies transformer…
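The abstract's two ingredients, a cyclically rotated depth schedule and its translation into layer-major masks, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the list of depth levels, and the rule that token `i` at step `step` takes level `(i + step) % k` are hypothetical and are not taken from the paper, whose scheduler may be driven by quite different signals. The sketch only shows the general idea that each token gets its own depth, that the assignment rotates so no token is permanently under-refined, and that a per-layer mask then decides which tokens a given transformer layer processes.

```python
# Toy illustration only: names and the rotation rule are hypothetical,
# not DepthVAR's actual scheduler.

def cyclic_depths(num_tokens, depth_levels, step):
    """Assign each token a depth from depth_levels, shifting the pattern by `step`.

    Rotating by `step` means a token that received a shallow depth at one
    step receives a different depth at the next one.
    """
    k = len(depth_levels)
    return [depth_levels[(i + step) % k] for i in range(num_tokens)]


def layer_major_masks(depths, num_layers):
    """Return masks[l][i], True when token i is still processed at layer l.

    A token with depth d is processed by layers 0 .. d-1 and skipped afterwards.
    """
    return [[d > layer for d in depths] for layer in range(num_layers)]


def active_fraction(masks):
    """Fraction of (layer, token) pairs that are actually computed."""
    total = sum(len(row) for row in masks)
    computed = sum(sum(row) for row in masks)
    return computed / total


if __name__ == "__main__":
    num_layers = 4
    levels = [4, 2, 3]  # hypothetical depth levels, each <= num_layers
    for step in range(2):
        depths = cyclic_depths(6, levels, step)
        masks = layer_major_masks(depths, num_layers)
        print(step, depths, active_fraction(masks))
```

In this toy setting the masks are indexed by layer first ("layer-major"), so a layer can gather only its active tokens, process them, and scatter the results back, while skipped tokens pass through unchanged. How the real method selects depths and handles skipped tokens is not specified in the truncated abstract above.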
