TL;DR
SparVAR is a training-free framework that accelerates visual autoregressive modeling by exploiting sparsity in attention mechanisms, significantly reducing inference time while preserving image quality.
Contribution
It introduces a novel sparse attention prediction method and an efficient kernel implementation for large-scale VAR, achieving over 5x speed-up without sacrificing details.
Findings
Reduces the generation time of 8B models to 1 second for high-resolution images.
Achieves 1.57x speed-up over FlashAttention while maintaining image quality.
Up to 2.28x acceleration when combined with scale-skipping strategies.
Abstract
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a…
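To make the quartic growth concrete, here is a minimal sketch that counts query-key pairs per scale when every scale attends to all tokens of the current and earlier scales. The function name `attention_cost_per_scale` and the doubling scale schedule are illustrative assumptions, not taken from the paper; real VAR models use their own schedules.

```python
def attention_cost_per_scale(side_lengths):
    """Count query-key pairs per scale under dense cross-scale attention.

    Assumes a scale with side length s holds s * s tokens, and that each
    token attends to every token of the current and all earlier scales.
    """
    costs = []
    seen_tokens = 0
    for side in side_lengths:
        tokens = side * side          # queries at this scale
        seen_tokens += tokens         # keys: all history plus this scale
        costs.append(tokens * seen_tokens)
    return costs


# Hypothetical schedule: side length doubles at each step.
sides = [1, 2, 4, 8, 16, 32, 64]
costs = attention_cost_per_scale(sides)

# Growth of the last scale relative to the previous one, and the share
# of total attention cost spent on the last scale.
last_ratio = costs[-1] / costs[-2]
last_share = costs[-1] / sum(costs)
print(f"cost ratio of last two scales: {last_ratio:.2f}")
print(f"share of cost in last scale:   {last_share:.2%}")
```

Under this assumed schedule, doubling the side length multiplies the per-scale cost by roughly 16 (that is, 2 to the 4th power), and the final scale accounts for most of the total attention cost. This is why sparsifying attention at the late, high-resolution scales can recover most of the latency.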
