Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

Kunjun Li; Zigeng Chen; Cheng-Yen Yang; Jenq-Neng Hwang

arXiv:2505.19602·cs.LG·May 27, 2025

TL;DR

This paper introduces ScaleKV, a KV cache compression framework for visual autoregressive (VAR) models that cuts KV cache memory to 10% of the original during inference while preserving pixel-level fidelity.

Contribution

ScaleKV leverages layer-specific attention patterns to optimize cache management, enabling efficient multi-scale inference in VAR models.

Findings

01 Reduces KV cache memory to 10% of original

02 Maintains pixel-level fidelity in text-to-image generation

03 Improves efficiency and scalability of VAR models

Abstract

Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache…
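
The cache growth described in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper or its repository: the scale schedule, layer count, head count, and head dimension are hypothetical values chosen only for illustration, and the helper names `kv_cache_tokens` and `kv_cache_bytes` are invented here. It assumes each scale of side s contributes s × s tokens and that all earlier scales stay cached.

```python
from itertools import accumulate


def kv_cache_tokens(scale_sides):
    """Cumulative number of cached tokens after each scale.

    Assumes a scale with side length s adds s * s tokens and that
    tokens from all previous scales remain in the cache.
    """
    per_scale = [s * s for s in scale_sides]
    return list(accumulate(per_scale))


def kv_cache_bytes(num_tokens, num_layers, num_heads, head_dim, bytes_per_value=2):
    """Rough KV cache size in bytes for an uncompressed cache.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage (an assumption).
    """
    return 2 * num_tokens * num_layers * num_heads * head_dim * bytes_per_value


# Hypothetical coarse-to-fine schedule of token-map side lengths.
scale_sides = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
cumulative = kv_cache_tokens(scale_sides)

# Hypothetical model dimensions, chosen only to make the numbers concrete.
full_bytes = kv_cache_bytes(cumulative[-1], num_layers=30, num_heads=30, head_dim=64)

# A 10% budget, matching the compression ratio quoted in the findings.
budget_bytes = full_bytes // 10

for side, total in zip(scale_sides, cumulative):
    print(f"scale {side:>2}x{side:<2} -> {total} cached tokens")
print(f"full cache: {full_bytes} bytes, 10% budget: {budget_bytes} bytes")
```

The final scales dominate the total, which is why a per-layer cache budget, as opposed to caching every token at every layer, can plausibly save most of the memory.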

Code & Models

Repositories

stargazerx0/scalekv (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Data Compression Techniques

Methods: Softmax · Attention Is All You Need · Focus