SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Zihao Wang, Bin Cui, Shaoduo Gan

TL;DR
SqueezeAttention introduces a layer-wise and sequence-wise optimization method for KV-cache in LLM inference, significantly reducing memory usage and increasing throughput by dynamically allocating cache budgets based on layer importance.
Contribution
The paper proposes a novel layer-wise importance measurement and on-the-fly KV-cache allocation method, improving inference efficiency over existing uniform approaches.
Findings
Achieves 30% to 70% memory reduction.
Up to 2.2x throughput improvement.
Effective across various LLMs and benchmarks.
Abstract
Optimizing the Key-Value (KV) cache of the Large Language Model (LLM) has been considered critical to saving the cost of inference. Most of the existing KV-cache compression algorithms attempted to sparsify the sequence of tokens by taking advantage of the different importance of tokens. However, most of these methods treat all layers equally, allocating the same KV budget to each layer. This approach is suboptimal, as some layers may be less sensitive to input tokens yet still receive the same budget as others. In this work, we found that by identifying the importance of attention layers, we could optimize the KV-cache jointly from two dimensions, i.e., sequence-wise and layer-wise. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention to precisely optimize the allocation of KV-cache budget among layers on-the-fly and then incorporate three…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Data Storage Technologies · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
