SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Zihao Wang; Bin Cui; Shaoduo Gan

arXiv:2404.04793 · cs.LG · October 11, 2024 · 1 citation

TL;DR

SqueezeAttention optimizes the KV-cache in LLM inference along two dimensions, layer-wise and sequence-wise. By dynamically allocating cache budgets according to layer importance, it reduces memory usage by 30% to 70% and increases throughput by up to 2.2x.

Contribution

The paper proposes a novel layer-wise importance measurement and on-the-fly KV-cache allocation method, improving inference efficiency over existing uniform approaches.

Findings

01 Achieves 30% to 70% memory reduction.

02 Up to 2.2x throughput improvement.

03 Effective across various LLMs and benchmarks.

Abstract

Optimizing the Key-Value (KV) cache of Large Language Models (LLMs) is considered critical to reducing the cost of inference. Most existing KV-cache compression algorithms sparsify the sequence of tokens by exploiting differences in token importance. However, most of these methods treat all layers equally, allocating the same KV budget to each layer. This approach is suboptimal, as some layers may be less sensitive to input tokens yet still receive the same budget as others. In this work, we find that by identifying the importance of attention layers, we can optimize the KV-cache jointly along two dimensions, i.e., sequence-wise and layer-wise. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention to precisely optimize the allocation of KV-cache budget among layers on-the-fly and then incorporate three…
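
The contrast between a uniform per-layer budget and a layer-wise budget can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract shown here does not spell out how importance is measured or how budgets are assigned, so the proportional split below is a stand-in chosen only for illustration. The function name `allocate_layer_budgets`, the importance scores, and the `min_per_layer` floor are all hypothetical.

```python
def allocate_layer_budgets(importance, total_budget, min_per_layer=1):
    """Split a total KV budget (in cached tokens) across layers.

    Illustrative assumption: each layer receives a guaranteed floor, and the
    remainder is divided in proportion to a per-layer importance score.
    """
    n = len(importance)
    if n == 0:
        raise ValueError("need at least one layer")
    if any(score < 0 for score in importance):
        raise ValueError("importance scores must be non-negative")
    if total_budget < n * min_per_layer:
        raise ValueError("total budget cannot cover the per-layer floor")

    remaining = total_budget - n * min_per_layer
    total_score = sum(importance)
    if total_score == 0:
        # No layer stands out, so fall back to a uniform split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * score / total_score for score in importance]

    # Round down first, then hand the leftover tokens to the layers with
    # the largest fractional share so the budgets sum to total_budget.
    budgets = [min_per_layer + int(share) for share in shares]
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                         reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical scores for a 4-layer model; a higher score means the layer
# is assumed to be more sensitive to dropping cached tokens.
scores = [0.9, 0.5, 0.3, 0.3]
total = 4 * 512  # same total as a uniform budget of 512 tokens per layer

uniform = [total // len(scores)] * len(scores)
layerwise = allocate_layer_budgets(scores, total, min_per_layer=64)

print("uniform:   ", uniform)
print("layer-wise:", layerwise)
```

Both allocations use the same total number of cached tokens; the layer-wise variant only shifts budget from layers assumed to be less sensitive toward layers assumed to be more sensitive. A sequence-wise compression method would then decide which tokens each layer keeps within its budget.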

Code & Models

Repositories

hetailang/squeezeattention (official)

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques