BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference

Ahmed Burak Gulhan; Krishna Teja Chitty-Venkata; Murali Emani; Mahmut Kandemir; Venkatram Vishwanath

arXiv:2502.13176·cs.LG·February 25, 2025

Open Access

TL;DR

BaKlaVa is a novel method that allocates memory for KV-caches in LLM inference based on their importance, significantly reducing memory usage while maintaining performance.

Contribution

We propose BaKlaVa, a profiling-based approach for optimal memory allocation of KV-caches across attention heads in LLMs, improving efficiency over uniform strategies.
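To make the contrast with uniform allocation concrete, the sketch below splits a fixed token budget across KV-caches in proportion to an importance score. It is a hypothetical illustration, not the authors' algorithm: the function name `allocate_budgets`, the `min_tokens` floor, and the proportional rule are all assumptions, and the importance scores are taken as given rather than computed by profiling.

```python
from fractions import Fraction


def allocate_budgets(importance, total_tokens, min_tokens=1):
    """Split total_tokens across KV-caches in proportion to importance.

    Hypothetical sketch: every cache first receives min_tokens, and the
    remaining tokens are shared proportionally to the importance scores.
    """
    n = len(importance)
    if n == 0:
        return []
    if any(s < 0 for s in importance):
        raise ValueError("importance scores must be non-negative")
    if total_tokens < n * min_tokens:
        raise ValueError("total_tokens is too small for the per-cache floor")

    spare = total_tokens - n * min_tokens
    # Exact rational arithmetic avoids floating-point rounding surprises.
    scores = [Fraction(s) for s in importance]
    total_score = sum(scores)
    if total_score == 0:
        shares = [Fraction(spare, n)] * n  # no signal: fall back to uniform
    else:
        shares = [spare * s / total_score for s in scores]

    # Round each share down, then hand the leftover tokens to the caches
    # with the largest fractional remainders (largest-remainder rounding).
    budgets = [min_tokens + int(share) for share in shares]
    leftover = total_tokens - sum(budgets)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Three caches sharing 40 token slots, the last one twice as important.
print(allocate_budgets([1, 1, 2], total_tokens=40, min_tokens=0))  # [10, 10, 20]
```

A uniform strategy corresponds to passing equal scores for every cache. The floor is a design choice in this sketch so that no cache is starved entirely; whether BaKlaVa uses such a floor is not stated here.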

Findings

01

Achieves up to a 70% compression ratio while maintaining baseline performance.

02

Delivers up to a tenfold accuracy improvement at high compression levels.

03

Validated on LLaMA-3-8B and Qwen2.5-7B models.

Abstract

In Large Language Model (LLM) inference, Key-Value (KV) caches (KV-caches) are essential for reducing time complexity. However, they result in a linear increase in GPU memory as the context length grows. While recent work explores KV-cache eviction and compression policies to reduce memory usage, these policies often assume uniform KV-caches across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to…
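The linear growth mentioned in the abstract follows from the cache storing one key vector and one value vector per token, per layer, per KV head. The sketch below works through that arithmetic. It assumes the publicly documented LLaMA-3-8B shape (32 layers, 8 KV heads, head dimension 128) and 16-bit storage; none of these figures are stated in the abstract, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # The leading factor 2 accounts for storing both a key and a value.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed LLaMA-3-8B shape; not taken from the abstract.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)
print(per_token // 1024, "KiB per token")      # 128 KiB per token

for seq_len in (8192, 16384, 32768):
    gib = kv_cache_bytes(seq_len, **shape) / 2**30
    print(seq_len, "tokens:", gib, "GiB")       # doubles with context length
```

Under these assumptions an 8192-token context needs 1 GiB of cache per sequence, and each doubling of the context doubles that figure, which is the cost that eviction and compression policies aim to reduce.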

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization

Methods: Softmax · Attention Is All You Need