BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference
Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath

TL;DR
BaKlaVa is a novel method that allocates memory for KV-caches in LLM inference based on their importance, significantly reducing memory usage while maintaining performance.
Contribution
We propose BaKlaVa, a profiling-based approach for optimal memory allocation of KV-caches across attention heads in LLMs, improving efficiency over uniform strategies.
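To make the idea of non-uniform allocation concrete, here is a minimal sketch of splitting a fixed cache budget across attention heads in proportion to per-head importance scores. It is an illustration only, not the paper's algorithm: this summary does not say how BaKlaVa estimates importance or turns it into budgets, so the proportional rule, the `min_budget` floor, and the name `allocate_budgets` are all assumptions, and the importance scores are taken as given.

```python
def allocate_budgets(importance, total_budget, min_budget=1):
    """Split `total_budget` cache slots across heads in proportion to
    `importance` (hypothetical helper; proportional rule is an assumption)."""
    n = len(importance)
    if n == 0:
        raise ValueError("importance must not be empty")
    if any(score < 0 for score in importance):
        raise ValueError("importance scores must be non-negative")
    if total_budget < n * min_budget:
        raise ValueError("total_budget is too small for the per-head minimum")

    # Every head gets the floor; the remainder is shared by importance.
    spare = total_budget - n * min_budget
    total = sum(importance)
    if total == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * score / total for score in importance]

    budgets = [min_budget + int(share) for share in shares]

    # Hand out the slots lost to truncation, largest fractional part first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets
```

A uniform strategy corresponds to passing equal scores for every head. Skewed scores shift slots toward the heads judged more important while the total stays fixed.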
Findings
Achieved up to a 70% compression ratio while maintaining baseline performance.
Demonstrated up to a tenfold accuracy improvement at high compression levels.
Validated on LLaMA-3-8B and Qwen2.5-7B models.
Abstract
In Large Language Model (LLM) inference, Key-Value caches (KV-caches) are essential for reducing time complexity. However, their GPU memory footprint grows linearly with the context length. Recent work explores KV-cache eviction and compression policies to reduce memory usage, but these approaches often treat KV-caches uniformly across all attention heads, leading to suboptimal performance. We introduce BaKlaVa, a method to allocate optimal memory for individual KV-caches across the model by estimating the importance of each KV-cache. Our empirical analysis demonstrates that not all KV-caches are equally critical for LLM performance. Using a one-time profiling approach, BaKlaVa assigns optimal memory budgets to each KV-cache. We evaluated our method on the LLaMA-3-8B and Qwen2.5-7B models, achieving up to a 70% compression ratio while keeping baseline performance and delivering up to…
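The linear growth in memory mentioned in the abstract can be checked with a short calculation. The sketch below uses the standard size estimate for an uncompressed KV-cache: two tensors (keys and values) per layer, each holding one vector of `head_dim` elements per KV head per token. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions chosen for illustration and do not come from the paper; check them against the configuration of the model you use.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate the size of an uncompressed KV-cache in bytes.

    The leading factor of 2 counts the key tensor and the value tensor.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed example configuration (illustrative, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```

Because every factor other than `seq_len` is fixed by the model, doubling the context doubles the cache. This is why eviction and compression policies are applied per cache rather than through architecture changes.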
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Network Packet Processing and Optimization
Methods: Softmax · Attention Is All You Need
