FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
Nazmul Takbir, Hamidreza Alikhani, Nikil Dutt, Sangeetha Abdu Jyothi

TL;DR
FlexiCache is a hierarchical KV-cache management system that exploits the temporal stability of attention heads in LLMs to significantly reduce GPU memory usage and improve serving efficiency without sacrificing accuracy.
Contribution
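FlexiCache introduces a method for classifying attention (KV) heads as stable or unstable, according to how consistently each head attends to the same critical tokens over time, and manages each head's KV cache accordingly, enabling more efficient long-context LLM serving.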
Findings
Reduces GPU memory footprint by up to 70% for long-context requests.
Improves offline serving throughput by 1.38-1.55x.
Lowers online token latency by 1.6-2.1x.
Abstract
Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially during long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads. Some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads…
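The notion of temporal stability described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not state the stability metric, the threshold, or the granularity used. The sketch below assumes a token-level top-k set per decode step, Jaccard overlap between consecutive steps as the stability score, and a hypothetical threshold of 0.8. The function names `top_k_tokens`, `temporal_stability`, and `classify_heads` are illustrative only.

```python
from typing import Dict, Sequence, Set


def top_k_tokens(scores: Sequence[float], k: int) -> Set[int]:
    """Return the indices of the k highest attention scores (the 'critical' tokens)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def temporal_stability(steps: Sequence[Sequence[float]], k: int) -> float:
    """Mean Jaccard overlap of the top-k token sets of consecutive decode steps.

    Assumption: Jaccard overlap is used here only for illustration; the
    source does not specify how stability is measured.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(steps) < 2:
        raise ValueError("need at least two decode steps")
    sets = [top_k_tokens(s, k) for s in steps]
    overlaps = []
    for a, b in zip(sets, sets[1:]):
        union = a | b
        # Two empty sets are treated as identical.
        overlaps.append(len(a & b) / len(union) if union else 1.0)
    return sum(overlaps) / len(overlaps)


def classify_heads(
    per_head_steps: Dict[int, Sequence[Sequence[float]]],
    k: int,
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> Dict[int, str]:
    """Label each head 'stable' or 'unstable' by its mean top-k overlap."""
    return {
        head: "stable" if temporal_stability(steps, k) >= threshold else "unstable"
        for head, steps in per_head_steps.items()
    }


# Toy attention scores over 4 tokens for 3 decode steps (made-up numbers).
example = {
    0: [[0.5, 0.3, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1], [0.45, 0.35, 0.1, 0.1]],
    1: [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.5, 0.3], [0.5, 0.1, 0.1, 0.3]],
}
labels = classify_heads(example, k=2)
print(labels)
```

In this toy input, head 0 keeps the same two top tokens at every step, while head 1 shifts its top tokens between steps, so the two heads end up with different labels. A real system would likely work on KV-cache pages rather than individual tokens, as the abstract's mention of pages suggests.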
