CoKV: Optimizing KV Cache Allocation via Cooperative Game
Qiheng Sun, Hongwei Zhang, Haocheng Xia, Jiayao Zhang, Jinfei Liu, Kui Ren

TL;DR
CoKV uses a cooperative-game formulation to optimize how the key-value (KV) cache budget is allocated across attention heads in large language models, improving both resource efficiency and model performance.
Contribution
This paper presents CoKV, a novel method that models the cooperation among attention heads as a cooperative game in order to allocate the KV cache budget among those heads more effectively in LLMs.
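For reference, the standard way to measure one player's contribution in a cooperative game is the Shapley value, shown below with attention heads as players. This summary does not state which solution concept or estimator CoKV actually uses, so treat this as background rather than the paper's formula. Here N is the set of heads, and v(S) is an assumed characteristic function giving model performance when only the heads in coalition S retain their KV cache.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
  \Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr)
```

Each term is head i's marginal contribution to a coalition S, weighted by the probability that S is exactly the set of heads preceding i in a uniformly random ordering. Scoring a head in isolation corresponds to looking only at the S = ∅ term.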
Findings
Achieves state-of-the-art results on the LongBench benchmark.
Effectively allocates cache resources among attention heads.
Improves model inference efficiency.
Abstract
Large language models (LLMs) have achieved remarkable success across many aspects of human life. However, a major challenge in deploying these models is the substantial memory required to store key-value (KV) pairs, which imposes significant resource demands. Recent research has focused on KV cache budget allocation, with several approaches distributing the budget at the head level according to the estimated importance of individual attention heads. These methods, however, assess each head's importance independently and overlook the heads' cooperative contributions within the model, so their estimates may deviate from the heads' true impact on model performance. To address this limitation, we propose CoKV, a novel method that models the cooperation between heads during model inference as a cooperative game. By evaluating the contribution of each head within the cooperative game, CoKV…
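The gap between independent and cooperative head importance can be illustrated with a small toy sketch. It is not CoKV's estimator: the four heads, the characteristic function `value` (a made-up score obtained when only a coalition of heads keeps its KV cache), and the helpers `shapley_values`, `independent_scores`, and `allocate_budget` are all hypothetical. Exact Shapley values are computed by enumerating every coalition, which is only feasible for a handful of heads; realistic head counts would need a sampling-based approximation. Splitting the budget in proportion to the scores is also an assumption made for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy setup: 4 attention heads.
HEADS = [0, 1, 2, 3]


def value(coalition: frozenset) -> float:
    """Toy characteristic function (an assumption, not CoKV's metric)."""
    score = 0.10 * len(coalition)   # every head helps a little on its own
    if {0, 1} <= coalition:         # heads 0 and 1 only pay off together
        score += 0.40
    if 2 in coalition:              # head 2 is individually useful
        score += 0.20
    return score


def shapley_values(players, v):
    """Exact Shapley values by enumerating all coalitions without each player."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi


def independent_scores(players, v):
    """Score each head in isolation: v({i}) - v(empty set)."""
    empty = frozenset()
    return {i: v(frozenset({i})) - v(empty) for i in players}


def allocate_budget(scores, total_budget):
    """Split an integer budget proportionally to non-negative scores
    (largest-remainder rounding so the allocations sum to total_budget)."""
    clipped = {h: max(s, 0.0) for h, s in scores.items()}
    mass = sum(clipped.values())
    if mass == 0:                   # no signal: fall back to an equal split
        clipped = {h: 1.0 for h in scores}
        mass = float(len(scores))
    raw = {h: total_budget * s / mass for h, s in clipped.items()}
    alloc = {h: math.floor(r) for h, r in raw.items()}
    leftover = total_budget - sum(alloc.values())
    # Hand the remaining slots to the heads with the largest fractional parts.
    for h in sorted(raw, key=lambda h: raw[h] - alloc[h], reverse=True)[:leftover]:
        alloc[h] += 1
    return alloc


if __name__ == "__main__":
    shap = shapley_values(HEADS, value)
    indep = independent_scores(HEADS, value)
    print("independent:", indep)
    print("shapley:    ", shap)
    print("budget (independent):", allocate_budget(indep, 1000))
    print("budget (shapley):    ", allocate_budget(shap, 1000))
```

In this toy setting, heads 0 and 1 look weak when scored alone (0.1 each) but each earns 0.3 once their joint effect is credited, so cooperative scoring shifts cache budget toward them.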
Taxonomy
Topics: Caching and Content Delivery · Advanced Wireless Network Optimization · Cooperative Communication and Network Coding
Methods: Softmax · Attention Is All You Need
