CoKV: Optimizing KV Cache Allocation via Cooperative Game

Qiheng Sun; Hongwei Zhang; Haocheng Xia; Jiayao Zhang; Jinfei Liu; Kui Ren

arXiv:2502.17501 · cs.LG · February 26, 2025

TL;DR

CoKV introduces a cooperative game-based approach to optimize key-value cache allocation in large language models, significantly improving resource efficiency and model performance.

Contribution

This paper presents CoKV, a novel method modeling head cooperation in cache allocation as a cooperative game, enhancing resource management in LLMs.

Findings

01 Achieves state-of-the-art results on the LongBench benchmark.

02 Allocates KV cache budget across attention heads according to their cooperative contribution.

03 Improves model inference efficiency.

Abstract

Large language models (LLMs) have achieved remarkable success across various aspects of human life. However, one of the major challenges in deploying these models is the substantial memory required to store key-value (KV) pairs, which imposes significant resource demands. Recent research has focused on KV cache budget allocation, with several approaches proposing head-level budget distribution by evaluating the importance of individual attention heads. These methods, however, assess the importance of heads independently, overlooking their cooperative contributions within the model, which may result in a deviation from their true impact on model performance. In light of this limitation, we propose CoKV, a novel method that models the cooperation between heads in model inference as a cooperative game. By evaluating the contribution of each head within the cooperative game, CoKV…
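
The abstract's distinction between scoring heads independently and scoring them inside a cooperative game can be illustrated with a small sketch. The abstract as shown here is truncated and does not name the solution concept CoKV uses, so the example below assumes the Shapley value, a standard way to attribute contribution in a cooperative game. The utility function `toy_utility`, the three-head setup, and the proportional budget rule in `allocate_budget` are all hypothetical and are not taken from the paper or its repository.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for the characteristic function `value`."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_utility(active_heads):
    """Hypothetical stand-in for task accuracy when only `active_heads` keep their cache."""
    score = 0.0
    if 0 in active_heads:
        score += 0.2  # head 0 helps on its own
    if 1 in active_heads and 2 in active_heads:
        score += 0.6  # heads 1 and 2 only help when both are present
    return score


def allocate_budget(scores, total_budget):
    """Assumed rule: split the cache budget in proportion to each head's score."""
    total = sum(scores.values())
    return {h: round(total_budget * s / total) for h, s in scores.items()}


heads = [0, 1, 2]
# Independent scoring: evaluate each head alone.
solo = {h: toy_utility(frozenset({h})) for h in heads}
# Cooperative scoring: average marginal contribution over all coalitions.
phi = shapley_values(heads, toy_utility)
budgets = allocate_budget(phi, total_budget=800)

print("solo scores:   ", solo)
print("Shapley values:", phi)
print("budgets:       ", budgets)
```

In this toy game, heads 1 and 2 score zero when evaluated alone yet receive most of the credit once their joint contribution is counted, which is the kind of deviation the abstract attributes to independent head scoring. Exact enumeration is exponential in the number of heads, so a model with hundreds of heads would need a sampling-based approximation instead.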

Code & Models

Repositories

nawei1010/CoKV (PyTorch, official)

Taxonomy

Topics: Caching and Content Delivery · Advanced Wireless Network Optimization · Cooperative Communication and Network Coding

Methods: Softmax · Attention Is All You Need