Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads
Xingyang He, Jie Liu, Shaowei Chen

TL;DR
Task-KV optimizes KV cache allocation in large language models by semantically differentiating attention heads, significantly improving inference efficiency and effectiveness across various tasks.
Contribution
It introduces a novel semantic differentiation approach for dynamic KV cache allocation, enhancing task adaptability and memory efficiency in LLM inference.
Findings
Outperforms existing cache optimization methods on multiple benchmarks.
Effectively preserves semantic information with differentiated cache budgets.
Improves inference speed and memory usage across different model architectures.
Abstract
The KV cache is a widely used acceleration technique for large language model (LLM) inference. However, its memory requirement grows rapidly with input length. Previous studies have reduced the size of the KV cache either by removing the same number of unimportant tokens for all attention heads or by allocating differentiated KV cache budgets to pre-identified attention heads. However, because the importance of attention heads varies across tasks, pre-identified attention heads fail to adapt effectively to various downstream tasks. To address this issue, we propose Task-KV, a method that leverages the semantic differentiation of attention heads to allocate differentiated KV cache budgets across various tasks. We demonstrate that attention heads far from the semantic center (called heterogeneous heads) make a significant contribution to task outputs and semantic understanding.…
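The idea in the abstract, giving heads that lie far from the semantic center a different cache budget from the rest, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract as shown here is truncated and does not say how a head's semantic vector, the semantic center, or the distance is computed. The sketch assumes each head is summarized by one vector, the center is the element-wise mean, distance is Euclidean, a fixed fraction of the farthest heads is treated as heterogeneous, and those heads receive the larger budget. The function name `allocate_head_budgets` and all its parameters are hypothetical.

```python
import math


def allocate_head_budgets(head_vectors, full_budget, reduced_budget,
                          hetero_fraction=0.25):
    """Return one KV cache budget (in tokens) per attention head.

    head_vectors: list of equal-length float lists, one hypothetical
    "semantic vector" per head (how to obtain it is an assumption here).
    """
    n_heads = len(head_vectors)
    dim = len(head_vectors[0])

    # Assumption: the semantic center is the element-wise mean of all heads.
    center = [sum(v[d] for v in head_vectors) / n_heads for d in range(dim)]

    # Assumption: Euclidean distance measures how far a head is from the center.
    distances = [math.dist(v, center) for v in head_vectors]

    # Assumption: the farthest `hetero_fraction` of heads are "heterogeneous".
    n_hetero = max(1, round(hetero_fraction * n_heads))
    ranked = sorted(range(n_heads), key=lambda h: distances[h], reverse=True)
    hetero = set(ranked[:n_hetero])

    # Heterogeneous heads keep the larger budget; the rest are compressed.
    return [full_budget if h in hetero else reduced_budget
            for h in range(n_heads)]


# Toy example: three similar heads and one outlier.
vectors = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0], [5.0, 5.0]]
budgets = allocate_head_budgets(vectors, full_budget=512, reduced_budget=64)
print(budgets)
```

In the toy example the fourth head is farthest from the mean, so it is the only one assigned the larger budget. Because the vectors would be recomputed from the current input, the split can change from task to task, which is the contrast the abstract draws with pre-identified heads.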
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
Methods: Softmax · Attention Is All You Need · Attention Sinks
