Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads

Xingyang He; Jie Liu; Shaowei Chen

arXiv:2501.15113·cs.CL·January 28, 2025

Open Access

TL;DR

Task-KV optimizes KV cache allocation in large language models by semantically differentiating attention heads, significantly improving inference efficiency and effectiveness across various tasks.

Contribution

It introduces a novel semantic differentiation approach for dynamic KV cache allocation, enhancing task adaptability and memory efficiency in LLM inference.

Findings

01

Outperforms existing cache optimization methods on multiple benchmarks.

02

Effectively preserves semantic information with differentiated cache budgets.

03

Improves inference speed and memory usage across different model architectures.

Abstract

The KV cache is a widely used acceleration technique for large language model (LLM) inference. However, its memory requirement grows rapidly with input length. Previous studies have reduced the size of the KV cache either by removing the same number of unimportant tokens for all attention heads or by allocating differentiated KV cache budgets to pre-identified attention heads. However, because the importance of attention heads varies across tasks, the pre-identified attention heads fail to adapt effectively to various downstream tasks. To address this issue, we propose Task-KV, a method that leverages the semantic differentiation of attention heads to allocate differentiated KV cache budgets across various tasks. We demonstrate that attention heads far from the semantic center (called heterogeneous heads) make a significant contribution to task outputs and semantic understanding.…
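
The idea of giving heads far from the semantic center a larger cache budget can be illustrated with a minimal sketch. The abstract does not say how per-head semantic vectors or the semantic center are computed, nor how budgets are split, so everything below is an assumption made for illustration: the function name `allocate_head_budgets`, the use of the mean vector as the center, Euclidean distance, and the `hetero_fraction` and `hetero_share` parameters are all hypothetical and may differ from what the paper does.

```python
import math


def allocate_head_budgets(head_vectors, total_budget,
                          hetero_fraction=0.25, hetero_share=0.5):
    """Split a KV cache token budget across attention heads.

    Illustrative only. Assumptions (not taken from the paper):
      * each head is summarized by one semantic vector,
      * the semantic center is the mean of those vectors,
      * the farthest `hetero_fraction` of heads are "heterogeneous",
      * heterogeneous heads share `hetero_share` of the total budget.
    Returns a list with one integer token budget per head.
    """
    n = len(head_vectors)
    dim = len(head_vectors[0])

    # Semantic center: component-wise mean over all heads (assumption).
    center = [sum(v[d] for v in head_vectors) / n for d in range(dim)]

    # Distance of every head from the center.
    distances = [math.dist(v, center) for v in head_vectors]

    # Heads farthest from the center are treated as heterogeneous.
    n_hetero = max(1, round(n * hetero_fraction))
    order = sorted(range(n), key=lambda i: distances[i], reverse=True)
    hetero = set(order[:n_hetero])
    n_other = n - n_hetero

    # Heterogeneous heads split their share; the rest split what remains.
    hetero_total = total_budget if n_other == 0 else int(total_budget * hetero_share)
    per_hetero = hetero_total // n_hetero
    per_other = (total_budget - hetero_total) // n_other if n_other else 0

    return [per_hetero if i in hetero else per_other for i in range(n)]


# Toy input: three heads cluster together, the fourth is far from them.
budgets = allocate_head_budgets(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]],
    total_budget=4000,
)
```

In this toy input the fourth head lies farthest from the mean, so it receives the larger budget while the three clustered heads split the remainder evenly. A real system would derive the head vectors from model activations and would also need a token selection rule within each head's budget, neither of which is specified in the text above.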

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies

Methods: Softmax · Attention Is All You Need · Attention Sinks