Effectively Compress KV Heads for LLM
Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu

TL;DR
This paper introduces a novel method for compressing Key-Value heads in large language models by exploiting low-rank properties, significantly reducing memory usage while maintaining model performance.
Contribution
The paper proposes a KV head compression approach that exploits the low-rank structure of KV caches to optimize the MHA-to-GQA transformation, and that remains compatible with rotary position embeddings (RoPE).
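To make the low-rank idea concrete, here is a minimal sketch of how the key (or value) projections of a group of heads could be merged into one smaller projection with a truncated SVD of their cache. It is an illustration under stated assumptions, not the authors' exact procedure: the function name `compress_group`, the calibration activations `X`, and all shapes are hypothetical, and RoPE is ignored here even though the paper addresses it.

```python
import numpy as np

def compress_group(W, X, rank):
    """Hypothetical helper: merge a group of heads via truncated SVD.

    W:    (d_model, g * d) stacked projection weights of g heads in one group
    X:    (n_tokens, d_model) calibration activations (an assumption of this sketch)
    rank: target width of the merged head
    """
    cache = X @ W                                  # (n_tokens, g * d) cache of the group
    _, s, Vt = np.linalg.svd(cache, full_matrices=False)
    P = Vt[:rank].T                                # (g * d, rank), orthonormal columns
    W_small = W @ P                                # (d_model, rank) merged projection
    return W_small, P, s

rng = np.random.default_rng(0)
d_model, g, d = 64, 4, 16

# Toy weights built to have rank d, so the group cache is exactly low-rank.
W = rng.standard_normal((d_model, d)) @ rng.standard_normal((d, g * d))
X = rng.standard_normal((256, d_model))

W_small, P, s = compress_group(W, X, rank=d)

# Reconstruct the full group cache from the compressed one and measure the error.
full = X @ W
approx = (X @ W_small) @ P.T
rel_err = np.linalg.norm(approx - full) / np.linalg.norm(full)
print(W_small.shape, rel_err)
```

In this toy setting four heads of width 16 collapse into one head of width 16, which corresponds to removing 75% of the heads in that group. Real caches are only approximately low-rank, so the reconstruction error would not be negligible as it is here.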
Findings
Compresses up to 75% of KV heads
Maintains performance comparable to original models
Reduces memory footprint for resource-constrained deployment
Abstract
The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate redundant calculations for previous tokens. Nevertheless, as context lengths and batch sizes increase, the linear expansion in memory footprint of KV caches becomes a key bottleneck of LLM deployment, which decreases generation speeds significantly. To mitigate this issue, previous techniques like multi-query attention (MQA) and grouped-query attention (GQA) have been developed, in order to reduce KV heads to accelerate inference with comparable accuracy to multi-head attention (MHA). Despite their effectiveness, existing strategies for compressing MHA often overlook the intrinsic properties of the KV caches. In this work, we explore the low-rank…
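The linear growth described in the abstract can be checked with simple arithmetic. The sketch below assumes a hypothetical model configuration (32 layers, head dimension 128, 16-bit cache entries); these numbers are illustrative and are not taken from the paper.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache in bytes.

    The leading factor 2 counts keys and values; every other factor enters
    linearly, so doubling batch size or context length doubles the cache.
    """
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical configuration, chosen only for illustration.
common = dict(batch=8, seq_len=4096, n_layers=32, head_dim=128)

mha = kv_cache_bytes(n_kv_heads=32, **common)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **common)   # 75% of KV heads removed

print(mha / 2**30, gqa / 2**30)  # sizes in GiB
```

Under these assumptions, cutting the KV heads from 32 to 8 shrinks the cache to a quarter of its original size, which is the kind of saving that compressing 75% of KV heads targets.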
Taxonomy
Topics: Metallurgy and Material Forming · Advancements in Photolithography Techniques
Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Softmax · Multi-Query Attention · Linear Layer · Grouped-query attention · Multi-Head Attention
