Effectively Compress KV Heads for LLM

Hao Yu; Zelan Yang; Shen Li; Yong Li; Jianxin Wu

arXiv:2406.07056·cs.CL·June 12, 2024

TL;DR

This paper introduces a method for compressing the Key-Value (KV) heads of large language models by exploiting the low-rank structure of KV caches. It removes up to 75% of KV heads, shrinking the KV cache accordingly, while keeping performance comparable to the original models.

Contribution

The paper proposes a KV head compression approach that optimizes the transformation from multi-head attention (MHA) to grouped-query attention (GQA) and remains compatible with rotary position embeddings, so that fewer KV heads need to be cached at inference time.
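To make the target of the MHA-to-GQA transformation concrete, here is a minimal sketch of how query heads share KV heads under grouped-query attention. It shows only the head-index mapping of the target architecture, not the paper's compression procedure; the function names `kv_head_for_query_head` and `grouping` and the head counts are hypothetical, chosen for illustration.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the KV head that a given query head reads from.

    Assumes the usual GQA layout, where consecutive query heads form a group
    and each group shares one KV head.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def grouping(num_query_heads, num_kv_heads):
    """List the KV head index used by every query head."""
    return [
        kv_head_for_query_head(q, num_query_heads, num_kv_heads)
        for q in range(num_query_heads)
    ]


# Hypothetical head counts, for illustration only.
print(grouping(8, 8))  # MHA: every query head has its own KV head
print(grouping(8, 2))  # GQA: four query heads share each KV head
print(grouping(8, 1))  # MQA: all query heads share a single KV head
```

Going from 8 KV heads to 2 in this toy setting removes 75% of the KV heads, which matches the upper end of the compression ratio reported in the findings.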

Findings

01 Compresses up to 75% of KV heads

02 Maintains performance comparable to original models

03 Reduces memory footprint for resource-constrained deployment

Abstract

The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that uses Key-Value (KV) caches to avoid recomputing keys and values for previous tokens. However, as context lengths and batch sizes increase, the memory footprint of the KV caches grows linearly and becomes a key bottleneck in LLM deployment, significantly slowing generation. To mitigate this issue, techniques such as multi-query attention (MQA) and grouped-query attention (GQA) have been developed to reduce the number of KV heads, accelerating inference with accuracy comparable to multi-head attention (MHA). Despite their effectiveness, existing strategies for compressing MHA often overlook the intrinsic properties of the KV caches. In this work, we explore the low-rank…
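The linear growth of the KV cache described in the abstract can be checked with simple arithmetic. The sketch below is a rough estimate under stated assumptions: the model shape (32 layers, 32 query heads, head dimension 128, 16-bit values) and the helper name `kv_cache_bytes` are hypothetical and not taken from the paper.

```python
def kv_cache_bytes(num_layers, batch_size, seq_len, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for all layers.

    The leading factor of 2 accounts for storing both a key tensor and a
    value tensor. bytes_per_elem=2 assumes 16-bit values.
    """
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


# Hypothetical model shape (an assumption, not taken from the paper).
cfg = dict(num_layers=32, batch_size=1, seq_len=4096, head_dim=128)

mha = kv_cache_bytes(num_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **cfg)   # 75% of KV heads removed

print(f"MHA cache: {mha / 2**30:.2f} GiB")
print(f"GQA cache: {gqa / 2**30:.2f} GiB")
print(f"Reduction: {1 - gqa / mha:.0%}")
```

Because every factor enters as a plain product, doubling the batch size or the context length doubles the cache, while cutting the number of KV heads shrinks it by the same fraction.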

Taxonomy

Topics: Metallurgy and Material Forming · Advancements in Photolithography Techniques

Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Softmax · Multi-Query Attention · Linear Layer · Grouped-Query Attention · Multi-Head Attention