Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

William Brandon; Mayank Mishra; Aniruddha Nrusimha; Rameswar Panda; Jonathan Ragan-Kelley

arXiv:2405.12981·cs.LG·May 22, 2024·2 cites

TL;DR

This paper introduces Cross-Layer Attention (CLA), which shrinks the transformer KV cache beyond what Multi-Query Attention (MQA) achieves by also sharing key and value heads across adjacent layers. The smaller cache allows longer sequences and larger batch sizes with minimal accuracy loss.

Contribution

The paper proposes Cross-Layer Attention (CLA), an attention design in which adjacent layers share key and value heads, so fewer layers contribute entries to the KV cache and its memory footprint shrinks.

Findings

01 CLA reduces KV cache size by 2x compared to MQA.

02 CLA maintains nearly the same accuracy as unmodified MQA.

03 Experiments show improved memory/accuracy tradeoffs for large language models.
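The 2x figure follows from simple cache-size arithmetic, which the sketch below makes concrete. The function name `kv_cache_bytes`, the parameter `cla_sharing_factor`, and the model configuration are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cla_sharing_factor=1):
    """Estimate the bytes needed to cache keys and values for one batch.

    cla_sharing_factor is the number of adjacent layers that share one
    set of cached K/V activations (1 means no cross-layer sharing).
    """
    if n_layers % cla_sharing_factor != 0:
        raise ValueError("n_layers must be divisible by cla_sharing_factor")
    # Only one layer per sharing group stores K/V in the cache.
    kv_layers = n_layers // cla_sharing_factor
    # The leading 2 accounts for storing both keys and values.
    return (2 * kv_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration (assumption, not from the paper).
cfg = dict(n_layers=32, head_dim=128, seq_len=4096, batch_size=8)

mha = kv_cache_bytes(n_kv_heads=32, **cfg)                       # one K/V head per query head
mqa = kv_cache_bytes(n_kv_heads=1, **cfg)                        # one K/V head per layer
mqa_cla2 = kv_cache_bytes(n_kv_heads=1, cla_sharing_factor=2, **cfg)  # plus pairwise layer sharing

for name, size in [("MHA", mha), ("MQA", mqa), ("MQA + CLA (pairs)", mqa_cla2)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumptions, sharing K/V between pairs of layers halves the number of layers that write to the cache, which is where the 2x reduction relative to MQA comes from.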

Abstract

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention…
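The sharing idea described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a two-layer group in which the first layer computes and caches a single K/V head (as in MQA) and the second layer reuses it with its own query projection. All names (`mqa_attention`, `kv_cache`, the weight matrices) and dimensions are hypothetical, and normalization and MLP blocks are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mqa_attention(x, w_q, k, v):
    """Causal multi-query attention: many query heads, one shared K/V head.

    x: (seq, d_model); w_q: (n_heads, d_model, head_dim); k, v: (seq, head_dim)
    """
    seq = x.shape[0]
    q = np.einsum("sd,hde->hse", x, w_q)            # (n_heads, seq, head_dim)
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (n_heads, seq, seq)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)      # mask out future positions
    return softmax(scores) @ v                      # (n_heads, seq, head_dim)


rng = np.random.default_rng(0)
seq, d_model, n_heads, head_dim = 5, 16, 4, 8       # illustrative sizes only

x1 = rng.standard_normal((seq, d_model))
w_q1 = rng.standard_normal((n_heads, d_model, head_dim))
w_q2 = rng.standard_normal((n_heads, d_model, head_dim))
w_k = rng.standard_normal((d_model, head_dim))
w_v = rng.standard_normal((d_model, head_dim))
w_o = rng.standard_normal((n_heads * head_dim, d_model))

kv_cache = {}

# Layer 1 owns the K/V projections and is the only layer that writes to the cache.
k, v = x1 @ w_k, x1 @ w_v
kv_cache[0] = (k, v)
out1 = mqa_attention(x1, w_q1, k, v)

# Residual update feeding layer 2 (norm and MLP omitted for brevity).
x2 = x1 + out1.transpose(1, 0, 2).reshape(seq, -1) @ w_o

# Layer 2 has its own queries but no K/V projections; it reads layer 1's cache entry.
out2 = mqa_attention(x2, w_q2, *kv_cache[0])

print(len(kv_cache), out1.shape, out2.shape)
```

Two layers run attention here, but only one K/V pair is stored, which is the source of the memory saving in this sketch.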

Code & Models

Videos

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention · slideslive

Taxonomy

Topics: Low-power high-performance VLSI design · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization

Methods: Dense Connections · Softmax · Attention Is All You Need · Feedforward Network · Multi-Query Attention · Grouped-query attention