MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji

arXiv:2406.09297 · cs.LG · October 16, 2024 · 1 citation

TL;DR

This paper introduces MLKV, a novel method for sharing Key-Value caches across transformer layers, significantly reducing memory usage during autoregressive inference with minimal performance impact.

Contribution

MLKV extends KV sharing across layers, surpassing previous methods like MQA and GQA in memory efficiency for transformer inference.
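To make cross-layer sharing concrete, here is a minimal sketch of one way layers could be mapped to shared KV heads. It assumes consecutive layers are grouped evenly and that every layer in a group reuses the KV heads of the first layer in that group. The function name `kv_layer_for` and the grouping rule are illustrative assumptions, not taken from the authors' code.

```python
def kv_layer_for(layer_idx: int, num_layers: int, num_kv_layers: int) -> int:
    """Return the index of the layer whose KV heads `layer_idx` reuses.

    Assumption (hypothetical grouping rule): layers are split into
    `num_kv_layers` equal, consecutive groups, and only the first layer
    of each group computes and caches keys and values.
    """
    if num_layers % num_kv_layers != 0:
        raise ValueError("num_layers must be divisible by num_kv_layers")
    group_size = num_layers // num_kv_layers
    return (layer_idx // group_size) * group_size


# Example: 12 layers sharing the KV heads of only 2 layers.
mapping = [kv_layer_for(i, num_layers=12, num_kv_layers=2) for i in range(12)]
print(mapping)
```

Under this assumed rule, only the layers that appear as values in `mapping` need a KV cache; every other layer attends using the keys and values cached by its group's first layer.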

Findings

01. Reduces KV cache size by a factor of up to 6x compared to MQA

02. Maintains near-original performance levels on NLP benchmarks

03. Indicates potential for memory-efficient deployment of transformer models at scale

Abstract

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size by a factor of up to 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
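The memory argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a Pythia-160M-like shape (12 layers, 12 attention heads, head dimension 64), fp16 storage, and an arbitrary batch size and sequence length. The configuration labels (`GQA-4`, `MLKV-2`) and the helper `kv_cache_bytes` are hypothetical names chosen for illustration, and the numbers are not results from the paper.

```python
def kv_cache_bytes(total_kv_heads: int, head_dim: int, seq_len: int,
                   batch_size: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    `total_kv_heads` is the number of KV heads summed over all layers.
    """
    return 2 * batch_size * seq_len * total_kv_heads * head_dim * bytes_per_elem


# Assumed model shape (Pythia-160M-like); not taken from the page above.
NUM_LAYERS = 12
NUM_HEADS = 12
HEAD_DIM = 64

# Total KV heads across the whole model for each sharing scheme.
total_kv_heads = {
    "MHA": NUM_LAYERS * NUM_HEADS,  # one KV head per query head, per layer
    "GQA-4": NUM_LAYERS * 4,        # 4 KV head groups per layer (hypothetical)
    "MQA": NUM_LAYERS * 1,          # one KV head per layer
    "MLKV-2": 2,                    # 2 KV heads shared across all layers (hypothetical)
}

sizes = {
    name: kv_cache_bytes(n, HEAD_DIM, seq_len=2048, batch_size=8)
    for name, n in total_kv_heads.items()
}

for name, size in sizes.items():
    print(f"{name:>7}: {size / 2**20:.1f} MiB")

# MQA keeps 12 KV heads in total, this MLKV setting keeps 2.
ratio = sizes["MQA"] / sizes["MLKV-2"]
print(f"MQA / MLKV-2 cache ratio: {ratio}")
```

Because cache size is linear in the total number of KV heads, going from one KV head per layer (12 in total) to 2 KV heads for the whole model gives the 6x reduction relative to MQA under these assumptions.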

Code & Models

Repositories

zaydzuhri/pythia-mlkv (PyTorch, official)

Taxonomy

Topics: Advanced Data Compression Techniques · Error Correcting Code Techniques · Digital Filter Design and Implementation

Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Softmax · Multi-Query Attention · Grouped-query attention