MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji

TL;DR
This paper introduces MLKV, a novel method for sharing Key-Value caches across transformer layers, significantly reducing memory usage during autoregressive inference with minimal performance impact.
Contribution
MLKV extends KV sharing across layers, surpassing previous methods like MQA and GQA in memory efficiency for transformer inference.
Findings
Reduces KV cache size by up to 6x compared to MQA
Maintains near-original performance levels on NLP benchmarks
Demonstrates effective memory savings in large-scale transformer deployment
Abstract
Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6 compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
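To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' code. It assumes the usual accounting in which the cache stores one key and one value vector per token for every KV head in every layer that owns KV heads. The function name `kv_cache_bytes`, the fp16 element size, the sequence length, and the specific GQA and MLKV configurations are illustrative assumptions; the 12-layer, 12-head, 64-dimensional-head shape corresponds to Pythia-160M.

```python
def kv_cache_bytes(kv_layers, kv_heads_per_layer, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    kv_layers: number of layers that own KV heads (equals the total layer
               count for MHA/GQA/MQA, fewer when layers share KV heads).
    bytes_per_elem: 2 assumes fp16/bf16 storage.
    """
    # Leading factor 2: one tensor for keys and one for values.
    return (2 * batch_size * seq_len * kv_layers
            * kv_heads_per_layer * head_dim * bytes_per_elem)


# Pythia-160M shape: 12 layers, 12 attention heads, head dimension 64.
LAYERS, HEADS, HEAD_DIM, SEQ_LEN = 12, 12, 64, 2048

mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)   # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 4, HEAD_DIM, SEQ_LEN)       # assumed: 4 KV groups per layer
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)       # one KV head per layer
mlkv = kv_cache_bytes(2, 1, HEAD_DIM, SEQ_LEN)           # assumed: 2 KV layers shared by all 12

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MLKV", mlkv)]:
    print(f"{name:5s} {size / 2**20:6.1f} MiB  ({mha / size:.0f}x smaller than MHA)")
```

Under these assumptions, MQA cannot go below one KV head per layer, so the only remaining lever is the number of layers that keep their own KV heads. Sharing one KV head across every six layers yields the 6x reduction relative to MQA quoted in the abstract.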
Taxonomy
Topics: Advanced Data Compression Techniques · Error Correcting Code Techniques · Digital Filter Design and Implementation
Methods: Attention Is All You Need · Dense Connections · Feedforward Network · Softmax · Multi-Query Attention · Grouped-query attention
