Latent Multi-Head Attention for Small Language Models
Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat

TL;DR
This paper studies latent multi-head attention (MLA) in small language models. With rotary positional embeddings (RoPE) and half-rank latent dimensions, MLA cuts KV-cache memory by 45% at the cost of only a 0.3% increase in validation loss, making it an efficient alternative to standard multi-head attention.
Contribution
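The paper presents what the authors describe as the first comprehensive study of MLA in small models. It benchmarks standard multi-head attention (MHA), MLA, and MLA+RoPE on 30M-parameter GPT models, and shows that MLA+RoPE saves KV-cache memory while staying on par with vanilla attention in quality.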
Findings
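MLA+RoPE with half-rank latent dimensions (r = d/2) reduces KV-cache memory by 45%.
At that setting, validation loss rises by only 0.3%, essentially matching vanilla attention.
RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%; with it, MLA surpasses vanilla attention by 2%.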
Abstract
We present the first comprehensive study of latent multi-head attention (MLA) for small language models, revealing interesting efficiency-quality trade-offs. Training 30M-parameter GPT models on 100,000 synthetic stories, we benchmark three architectural variants: standard multi-head attention (MHA), MLA, and MLA with rotary positional embeddings (MLA+RoPE). Our key finding is that MLA+RoPE with half-rank latent dimensions (r = d/2) achieves a 45% KV-cache memory reduction while incurring only a 0.3% increase in validation loss (essentially matching MHA quality), a Pareto improvement for memory-constrained deployment. We further show that RoPE is crucial for MLA in small models: without it, MLA underperforms vanilla attention by 3-5%, but with RoPE, it surpasses vanilla by 2%. Inference benchmarks on NVIDIA A100 GPUs reveal that MLA with r = d/2 achieves a 1.4 times speedup over full-rank…
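The idea behind the half-rank latent cache can be illustrated with a toy sketch. The code below is not the authors' implementation: the class name `LatentKV`, the weight names, the random initialisation, and the single shared latent for keys and values are all assumptions made for illustration, and it omits attention heads and RoPE entirely. It only shows the general low-rank pattern: cache a compressed latent of width r per token instead of full keys and values, then expand the latent when attention is computed.

```python
import numpy as np


def mha_cache_elems(d_model: int) -> int:
    """Elements cached per token per layer by standard attention (one key + one value)."""
    return 2 * d_model


def latent_cache_elems(rank: int) -> int:
    """Elements cached per token per layer when only a shared latent is stored (assumption)."""
    return rank


class LatentKV:
    """Toy low-rank key/value compression (hypothetical names, random weights)."""

    def __init__(self, d_model: int, rank: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Down-projection: hidden state (d_model) -> latent (rank).
        self.w_down = rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, rank))
        # Up-projections: latent (rank) -> keys / values (d_model).
        self.w_up_k = rng.normal(0.0, 1.0 / np.sqrt(rank), (rank, d_model))
        self.w_up_v = rng.normal(0.0, 1.0 / np.sqrt(rank), (rank, d_model))

    def compress(self, x: np.ndarray) -> np.ndarray:
        """Map hidden states of shape (seq, d_model) to the latent that would be cached."""
        return x @ self.w_down

    def expand(self, latent: np.ndarray):
        """Reconstruct keys and values of shape (seq, d_model) from the cached latent."""
        return latent @ self.w_up_k, latent @ self.w_up_v


if __name__ == "__main__":
    d_model, rank = 512, 256  # rank = d_model / 2, mirroring r = d/2
    kv = LatentKV(d_model, rank)
    hidden = np.random.default_rng(1).normal(size=(10, d_model))
    latent = kv.compress(hidden)   # this is what the cache would hold
    keys, values = kv.expand(latent)
    print(latent.shape, keys.shape, values.shape)
    print(latent_cache_elems(rank) / mha_cache_elems(d_model))
```

This idealised element count is not meant to reproduce the reported 45% reduction, which depends on details of the authors' architecture and measurement that the abstract does not give.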
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques
Methods: Absolute Position Encodings · Label Smoothing · Transformer · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Cosine Annealing · GPT-4
