TransMLA: Multi-Head Latent Attention Is All You Need

Fanxu Meng; Pingzhi Tang; Xiaojuan Tang; Zengwei Yao; Xing Sun; Muhan Zhang

arXiv:2502.07864 · cs.LG · June 13, 2025 · 6 cites

TL;DR

TransMLA converts GQA-based pre-trained models into MLA-based models, enabling faster inference and compatibility with DeepSeek's codebase. After conversion, about 6 billion tokens of fine-tuning are enough to bring benchmark performance back on par with the original model.

Contribution

The paper presents TransMLA, a framework that converts GQA-based pre-trained models into MLA-based models. On LLaMA-2-7B it reports a 10.6x inference speedup at 8K context length, and the converted models are directly compatible with DeepSeek's codebase.

Findings

01. 93% KV cache compression in LLaMA-2-7B

02. 10.6x inference speedup at 8K context length

03. Requires only 6 billion tokens for fine-tuning

Abstract

In this paper, we present TransMLA, a framework that seamlessly converts any GQA-based pre-trained model into an MLA-based model. Our approach enables direct compatibility with DeepSeek's codebase, allowing these models to fully leverage DeepSeek-specific optimizations in inference frameworks such as vLLM and SGLang. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length while preserving meaningful output quality. Additionally, the model requires only 6 billion tokens of fine-tuning to regain performance on par with the original across multiple benchmarks. TransMLA offers a practical solution for migrating GQA-based models to the MLA structure. When combined with DeepSeek's advanced features, such as FP8 quantization and Multi-Token Prediction, even greater inference acceleration can be realized.
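
The 93% figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the `AttnConfig` class and helper functions are hypothetical names, the LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128) comes from the model's public configuration, and the MLA latent size of 512 + 64 = 576 cached values per token per layer is an assumption borrowed from DeepSeek-style MLA settings, since the abstract does not state it. "8K" is taken to mean 8192 tokens, and the cache is assumed to be stored in 16-bit precision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical container for the attention shape of a model."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_elems_per_token_per_layer(cfg: AttnConfig) -> int:
    # MHA/GQA caches one key vector and one value vector per KV head.
    return 2 * cfg.num_kv_heads * cfg.head_dim


def cache_bytes(elems_per_token_per_layer: int, num_layers: int,
                seq_len: int, bytes_per_elem: int = 2) -> int:
    # Total cache size for one sequence; 2 bytes per element assumes FP16/BF16.
    return elems_per_token_per_layer * num_layers * seq_len * bytes_per_elem


# LLaMA-2-7B public configuration (plain multi-head attention: 32 KV heads).
LLAMA2_7B = AttnConfig(num_layers=32, num_kv_heads=32, head_dim=128)

# ASSUMPTION: DeepSeek-style MLA cache of 512 latent + 64 RoPE values
# per token per layer. The abstract above does not give this number.
MLA_LATENT_ELEMS = 512 + 64

SEQ_LEN = 8192  # ASSUMPTION: "8K context" read as 8192 tokens

baseline_elems = kv_elems_per_token_per_layer(LLAMA2_7B)
compression = 1 - MLA_LATENT_ELEMS / baseline_elems

baseline_bytes = cache_bytes(baseline_elems, LLAMA2_7B.num_layers, SEQ_LEN)
mla_bytes = cache_bytes(MLA_LATENT_ELEMS, LLAMA2_7B.num_layers, SEQ_LEN)

print(f"baseline cache per token per layer: {baseline_elems} values")
print(f"compression under the assumed latent size: {compression:.1%}")
print(f"baseline cache at 8K: {baseline_bytes / 1024**3:.2f} GiB")
print(f"latent cache at 8K:   {mla_bytes / 1024**2:.0f} MiB")
```

Under these assumptions the latent cache is roughly 7% of the baseline, which is consistent with the reported 93% compression. The reported 10.6x speedup is a measured result and does not follow from this memory ratio alone.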

Code & Models

Repositories

fxmeng/transmla (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · LLaMA · ADaptive gradient method with the OPTimal convergence rate