Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui

TL;DR
This paper introduces a data-efficient fine-tuning method to convert standard multi-head attention in large language models into a more economical multi-head latent attention, significantly reducing inference costs with minimal performance loss.
Contribution
It presents the first effective fine-tuning approach for transitioning from MHA to MLA in pre-trained LLMs without extensive retraining.
Findings
KV cache size of Llama2-7B reduced by 92.19%
Recovers performance using only 0.3% to 0.6% of the data
Seamless integration with compression techniques
Abstract
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from the dimensions of queries and keys that contribute less to the attention scores; for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values.…
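The two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`select_rope_pairs`, `joint_svd_kv`), the norm-product contribution score, the interleaved `(2i, 2i+1)` RoPE pairing, the even split of singular values between the two factors, and the toy dimensions are all assumptions made for illustration. The sketch also ignores how the two components interact in the real method.

```python
import numpy as np

def select_rope_pairs(q, k, keep):
    """Pick the `keep` RoPE dimension pairs that matter most (hypothetical criterion).

    q, k: (tokens, d_head) arrays; dims (2i, 2i+1) are assumed to form RoPE pair i.
    The score is the product of mean pair norms, an assumed proxy for how much
    a pair contributes to the attention score.
    """
    q_pairs = q.reshape(q.shape[0], -1, 2)
    k_pairs = k.reshape(k.shape[0], -1, 2)
    score = (np.linalg.norm(q_pairs, axis=-1).mean(axis=0)
             * np.linalg.norm(k_pairs, axis=-1).mean(axis=0))
    # Pairs outside the returned set would have RoPE removed.
    return np.sort(np.argsort(score)[-keep:])

def joint_svd_kv(w_k, w_v, rank):
    """Factor [w_k | w_v] into one shared down-projection and two up-projections."""
    w_kv = np.concatenate([w_k, w_v], axis=1)           # (d_model, d_k + d_v)
    u, s, vt = np.linalg.svd(w_kv, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])                          # split singular values evenly
    w_down = u[:, :rank] * sqrt_s                       # (d_model, rank)
    w_up = sqrt_s[:, None] * vt[:rank]                  # (rank, d_k + d_v)
    d_k = w_k.shape[1]
    return w_down, w_up[:, :d_k], w_up[:, d_k:]

# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 8
w_k = rng.standard_normal((d_model, d_head))
w_v = rng.standard_normal((d_model, d_head))

w_down, w_up_k, w_up_v = joint_svd_kv(w_k, w_v, rank)

x = rng.standard_normal((5, d_model))   # 5 token embeddings
latent = x @ w_down                     # the only tensor that would be cached
k_approx = latent @ w_up_k              # keys rebuilt from the latent
v_approx = latent @ w_up_v              # values rebuilt from the latent

rope_pairs = select_rope_pairs(x @ w_k, x @ w_k, keep=2)
```

In this toy setting the cache holds `rank` numbers per token instead of the `2 * d_head` needed for separate keys and values. Because the factorization is lossy at low rank, some fine-tuning is needed afterwards to recover quality, which is the step the abstract describes as data-efficient.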
Code & Models
- 🤗 OpenMOSS-Team/Llama-2-7B-MLA-d_kv_16 · model · 6 dl · ♡ 1
- 🤗 OpenMOSS-Team/Llama-2-7B-MLA-d_kv_32 · model · 6 dl · ♡ 1
- 🤗 OpenMOSS-Team/Llama-2-7B-MLA-d_kv_64 · model · 21 dl · ♡ 1
- 🤗 OpenMOSS-Team/SmolLM-135M-MLA-d_kv_8-refactor · model · 3 dl · ♡ 1
- 🤗 OpenMOSS-Team/Llama-2-7B-MHA-d_kv_256 · model · 4 dl · ♡ 3
- 🤗 OpenMOSS-Team/SmolLM-135M-MLA-d_kv_8 · model · 1 dl · ♡ 1
- 🤗 OpenMOSS-Team/SmolLM-135M-MLA-d_kv_16 · model
- 🤗 OpenMOSS-Team/SmolLM-135M-MLA-d_kv_32 · model · 2 dl
- 🤗 OpenMOSS-Team/SmolLM-135M-GQA-d_kv_128 · model
- 🤗 OpenMOSS-Team/SmolLM-360M-MLA-d_kv_8 · model · 3 dl
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Attention Is All You Need · Dense Connections · Linear Layer · Feedforward Network · Multi-Head Attention · Softmax · Grouped-query attention
