MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui

TL;DR
This paper introduces MHA2MLA-VLM, a framework that efficiently converts existing vision-language models to use multi-head latent attention, reducing memory and computation during inference without extensive retraining.
Contribution
It proposes a parameter-efficient method for adapting off-the-shelf VLMs to the MLA architecture, using modality-aware techniques and low-rank approximation to minimize performance loss.
Findings
Restores original model performance with minimal data
Reduces Key-Value cache size significantly
Seamlessly integrates with KV quantization
Abstract
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce…
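The low-rank approximation mentioned in the abstract can be illustrated with a generic truncated SVD. The sketch below is a minimal illustration under stated assumptions, not the paper's modality-decoupled procedure: the matrix `w_kv`, the dimensions, and the helper `low_rank_factors` are all hypothetical, and the projection is built to be exactly low rank so that the reconstruction error is near zero.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Truncated SVD: weight (d_in, d_out) ~= down (d_in, rank) @ up (rank, d_out)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    down = u[:, :rank] * s[:rank]  # fold the singular values into the down-projection
    up = vt[:rank, :]
    return down, up


rng = np.random.default_rng(0)
d_model, d_out, rank = 64, 48, 8  # hypothetical sizes, not taken from the paper

# Hypothetical KV projection, constructed to have rank 8 so the factorization is exact.
w_kv = rng.standard_normal((d_model, rank)) @ rng.standard_normal((rank, d_out))
down, up = low_rank_factors(w_kv, rank)

hidden = rng.standard_normal((5, d_model))  # 5 token hidden states
latent = hidden @ down                      # compact latent: 8 values per token
restored = latent @ up                      # expanded back to 48 values per token

max_err = np.abs(restored - hidden @ w_kv).max()
print(latent.shape, restored.shape, max_err < 1e-8)
```

In this toy setting, caching `latent` stores 8 values per token instead of 48. Real projection matrices are not exactly low rank, so truncation introduces some error, which is presumably why the conversion is followed by fine-tuning to recover performance.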
Code & Models
- 🤗 cnxup/Qwen2.5-VL-7B-MLA-stage1-rope32 · model · 13 dl · ♡ 1
- 🤗 cnxup/SVD-Init · model · ♡ 1
- 🤗 cnxup/Qwen2.5-VL-7B-MLA-stage2-rope32-d_kv_32 · model · 2 dl · ♡ 1
- 🤗 cnxup/Qwen2.5-VL-7B-MLA-stage2-rope32-d_kv_64 · model · 1 dl · ♡ 1
- 🤗 cnxup/Qwen2.5-VL-7B-MLA-stage2-rope32-d_kv_128 · model · 1 dl · ♡ 1
- 🤗 cnxup/LLaVA-NeXT-8B-MLA-stage2-rope32-d_kv_32 · model · 13 dl · ♡ 1
- 🤗 cnxup/LLaVA-NeXT-8B-MLA-stage2-rope32-d_kv_128 · model · 11 dl · ♡ 1
- 🤗 cnxup/LLaVA-NeXT-8B-MLA-stage2-rope32-d_kv_64 · model · 13 dl · ♡ 1
- 🤗 cnxup/LLaVA-NeXT-8B-MLA-stage1-rope32 · model · 13 dl · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
