MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models

Xiaoran Fan; Zhichao Sun; Tao Ji; Lixing Shen; Tao Gui

arXiv:2601.11464·cs.CV·January 19, 2026



TL;DR

This paper introduces MHA2MLA-VLM, a framework that efficiently converts existing vision-language models to use multi-head latent attention, reducing memory and computation during inference without extensive retraining.

Contribution

It proposes a parameter-efficient method for adapting off-the-shelf VLMs to the MLA architecture, combining a modality-adaptive partial-RoPE strategy with a modality-decoupled low-rank approximation to minimize performance loss.

Findings

01. Restores original model performance with minimal data
02. Reduces Key-Value cache size significantly
03. Seamlessly integrates with KV quantization

Abstract

As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of the Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce…
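To make the KV-cache argument concrete, the following is a minimal back-of-the-envelope sketch, not the paper's implementation. It compares the number of cached elements per token under standard multi-head attention (one key and one value vector per KV head per layer) with an MLA-style cache (one low-rank latent vector plus the key dimensions that still carry RoPE, per layer). All function names and every numeric value below (layer count, head count, head size, latent size, RoPE size, sequence length) are hypothetical and chosen only for illustration; none are taken from the paper.

```python
def mha_kv_elems_per_token(n_layers, n_kv_heads, head_dim):
    """Elements cached per token by standard attention:
    one key and one value vector per KV head, in every layer."""
    return n_layers * 2 * n_kv_heads * head_dim


def mla_kv_elems_per_token(n_layers, latent_dim, rope_dim):
    """Elements cached per token by an MLA-style cache:
    one compressed latent vector plus the RoPE-carrying key
    dimensions kept uncompressed, in every layer."""
    return n_layers * (latent_dim + rope_dim)


def cache_bytes(elems_per_token, n_tokens, bytes_per_elem=2):
    """Total cache size in bytes (2 bytes per element assumes fp16/bf16)."""
    return elems_per_token * n_tokens * bytes_per_elem


# Hypothetical configuration, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 4, 128
LATENT_DIM, ROPE_DIM = 128, 32
N_TOKENS = 4096  # text tokens plus visual tokens in one sequence

mha_elems = mha_kv_elems_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
mla_elems = mla_kv_elems_per_token(N_LAYERS, LATENT_DIM, ROPE_DIM)

mha_bytes = cache_bytes(mha_elems, N_TOKENS)
mla_bytes = cache_bytes(mla_elems, N_TOKENS)

print(f"MHA cache: {mha_bytes / 2**20:.1f} MiB")
print(f"MLA cache: {mla_bytes / 2**20:.1f} MiB")
print(f"MLA / MHA: {mla_bytes / mha_bytes:.3f}")
```

Because both caches grow linearly with sequence length, the ratio depends only on the per-token element counts, so the saving is the same for short prompts and for long sequences dominated by visual tokens.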


Videos

MHA2MLA-VLM: Enabling DeepSeek’s Economical Multi-Head Latent Attention Across Vision-Language Models · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications