Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Bencheng Liao; Hongyuan Tao; Qian Zhang; Tianheng Cheng; Yingyue Li; Haoran Yin; Wenyu Liu; Xinggang Wang

arXiv:2502.13145 · cs.CV · March 19, 2025

TL;DR

This paper introduces mmMamba, a novel framework for converting large multimodal language models into efficient, linear-complexity architectures through progressive distillation, enabling faster inference and reduced memory usage while maintaining strong multimodal capabilities.

Contribution

It presents a new distillation method to transform trained decoder-only MLLMs into linear-complexity models without pre-trained RNNs or vision encoders, supporting hybrid architectures.

Findings

01. mmMamba-linear achieves 20.6× speedup and 75.8% GPU memory reduction.

02. mmMamba-hybrid significantly improves performance, nearing original model capabilities.

03. The approach enables efficient multimodal models with competitive performance.

Abstract

Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from a trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures…
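The "growing Key-Value cache" concern in the abstract can be illustrated with a rough element count. The sketch below is not taken from the paper or its repository: the function names, the layer/head sizes, and the state dimension are all hypothetical, chosen only to show that a Transformer's cache grows with sequence length while a recurrent state-space layer keeps a fixed-size state.

```python
def kv_cache_elements(seq_len, n_layers, n_heads, head_dim):
    """Elements cached by a decoder-only Transformer: one key and one
    value vector per past token, per head, per layer."""
    return 2 * n_layers * n_heads * head_dim * seq_len


def recurrent_state_elements(n_layers, n_heads, head_dim, state_dim):
    """Elements held by a recurrent (state-space style) model: one fixed
    head_dim x state_dim state per head, per layer, regardless of how
    many tokens have been processed."""
    return n_layers * n_heads * head_dim * state_dim


if __name__ == "__main__":
    # Hypothetical model configuration, for illustration only.
    cfg = dict(n_layers=24, n_heads=16, head_dim=128)
    fixed = recurrent_state_elements(state_dim=128, **cfg)
    for seq_len in (1_000, 10_000, 100_000):
        kv = kv_cache_elements(seq_len, **cfg)
        print(f"{seq_len:>7} tokens: KV cache {kv:>14,} elements, "
              f"recurrent state {fixed:,} elements")
```

Under these assumed sizes the cache count scales linearly with the number of tokens, whereas the recurrent state count does not change. Actual savings depend on the real architecture and implementation.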

Code & Models

Repositories

hustvl/mmmamba
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Position-Wise Feed-Forward Layer · Adam