Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang

TL;DR
This paper introduces mmMamba, a novel framework for converting large multimodal language models into efficient, linear-complexity architectures through progressive distillation, enabling faster inference and reduced memory usage while maintaining strong multimodal capabilities.
Contribution
It presents a new distillation method to transform trained decoder-only MLLMs into linear-complexity models without pre-trained RNNs or vision encoders, supporting hybrid architectures.
Findings
mmMamba-linear achieves 20.6× speedup and 75.8% GPU memory reduction.
mmMamba-hybrid significantly improves performance over mmMamba-linear, approaching the capabilities of the original Transformer model.
The approach enables efficient multimodal models with competitive performance.
Abstract
Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from a trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures…
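To make the abstract's contrast between quadratic attention with a growing Key-Value cache and a linear-complexity recurrence concrete, here is a minimal pure-Python sketch. It is not the mmMamba layer or the paper's distillation recipe: it uses plain unnormalized causal linear attention (no softmax, no gating) as a stand-in, and the function names `quadratic_causal` and `linear_recurrent` are hypothetical. It only shows that the same outputs can come from either rescanning a cache that grows with sequence length or updating a fixed-size state once per token.

```python
# Illustrative only: unnormalized causal linear attention, not the mmMamba layer.

def quadratic_causal(qs, ks, vs):
    """Each step rescans every cached key/value: O(T^2) time, cache grows with T."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for k, v in zip(ks[: t + 1], vs[: t + 1]):  # walk the whole cache so far
            score = sum(qi * ki for qi, ki in zip(q, k))
            out = [o + score * vj for o, vj in zip(out, v)]
        outputs.append(out)
    return outputs


def linear_recurrent(qs, ks, vs):
    """Fold each key/value pair into a fixed d_k x d_v state: O(T) time, constant memory."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]  # S_t = S_{t-1} + k_t v_t^T
        # Read-out: o_t = q_t^T S_t
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


# Toy inputs: three tokens, key/query dimension 2, value dimension 2.
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
ks = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

quad = quadratic_causal(qs, ks, vs)
lin = linear_recurrent(qs, ks, vs)
```

Both functions return the same per-token outputs on these toy inputs. The recurrent form never stores past keys or values, which is the property behind the memory savings that linear-complexity models target.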
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Position-Wise Feed-Forward Layer · Adam
