ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2
Wenjun Huang, Jiakai Pan, Jiahao Tang, Yanyu Ding, Yifei Xing, Yuhe Wang, Zhengzhuo Wang, Jianguo Hu

TL;DR
ML-Mamba introduces an efficient multimodal language model leveraging Mamba-2 for faster inference and competitive performance, replacing traditional Transformers with a linear, scalable architecture suitable for long sequences and multimodal tasks.
Contribution
The paper presents ML-Mamba, a novel multimodal language model that integrates Mamba-2 for improved efficiency and introduces the Mamba-2 Scan Connector for enhanced multimodal representation.
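The summary does not describe how the Mamba-2 Scan Connector works internally. As a rough illustration of what scanning a 2D image with a 1D sequence model involves, the sketch below flattens a grid of patch features into four traversal orders. The function name `scan_orders`, the choice of four directions, and the array shapes are assumptions made for illustration, not the authors' design.

```python
import numpy as np

def scan_orders(patches: np.ndarray) -> dict:
    """Flatten an (H, W, D) grid of patch features into four 1D sequences.

    Hypothetical helper: a generic 2D scanning illustration, not the
    paper's connector.
    """
    h, w, d = patches.shape
    # Row-major: left to right within a row, rows from top to bottom.
    row_major = patches.reshape(h * w, d)
    # Column-major: top to bottom within a column, columns from left to right.
    col_major = patches.transpose(1, 0, 2).reshape(h * w, d)
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }

# A 2x3 grid with one feature per patch, numbered in row-major order.
grid = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
orders = scan_orders(grid)
```

Each sequence could then be fed to a 1D sequence model, so that every patch is seen with context from more than one spatial direction. How the resulting outputs would be merged is left open here.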
Findings
ML-Mamba achieves performance comparable to state-of-the-art models on multimodal benchmarks.
It delivers faster inference because its cost scales linearly with sequence length.
Mamba-2-based models outperform Mamba-1 variants in efficiency and effectiveness.
Abstract
Multimodal Large Language Models (MLLMs) have attracted much attention for their versatility. However, traditional Transformer architectures incur significant overhead due to their quadratic computational complexity. To address this issue, we introduce ML-Mamba, a multimodal language model that uses the recent and efficient Mamba-2 model for inference. Mamba-2 is known for its linear scalability and fast processing of long sequences. We replace the Transformer-based backbone with a pre-trained Mamba-2 model and explore methods for integrating 2D visual selective scanning mechanisms into multimodal learning, while also trying various visual encoders and Mamba-2 model variants. Our extensive experiments on various multimodal benchmarks demonstrate the competitive performance of ML-Mamba and highlight the potential of state space models in multimodal tasks. The…
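To make the complexity contrast concrete, here is a minimal sketch of a selective state space recurrence with a scalar, input-dependent decay. It is a simplified stand-in, not the Mamba-2 kernel: the names `selective_scan` and `quadratic_reference`, the single input channel, and the random test data are all assumptions. The recurrent form visits each position once, while the reference form computes the same output by summing over every pair of positions, as attention-style computation does.

```python
import numpy as np

def selective_scan(x, a, b, c):
    """Linear-time recurrence: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t . h_t.

    x: (L,) inputs, a: (L,) decays, b and c: (L, N) projections.
    """
    seq_len, state_dim = b.shape
    h = np.zeros(state_dim)
    y = np.empty(seq_len)
    for t in range(seq_len):          # one state update per position
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] @ h
    return y

def quadratic_reference(x, a, b, c):
    """Same output via explicit pairwise sums over all s <= t."""
    seq_len = len(x)
    y = np.zeros(seq_len)
    for t in range(seq_len):
        for s in range(t + 1):
            decay = np.prod(a[s + 1 : t + 1])  # empty product is 1.0 when s == t
            y[t] += (c[t] @ b[s]) * decay * x[s]
    return y

# Illustrative random data (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
L, N = 16, 4
x = rng.normal(size=L)
a = rng.uniform(0.5, 1.0, size=L)
b = rng.normal(size=(L, N))
c = rng.normal(size=(L, N))

y_fast = selective_scan(x, a, b, c)
y_slow = quadratic_reference(x, a, b, c)
```

The recurrence performs L state updates, whereas the pairwise form touches L(L+1)/2 position pairs, which is why the gap widens as sequences grow.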
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Label Smoothing · Adam · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention
