ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Wenjun Huang; Jiakai Pan; Jiahao Tang; Yanyu Ding; Yifei Xing; Yuhe Wang; Zhengzhuo Wang; Jianguo Hu

arXiv:2407.19832·cs.CV·August 22, 2024·1 cites

Open Access

TL;DR

ML-Mamba is an efficient multimodal language model built on Mamba-2. It replaces the Transformer backbone with an architecture whose cost scales linearly with sequence length, giving faster inference with competitive performance on long sequences and multimodal tasks.

Contribution

The paper presents ML-Mamba, a novel multimodal language model that integrates Mamba-2 for improved efficiency and introduces the Mamba-2 Scan Connector for enhanced multimodal representation.

Findings

1. ML-Mamba achieves comparable performance to state-of-the-art models.
2. It demonstrates faster inference speeds due to linear scalability.
3. Mamba-2-based models outperform Mamba-1 variants in efficiency and effectiveness.

Abstract

Multimodal Large Language Models (MLLMs) have attracted much attention for their versatility. However, traditional Transformer architectures incur significant overhead due to their quadratic computational complexity. To address this issue, we introduce ML-Mamba, a multimodal language model that uses the recent and efficient Mamba-2 model for inference. Mamba-2 is known for its linear scalability and fast processing of long sequences. We replace the Transformer-based backbone with a pre-trained Mamba-2 model and explore methods for integrating 2D visual selective scanning mechanisms into multimodal learning, while also evaluating various visual encoders and Mamba-2 model variants. Our extensive experiments on various multimodal benchmarks demonstrate the competitive performance of ML-Mamba and highlight the potential of state space models in multimodal tasks. The…
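
The contrast the abstract draws between the cost of Transformer attention and the linear scalability of a state space model can be illustrated with a toy recurrence. The sketch below is not the paper's Mamba-2 implementation: the names `ssm_scan`, `attention_pair_count`, and `scan_step_count`, and the simplification to a single scalar state, are assumptions made purely for illustration. It shows only that a recurrence of the form h_t = a_t * h_{t-1} + b_t * x_t visits each token once, whereas full self-attention scores every pair of tokens.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a scalar state space recurrence in one left-to-right pass.

    h_t = a_t * h_{t-1} + b_t * x_t
    y_t = c_t * h_t

    The per-step coefficients may differ at every position, which is a
    simplified stand-in for input-dependent (selective) parameters.
    """
    h = 0.0  # hidden state carried across the sequence
    ys: List[float] = []
    for x_t, a_t, b_t, c_t in zip(x, a, b, c):
        h = a_t * h + b_t * x_t  # constant work per token
        ys.append(c_t * h)
    return ys


def attention_pair_count(seq_len: int) -> int:
    """Query-key scores computed by full (unmasked) self-attention."""
    return seq_len * seq_len


def scan_step_count(seq_len: int) -> int:
    """State updates performed by the recurrence above."""
    return seq_len


if __name__ == "__main__":
    print(ssm_scan([1.0, 1.0, 1.0], [0.5] * 3, [1.0] * 3, [1.0] * 3))
    for n in (1_024, 8_192):
        print(n, scan_step_count(n), attention_pair_count(n))
```

Growing the sequence from 1,024 to 8,192 tokens multiplies the recurrence's step count by 8 and the attention pair count by 64, which is the gap the abstract refers to. Real Mamba-2 layers use vector-valued states and hardware-aware parallel algorithms that this sketch does not attempt to model.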

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Label Smoothing · Adam · Linear Layer · Byte Pair Encoding · Layer Normalization · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Multi-Head Attention