Multimodal Instruction Tuning with Hybrid State Space Models
Jianing Zhou, Han Li, Shuai Zhang, Ning Xie, Ruijie Wang, Xiaohan Nie, Sheng Liu, Lingyun Wang

TL;DR
This paper introduces a hybrid transformer-MAMBA model that efficiently processes extremely long multimodal contexts exceeding 100,000 tokens, significantly improving inference speed and flexibility for high-resolution images and videos.
Contribution
The paper presents a novel hybrid transformer-MAMBA architecture that handles long multimodal contexts efficiently, surpassing existing models in both performance and inference speed.
Findings
Handles over 100k tokens in multimodal inputs
Achieves 4x inference efficiency gains for high-res images and videos
First to train on low-res data while inferring on high-res inputs
Abstract
Handling lengthy context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs) in applications such as processing high-resolution images or high-frame-rate videos. The rise in image resolution and frame rate substantially increases computational demands because it increases the number of input tokens. This challenge is further exacerbated by the quadratic complexity of the self-attention mechanism with respect to sequence length. Most prior works either pre-train models with long contexts, overlooking the efficiency problem, or reduce the context length via downsampling (e.g., identifying the key image patches or frames), which may result in information loss. To circumvent this issue while keeping the remarkable effectiveness of MLLMs, we propose a novel approach using a hybrid…
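The scaling argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. The patch size (14), the resolutions, and the frame count below are illustrative assumptions, not settings reported by the paper, and the helper names are hypothetical. The sketch only counts token interactions and ignores constant factors such as the hidden dimension.

```python
def visual_token_count(height, width, patch_size=14, frames=1):
    """Patch tokens produced by `frames` images of size height x width.

    patch_size=14 is an assumed ViT-style patch size, not a value from the paper.
    """
    patches_per_frame = (height // patch_size) * (width // patch_size)
    return patches_per_frame * frames


def attention_pairwise_ops(seq_len):
    """Self-attention scores every token against every token: grows as n^2."""
    return seq_len * seq_len


def recurrent_scan_ops(seq_len):
    """A state space layer updates a fixed-size state once per token: grows as n."""
    return seq_len


# Assumed example resolutions: 336x336 versus 1344x1344 (4x per side).
low_res_tokens = visual_token_count(336, 336)
high_res_tokens = visual_token_count(1344, 1344)

# Assumed example video: 200 frames at 336x336.
video_tokens = visual_token_count(336, 336, frames=200)

attention_growth = attention_pairwise_ops(high_res_tokens) // attention_pairwise_ops(low_res_tokens)
scan_growth = recurrent_scan_ops(high_res_tokens) // recurrent_scan_ops(low_res_tokens)

print(f"low-res tokens:   {low_res_tokens}")
print(f"high-res tokens:  {high_res_tokens}")
print(f"video tokens:     {video_tokens}")
print(f"attention cost grows {attention_growth}x, linear scan cost grows {scan_growth}x")
```

Under these assumptions, a 4x increase in side length yields 16x more tokens, which a linear-time layer pays for once but pairwise attention pays for squared. This is the motivation the abstract gives for replacing most attention layers with state space layers; it says nothing about the actual speedups measured in the paper.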
Peer Reviews
Decision·Submitted to ICLR 2025
The paper is well-written and easy to understand, making it accessible to a wide audience. The use of State Space Models (SSMs) to reduce inference costs is a reasonable and potentially beneficial approach.
1. Limited Novelty: The framework resembles LLaVA: a vision encoder, an MLP adapter, and an LLM backbone. The training pipeline also resembles LLaVA: first vision-language alignment training for the adapter, then instruction tuning for the LLM backbone. Train-short-inference-long is not a new technique; it is already used for input length extrapolation in LLMs [1]. The hybrid model structure is from Jamba. In sum, I don’t think the novelty is enough for an ICLR paper. 2. It’s better to sho
1. The hybrid transformer-Mamba architecture effectively manages long context inputs, significantly improving inference efficiency while maintaining competitive accuracy.
2. The "train-on-short-infer-on-long" method reduces training complexity and computational cost, making it practical for training large LMMs and on high-resolution images and long videos.
3. The model demonstrates adaptability to various resolutions and frame rates, showcasing flexibility in handling diverse multimodal tasks.
1. The reviewer is concerned with the novelty of this paper. As far as the reviewer can tell, the only contribution and modification made in this paper is replacing a decoder-only LLM with a state-space LLM. The claimed advantages of MMJAMBA, such as computational efficiency and the "train-on-short-infer-on-long" method, are rooted in the state-space LLM, not in any novel design of MMJAMBA. Other than the different choice of LMM, everything else remains canonical to standard LMMs. For example, a
The use of an SSM architecture to enhance the modeling efficiency for long contexts and tokens in MLLMs is intuitively motivated. Overall, the experimental section of the study appears to be thorough and solid, both in terms of content and effort. It also confirms that the proposed model indeed improves both efficiency and performance.
- The primary concern of the reviewer is the possible lack of technical innovation by the authors. Although the idea of leveraging an SSM architecture to enhance the modeling efficiency for long contexts and tokens in MLLMs is interesting, the reviewer notes (in sec. 3) that the authors have simply applied the existing SSM framework to the modeling and processing of images and videos without any additional technical improvements or innovation (since this is basically explored in existing research
Taxonomy
Topics: Speech and dialogue systems · Multi-Agent Systems and Negotiation
