TL;DR
LongLLaVA introduces a hybrid Mamba-Transformer architecture and a progressive training strategy for multi-modal large language models, allowing it to process nearly 1000 images on a single GPU while keeping throughput high and memory usage low.
Contribution
The paper presents a novel hybrid architecture combining Mamba and Transformer blocks, along with new data construction and training methods for scalable multi-modal LLMs.
Findings
Processes nearly 1000 images on a single GPU
Achieves competitive benchmark results
Maintains high throughput and low memory usage
Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is critical for advancing video understanding and high-resolution image analysis. Achieving this requires systematic improvements in model architecture, data construction, and training strategies, particularly to address challenges such as performance degradation with increasing image counts and high computational costs. In this paper, we propose a hybrid architecture that integrates Mamba and Transformer blocks, introduce data construction methods that capture both temporal and spatial dependencies, and employ a progressive training strategy. Our released model, LongLLaVA (Long-Context Large Language and Vision Assistant), demonstrates an effective balance between efficiency and performance. LongLLaVA achieves competitive results across various…
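To make the efficiency argument for a hybrid stack concrete, here is a minimal sketch of a toy cost model. It is not the authors' implementation. The function names, the 8-layer depth, and the one-attention-layer-in-eight interleaving are assumptions chosen for illustration, and constant factors are ignored.

```python
# Toy cost model (illustrative assumptions, not the paper's code):
# an attention layer mixes tokens at a cost ~ n^2, a Mamba layer at ~ n.

def layer_pattern(num_layers: int, attention_every: int) -> list[str]:
    """Build a layer list with one attention layer per `attention_every` layers."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]


def relative_cost(seq_len: int, pattern: list[str]) -> int:
    """Sum the per-layer toy costs for a sequence of `seq_len` tokens."""
    return sum(seq_len**2 if kind == "attention" else seq_len for kind in pattern)


# Hypothetical 8-layer stacks: hybrid (1 attention + 7 Mamba) vs. all attention.
hybrid = layer_pattern(num_layers=8, attention_every=8)
pure_attention = ["attention"] * 8

for n in (1_000, 10_000, 100_000):
    ratio = relative_cost(n, hybrid) / relative_cost(n, pure_attention)
    print(f"n={n}: hybrid / all-attention cost ratio = {ratio:.3f}")
```

In this toy model the quadratic term does not vanish. It shrinks in proportion to the fraction of attention layers, so the hybrid stack keeps some attention for in-context learning while paying far less per added token than an all-attention stack.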
Peer Reviews
Decision·Submitted to ICLR 2025
LongLLaVA can handle up to 1173 images on a single 80GB GPU, demonstrating the capacity to process more images and thereby capture more spatial and temporal information. The proposed hybrid architecture improves throughput and reduces memory usage while maintaining good performance on both ICL and VLM benchmarks. The enhanced data construction and progressive training strategy guide the model to distinguish temporal and spatial dependencies among images.
My main concern with the paper is that the motivation is not clearly justified. LongLLaVA aims to enable more frames (tokens) for Vision Language Models (VLM) and thus explores a Mamba-based architecture. However, the reason for choosing a hybrid architecture is confusing. Lines 127-128 mention the Mamba model's in-context learning (ICL) capability as indispensable. Is there any evidence or literature to support that this is a weakness of the Mamba architecture itself rather than a result of trai
- The integration of Mamba and Transformer layers enables LongLLaVA to achieve quasi-linear computational complexity while supporting in-context learning. This design is well motivated for handling long videos.
- Detailed experimental results, ablation studies, and diagnostic evaluations have been conducted on benchmarks like MileBench, VNBench, and Video-MME, showing that LongLLaVA achieves reasonable performance on multi-image and video tasks.
- LongLLaVA maintains a lower computational cost co
- The use of LongLLaVA-9B from Expert-0 in the Mamba (Jamba) MoE Layer appears unorthodox and lacks sufficient justification. The VLM's capabilities are heavily dependent on its underlying LLM, and it's unclear how well Jamba expert-0 performs as an LLM.
- The evaluation baselines are outdated and not sufficiently competitive. While LongVA-7B is mentioned, it's not directly compared against, and several newer 7B VLMs with superior performance are excluded. Although fair comparisons are challengi
- The motivations are outlined very clearly, and the way the authors chose to address the challenges presented makes sense
- Hybrid architecture efficiency analysis
- Ablation of architecture choices
- Multiple model scales provide some scaling analysis
- Many experiments, including additional applications to healthcare and science
- There is no ablation study comparing the 3-stage finetuning against a typical 1-stage finetuning that mixes the data from all three stages.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image and Object Detection Techniques · Image Processing Techniques and Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Label Smoothing · Linear Layer · Adam · Dropout · Layer Normalization
