LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

Xidong Wang; Dingjie Song; Shunian Chen; Junyin Chen; Zhenyang Cai; Chen Zhang; Lichao Sun; Benyou Wang

arXiv:2409.02889 · cs.CL · September 24, 2025 · 2 cites

TL;DR

LongLLaVA introduces a hybrid architecture and training strategy that significantly enhances multi-modal large language models, enabling efficient processing of up to 1000 images with high performance and low resource consumption.

Contribution

The paper presents a novel hybrid architecture combining Mamba and Transformer blocks, along with new data construction and training methods for scalable multi-modal LLMs.
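
To make the idea of interleaving Mamba and Transformer blocks concrete, here is a minimal sketch rather than the authors' implementation. The function names, the `attention_every` period, the `state_size` value, and the cost proxy are all assumptions made for illustration; this page does not state the layer ratio or dimensions used in LongLLaVA.

```python
from typing import List


def hybrid_schedule(num_layers: int, attention_every: int) -> List[str]:
    """Build a layer-type list with one attention layer per `attention_every` layers.

    Hypothetical helper: the real model's layer layout is not given in this page.
    """
    if num_layers <= 0 or attention_every <= 0:
        raise ValueError("num_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(num_layers)
    ]


def relative_mixing_cost(schedule: List[str], seq_len: int, state_size: int = 16) -> int:
    """Crude proxy for token-mixing cost, ignoring constants and hidden width.

    Assumption: attention scales with seq_len**2, a Mamba-style layer with
    seq_len * state_size. This is an illustration, not a FLOP count.
    """
    cost = 0
    for kind in schedule:
        if kind == "attention":
            cost += seq_len ** 2
        else:
            cost += seq_len * state_size
    return cost


# Hypothetical configuration: 8 layers, every 4th layer is attention.
schedule = hybrid_schedule(num_layers=8, attention_every=4)
hybrid_cost = relative_mixing_cost(schedule, seq_len=1000)
full_attention_cost = relative_mixing_cost(["attention"] * 8, seq_len=1000)
```

Under this proxy, the attention layers dominate the hybrid stack's cost as the sequence grows, so lowering their share reduces the quadratic term while keeping some attention layers in the stack.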

Findings

01 Processes nearly 1000 images on a single GPU

02 Achieves competitive benchmark results

03 Maintains high throughput and low memory usage

Abstract

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is critical for advancing video understanding and high-resolution image analysis. Achieving this requires systematic improvements in model architecture, data construction, and training strategies, particularly to address challenges such as performance degradation with increasing image counts and high computational costs. In this paper, we propose a hybrid architecture that integrates Mamba and Transformer blocks, introduce data construction methods that capture both temporal and spatial dependencies, and employ a progressive training strategy. Our released model, LongLLaVA (Long-Context Large Language and Vision Assistant), demonstrates an effective balance between efficiency and performance. LongLLaVA achieves competitive results across various…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 5

Strengths

LongLLaVA can handle up to 1173 images on a single 80GB GPU, demonstrating a strong capacity for processing many images and capturing more spatial and temporal information. The proposed hybrid architecture improves throughput and reduces memory usage while maintaining good performance on both ICL and VLM benchmarks. The enhanced data construction and progressive training strategy guide the model to distinguish temporal and spatial dependencies among images.

Weaknesses

My main concern for the paper is that the motivation is not clearly justified. LongLLaVA aims to enable more frames (tokens) for Vision Language Models (VLM) and thus explores a Mamba-based architecture. However, the reason for choosing a hybrid architecture is confusing. Lines 127-128 mention the Mamba model's in-context learning (ICL) capability as indispensable. Is there any evidence or literature to support that this is a weakness of the Mamba architecture itself rather than a result of trai

Reviewer 02 · Rating 5 · Confidence 5

Strengths

- The integration of Mamba and Transformer layers enables LongLLaVA to achieve quasi-linear computational complexity while supporting in-context learning. This design is well motivated to deal with long videos.
- Detailed experimental results, ablation studies, and diagnostic evaluations have been conducted on benchmarks like MileBench, VNBench, and Video-MME, showcasing that LongLLaVA has reasonable performances on multi-image and video tasks.
- LongLLaVA maintains a lower computational cost co

Weaknesses

- The use of LongLLaVA-9B from Expert-0 in the Mamba (Jamba) MoE Layer appears unorthodox and lacks sufficient justification. The VLM's capabilities are heavily dependent on its underlying LLM, and it's unclear how well Jamba expert-0 performs as an LLM.
- The evaluation baselines are outdated and not sufficiently competitive. While LongVA-7B is mentioned, it's not directly compared against, and several newer 7B VLMs with superior performance are excluded. Although fair comparisons are challengi

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- The motivations are outlined very clearly, and the way the authors chose to address the challenges presented makes sense
- Hybrid architecture efficiency analysis
- Ablation of architecture choices
- Multiple model scales provide some scaling analysis
- Many experiments, including additional applications to healthcare and science

Weaknesses

- There is no ablation study showcasing the 3-stage finetuning vs a typical 1-stage finetuning step with all of the data across the three stages mixed in.

Code & Models

Repositories

freedomintelligence/longllava
PyTorch · Official

Models

Videos

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image and Object Detection Techniques · Image Processing Techniques and Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Label Smoothing · Linear Layer · Adam · Dropout · Layer Normalization