LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li

TL;DR
LLaVA-NeXT-Interleave advances large multimodal models by enabling them to handle multi-image, video, and 3D tasks simultaneously, with a comprehensive dataset and benchmark, demonstrating superior performance and emerging cross-modal capabilities.
Contribution
It introduces a unified framework and dataset for multi-image, video, and 3D tasks in LMMs, enabling cross-scenario generalization and new capabilities.
Findings
Achieves leading results on multi-image, video, and 3D benchmarks.
Maintains performance on single-image tasks.
Exhibits emerging cross-modal transfer capabilities.
Abstract
Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks; their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with newly emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of…
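The abstract treats the interleaved data format as one template covering multi-image, video, 3D, and single-image inputs. The sketch below illustrates that idea only; it is not the authors' data pipeline. The `<image>` placeholder string, the `Image` class, and the helper names `build_interleaved` and `video_as_segments` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed placeholder string marking where an image sits in the text stream.
IMAGE_TOKEN = "<image>"


@dataclass
class Image:
    """Hypothetical stand-in for an image, video frame, 3D view, or patch."""
    path: str


Segment = Union[str, Image]


def build_interleaved(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Flatten mixed text/image segments into a prompt plus an ordered image list.

    Each Image becomes one placeholder in the prompt, so the i-th placeholder
    corresponds to the i-th entry of the returned image list.
    """
    parts: List[str] = []
    images: List[str] = []
    for seg in segments:
        if isinstance(seg, Image):
            parts.append(IMAGE_TOKEN)
            images.append(seg.path)
        else:
            parts.append(seg)
    return " ".join(parts), images


def video_as_segments(frame_paths: List[str], question: str) -> List[Segment]:
    """Express a video as a sequence of frame images followed by a question."""
    return [Image(p) for p in frame_paths] + [question]


# Multi-image: two images interleaved with text.
prompt, images = build_interleaved(
    ["Compare", Image("a.png"), "with", Image("b.png"), "and describe the change."]
)

# Multi-frame (video): the same template, with frames standing in for images.
video_prompt, video_images = build_interleaved(
    video_as_segments(["f0.png", "f1.png", "f2.png"], "What happens in the clip?")
)
```

Under this framing, the scenarios differ only in what the image slots hold (separate images, frames, views, or patches), so a single prompt builder serves all of them.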
Code & Models
- 🤗 lmms-lab/llava-next-interleave-qwen-7b-dpo (model) · 97 dl · ♡ 12
- 🤗 lmms-lab/llava-next-interleave-qwen-0.5b (model) · 31 dl · ♡ 12
- 🤗 llava-hf/llava-interleave-qwen-0.5b-hf (model) · 191k dl · ♡ 36
- 🤗 llava-hf/llava-interleave-qwen-7b-hf (model) · 660 dl · ♡ 29
- 🤗 llava-hf/llava-interleave-qwen-7b-dpo-hf (model) · 39 dl · ♡ 2
- 🤗 luisresende13/llava-interleave-qwen-0.5b-hf (model) · 5 dl · ♡ 1
- 🤗 mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov (model) · ♡ 1
- 🤗 zooblastlbz/id-align (model)
Taxonomy
Topics: Simulation and Modeling Applications · Human Motion and Animation · Video Analysis and Summarization
Methods: Focus
