LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li; Renrui Zhang; Hao Zhang; Yuanhan Zhang; Bo Li; Wei Li; Zejun Ma; Chunyuan Li

arXiv:2407.07895 · cs.CV · July 30, 2024 · 22 cites

TL;DR

LLaVA-NeXT-Interleave advances large multimodal models by enabling them to handle multi-image, video, and 3D tasks simultaneously, with a comprehensive dataset and benchmark, demonstrating superior performance and emerging cross-modal capabilities.

Contribution

It introduces a unified framework and dataset for multi-image, video, and 3D tasks in LMMs, enabling cross-scenario generalization and new capabilities.

Findings

01 Achieves leading results on multi-image, video, and 3D benchmarks.
02 Maintains performance on single-image tasks.
03 Exhibits emerging cross-modal transfer capabilities.

Abstract

Visual instruction tuning has made considerable strides in enhancing the capabilities of Large Multimodal Models (LMMs). However, existing open LMMs largely focus on single-image tasks, and their application to multi-image scenarios remains less explored. Additionally, prior LMM research tackles different scenarios separately, making it impossible to generalize across scenarios with newly emerging capabilities. To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs. To enable these capabilities, we regard the interleaved data format as a general template and compile the M4-Instruct dataset with 1,177.6k samples, spanning 4 primary domains with 14 tasks and 41 datasets. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of…
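
The idea of treating an interleaved format as a general template can be illustrated with a minimal sketch. It is not the authors' implementation: the `<image>` placeholder string, the `Image` class, the `to_interleaved` helper, and all file names are hypothetical and chosen only for illustration. The point is that separate images, video frames, and 3D views can all be written as the same kind of sequence, a mix of visual inputs and text.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical placeholder string; an assumption, not taken from the paper.
IMAGE_TOKEN = "<image>"


@dataclass
class Image:
    """Stand-in for one visual input: an image, a video frame, a 3D view, or a patch."""
    source: str


Segment = Union[str, Image]


def to_interleaved(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Flatten mixed text/image segments into a prompt string and an ordered image list."""
    parts: List[str] = []
    images: List[str] = []
    for seg in segments:
        if isinstance(seg, Image):
            parts.append(IMAGE_TOKEN)   # mark where the visual input sits in the text
            images.append(seg.source)   # keep visual inputs in their original order
        else:
            parts.append(seg)
    return " ".join(parts), images


# Multi-image: two separate pictures referenced by one question.
multi_image = [Image("a.jpg"), Image("b.jpg"), "What differs between these two images?"]

# Multi-frame (video): sampled frames in temporal order, then the instruction.
video = [Image(f"frame_{i}.jpg") for i in range(4)] + ["Describe what happens in the video."]

prompt, images = to_interleaved(multi_image)
video_prompt, video_images = to_interleaved(video)
```

Under these assumptions, both scenarios pass through the same function, which is the sense in which one template can cover several input types.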

Taxonomy

Topics: Simulation and Modeling Applications · Human Motion and Animation · Video Analysis and Summarization

Methods: Focus