LLaVA-OneVision: Easy Visual Task Transfer

Bo Li; Yuanhan Zhang; Dong Guo; Renrui Zhang; Feng Li; Hao Zhang; Kaichen Zhang; Peiyuan Zhang; Yanwei Li; Ziwei Liu; Chunyuan Li

arXiv:2408.03326·cs.CV·October 29, 2024·27 cites

Open Access 2 Repos 10 Models 3 Datasets

TL;DR

LLaVA-OneVision is a versatile large multimodal model that excels across single-image, multi-image, and video tasks, demonstrating strong transfer learning and emerging capabilities in visual understanding.

Contribution

It introduces a unified model capable of handling diverse visual scenarios with effective transfer learning, advancing open large multimodal models.

Findings

01. First single model to excel in image, multi-image, and video tasks

02. Demonstrates strong transfer learning across modalities

03. Achieves new capabilities in video understanding

Abstract

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Teleoperation and Haptic Systems