LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

TL;DR
LLaVA-OneVision is a versatile large multimodal model that excels across single-image, multi-image, and video tasks, demonstrating strong transfer learning and emerging capabilities in visual understanding.
Contribution
It introduces a unified model capable of handling diverse visual scenarios with effective transfer learning, advancing open large multimodal models.
Findings
First single model to excel in image, multi-image, and video tasks
Demonstrates strong transfer learning across modalities
Achieves new capabilities in video understanding
Abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Code & Models
- 🤗 NCSOFT/VARCO-VISION-14B · model · 14 dl · ♡ 38
- 🤗 NCSOFT/VARCO-VISION-14B-HF · model · 19 dl · ♡ 30
- 🤗 NCSOFT/VARCO-VISION-2.0-14B · model · 355 dl · ♡ 46
- 🤗 NCSOFT/VARCO-VISION-2.0-1.7B · model · 1.8k dl · ♡ 21
- 🤗 NCSOFT/VARCO-VISION-2.0-1.7B-OCR · model · 671 dl · ♡ 29
- 🤗 lmms-lab/llava-onevision-qwen2-7b-ov · model · 107k dl · ♡ 62
- 🤗 lmms-lab/llava-onevision-qwen2-0.5b-si · model · 1.4k dl · ♡ 15
- 🤗 lmms-lab/llava-onevision-qwen2-7b-si · model · 1.9k dl · ♡ 12
- 🤗 lmms-lab/llava-onevision-qwen2-72b-si · model · 6 dl · ♡ 1
- 🤗 lmms-lab/llava-onevision-qwen2-72b-ov-sft · model · 308 dl · ♡ 15
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Teleoperation and Haptic Systems
