LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

TL;DR
LLaVA-Video is a new video large multimodal model trained on the synthetic LLaVA-Video-178K dataset in combination with existing visual instruction tuning data, offering an alternative to curating large amounts of high-quality raw video data from the web.
Contribution
We create LLaVA-Video-178K, a high-quality synthetic dataset for video instruction-following, and use it to train LLaVA-Video, a new video LMM that achieves strong performance across various video benchmarks.
Findings
Effective video instruction-following achieved
Synthetic data enhances model performance
Dataset, generation pipeline, and model checkpoints are planned for public release
Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
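The three task types named in the abstract (detailed captioning, open-ended QA, and multiple-choice QA) can be pictured as records in a single instruction-tuning format. The following is a minimal sketch under stated assumptions: the class `VideoInstructionSample`, its field names, the task labels, and the helper `to_prompt` are all hypothetical and chosen for illustration. They are not the released LLaVA-Video-178K schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical task labels mirroring the three task types named in the abstract.
TASKS = ("caption", "open_ended_qa", "multiple_choice_qa")


@dataclass
class VideoInstructionSample:
    """One illustrative sample; field names are assumptions, not the released schema."""
    video_path: str
    task: str
    instruction: str
    response: str
    options: List[str] = field(default_factory=list)  # used only for multiple-choice QA

    def __post_init__(self) -> None:
        # Reject task labels outside the three illustrative types.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        # A multiple-choice question needs at least two candidate answers.
        if self.task == "multiple_choice_qa" and len(self.options) < 2:
            raise ValueError("multiple-choice QA needs at least two options")


def to_prompt(sample: VideoInstructionSample) -> str:
    """Render the user-side text; options are lettered A, B, C, and so on."""
    if sample.task != "multiple_choice_qa":
        return sample.instruction
    lettered = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(sample.options)]
    return "\n".join([sample.instruction, *lettered])
```

For a multiple-choice sample, `to_prompt` appends the lettered options to the question, while captioning and open-ended QA samples pass the instruction through unchanged.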
Code & Models
- 🤗 lmms-lab/LLaVA-NeXT-Video-32B-Qwen (model) · 192 dl · ♡ 17
- 🤗 lmms-lab/LLaVA-Video-72B-Qwen2 (model) · 468 dl · ♡ 22
- 🤗 lmms-lab/LLaVA-Video-7B-Qwen2 (model) · 27k dl · ♡ 125
- 🤗 lmms-lab/LLaVA-Video-7B-Qwen2-Video-Only (model) · 658 dl · ♡ 6
- 🤗 ruili0/LLaVA-Video-7B-Qwen2-TPO (model) · 16 dl · ♡ 2
- 🤗 zooblastlbz/id-align (model)
- 🤗 QinHW/LLaVA-Video-7B-Qwen2 (model) · 1 dl
Taxonomy
Topics: Video Analysis and Summarization · Experimental Learning in Engineering · Digital Filter Design and Implementation
