ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

TL;DR
This paper introduces the ShareGPT4Video series: a large video-caption dataset, a captioning model, and a large video-language model (LVLM), all designed to improve video understanding and generation through dense, precise captions and a novel captioning strategy.
Contribution
The paper presents a new high-quality video captioning dataset, a scalable captioning model, and an LVLM that achieves state-of-the-art performance on three video benchmarks, addressing key challenges in temporal and detailed content understanding.
Findings
The ShareGPT4Video dataset contains 40K GPT4V-annotated videos with dense captions.
ShareCaptioner-Video efficiently generates high-quality captions for arbitrary videos.
ShareGPT4Video-8B achieves SOTA results on three video benchmarks.
Abstract
We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reached SOTA performance on three advancing video benchmarks. To achieve this, taking aside the non-scalable costly human annotators, we find using GPT4V to caption video with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporal-confused…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media
