ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Lin Chen; Xilin Wei; Jinsong Li; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Zehui Chen; Haodong Duan; Bin Lin; Zhenyu Tang; Li Yuan; Yu Qiao; Dahua Lin; Feng Zhao; Jiaqi Wang

arXiv:2406.04325 · cs.CV · June 7, 2024 · 3 cites

Open Access · 1 Repo · 2 Models · 3 Datasets · 1 Video

TL;DR

This paper introduces ShareGPT4Video, a comprehensive series including a large dataset, a captioning model, and an LVLM, all designed to enhance video understanding and generation through dense, precise captions and a novel captioning strategy.

Contribution

It presents a new high-quality video captioning dataset, a scalable captioning model, and an LVLM that achieves state-of-the-art performance on video benchmarks, addressing key challenges in temporal and detailed content understanding.

Findings

01. ShareGPT4Video dataset contains 40K annotated videos with rich captions.

02. ShareCaptioner-Video efficiently generates high-quality captions for arbitrary videos.

03. ShareGPT4Video-8B achieves SOTA results on three video benchmarks.

Abstract

We present the ShareGPT4Video series, which aims to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos of various lengths and sources, developed through carefully designed data filtering and annotation strategies. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reaches SOTA performance on three advancing video benchmarks. To achieve this, setting aside non-scalable and costly human annotators, we find that using GPT4V to caption videos with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused…

Code & Models

Repositories

sharegpt4omni/sharegpt4video · pytorch

Videos

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions · slideslive

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Subtitles and Audiovisual Media