Vript: A Video Is Worth Thousands of Words
Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, Hai Zhao

TL;DR
Vript introduces a comprehensive, densely annotated video-text dataset and a novel training paradigm that significantly enhances video captioning and understanding, achieving performance comparable to GPT-4V and establishing new benchmarks for complex video tasks.
Contribution
The paper presents Vript, a large-scale, detailed video-text dataset with script-like captions, and introduces Vriptor, a top-performing video captioning model, along with challenging new benchmarks for video understanding.
Findings
Vriptor achieves performance comparable to GPT-4V.
The Vript-Hard benchmarks are more challenging than existing benchmarks.
The dataset enables detailed video scripting, including camera operations.
Abstract
Advancements in multimodal learning, particularly in video understanding and generation, require high-quality video-text datasets for improved model performance. Vript addresses this need with a meticulously annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips. Each clip has a caption of ~145 words, which is over 10x longer than the captions in most video-text datasets. Unlike previous datasets, whose captions document only static content, we extend video captioning to video scripting by documenting not just the content but also the camera operations, which include shot types (medium shot, close-up, etc.) and camera movements (panning, tilting, etc.). Using Vript, we explore three training paradigms that align more text with the video modality than clip-caption pairs alone. This results in Vriptor, a top-performing video…
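To make the scale in the abstract concrete, the sketch below models one script-like caption as a small record and derives two rough corpus figures from the stated numbers (12K videos, over 420K clips, ~145 words per caption). The class name, field names, and helper function are hypothetical and are not taken from the Vript release; the derived figures are approximations, since the abstract gives only rounded counts.

```python
from dataclasses import dataclass


@dataclass
class ClipScript:
    """Hypothetical record for one script-like clip caption.

    Field names are illustrative assumptions, not the dataset's schema.
    """
    video_id: str
    clip_index: int
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"
    content: str          # description of what happens in the clip

    def word_count(self) -> int:
        # Whitespace-delimited word count of the content description.
        return len(self.content.split())


def corpus_estimates(num_videos: int, num_clips: int, words_per_caption: int) -> dict:
    """Rough corpus-level figures implied by the rounded counts in the abstract."""
    return {
        "clips_per_video": num_clips / num_videos,
        "total_caption_words": num_clips * words_per_caption,
    }


# Rounded figures quoted in the abstract.
estimates = corpus_estimates(num_videos=12_000, num_clips=420_000, words_per_caption=145)
print(estimates)
```

Under these rounded inputs, the corpus averages about 35 clips per video and roughly 61 million caption words in total.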
Taxonomy
Topics: Design Education and Practice · BIM and Construction Integration · Artistic and Creative Research
Methods: Contrastive Language-Image Pre-training
