Vript: A Video Is Worth Thousands of Words

Dongjie Yang; Suyuan Huang; Chengqiang Lu; Xiaodong Han; Haoxin Zhang; Yan Gao; Yao Hu; Hai Zhao

arXiv:2406.06040 · cs.CV · October 28, 2024 · 2 cites

TL;DR

Vript is a densely annotated video-text dataset (12K high-resolution videos, over 420K clips, captions of about 145 words each) paired with training paradigms that align more text with the video modality. The resulting captioning model, Vriptor, performs comparably to GPT-4V, and the accompanying Vript-Hard benchmarks are more challenging than existing datasets.

Contribution

The paper presents Vript, a large-scale, detailed video-text dataset with script-like captions, and introduces Vriptor, a top-performing video captioning model, along with challenging new benchmarks for video understanding.

Findings

01. Vriptor achieves performance comparable to GPT-4V.

02. Vript-Hard benchmarks are more challenging than existing datasets.

03. The dataset enables detailed video scripting, including camera operations.

Abstract

Advancements in multimodal learning, particularly in video understanding and generation, require high-quality video-text datasets for improved model performance. Vript addresses this issue with a meticulously annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips. Each clip has a caption of ~145 words, which is over 10x longer than the captions in most video-text datasets. Unlike captions in previous datasets, which document only static content, we extend video captioning to video scripting by documenting not just the content but also the camera operations, which include the shot types (medium shot, close-up, etc.) and camera movements (panning, tilting, etc.). Using Vript, we explore three training paradigms for aligning more text with the video modality rather than clip-caption pairs. This results in Vriptor, a top-performing video…
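To make the "video scripting" idea concrete, the sketch below models one clip's annotation as content plus camera operations and works out the scale implied by the abstract's rounded figures. The `ClipScript` class, its field names, the bracketed rendering, and the example values are all hypothetical illustrations, not Vript's actual schema or file format.

```python
from dataclasses import dataclass

# Rounded figures quoted in the abstract ("12K", "over 420K", "~145 words").
NUM_VIDEOS = 12_000
NUM_CLIPS = 420_000
WORDS_PER_CAPTION = 145


@dataclass
class ClipScript:
    """Hypothetical record for one clip; not Vript's actual schema."""
    video_id: str
    clip_index: int
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"
    content: str          # description of what the clip shows

    def to_script(self) -> str:
        """Render camera operations and content as one script-like line."""
        return f"[{self.shot_type}, {self.camera_movement}] {self.content}"


def dataset_scale() -> dict:
    """Back-of-the-envelope scale implied by the rounded figures above."""
    return {
        "clips_per_video": NUM_CLIPS / NUM_VIDEOS,
        "total_caption_words": NUM_CLIPS * WORDS_PER_CAPTION,
    }


example = ClipScript(
    video_id="demo_video",  # made-up identifier for illustration
    clip_index=0,
    shot_type="close-up",
    camera_movement="panning",
    content="A chef slices vegetables on a wooden board.",
)
print(example.to_script())
print(dataset_scale())
```

With these rounded inputs, the estimate comes to about 35 clips per video and roughly 61 million caption words in total; the true values depend on the exact counts, which the abstract gives only approximately.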

Code & Models

Repositories

mutonix/vript (PyTorch, official)

Videos

Vript: A Video Is Worth Thousands of Words · SlidesLive

Taxonomy

Topics: Design Education and Practice · BIM and Construction Integration · Artistic and Creative Research

Methods: Contrastive Language-Image Pre-training