LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Tianwei Xiong; Yuqing Wang; Daquan Zhou; Zhijie Lin; Jiashi Feng; Xihui Liu

arXiv:2410.10816·cs.CV·October 15, 2024

TL;DR

LVD-2M is a large-scale dataset of over 2 million long, high-quality videos with dense captions, designed to facilitate research in long video generation models.

Contribution

The paper introduces a pipeline for selecting high-quality long-take videos and generating temporally dense captions, and uses it to build LVD-2M, which it presents as the first large-scale long-take video dataset.

Findings

01. Validated dataset usefulness by fine-tuning video generation models.

02. Demonstrated dataset's ability to improve long video generation.

03. Provided metrics for assessing video quality.

Abstract

The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video…

Code & Models

Repositories

silentview/lvd-2m · Official

Videos

LVD-2M: A Long-take Video Dataset with Temporally Dense Captions · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging

Methods: Sparse Evolutionary Training