LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Tianwei Xiong, Yuqing Wang, Daquan Zhou, Zhijie Lin, Jiashi Feng, Xihui Liu

TL;DR
LVD-2M is a large-scale dataset of over 2 million long, high-quality videos with dense captions, designed to facilitate research in long video generation models.
Contribution
The paper introduces a pipeline for selecting high-quality long-take videos and generating temporally dense captions, and uses it to build LVD-2M, which it presents as the first large-scale long-take video dataset.
Findings
Validated dataset usefulness by fine-tuning video generation models.
Demonstrated dataset's ability to improve long video generation.
Provided metrics for assessing video quality.
Abstract
The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video…
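The four dataset criteria listed in the abstract can be read as a per-clip filter. The sketch below is only an illustration of that reading, not the authors' pipeline: the `ClipMeta` record, its field names, the `motion_score` scale, and the `min_motion` and `min_captions` thresholds are all hypothetical. Only the 10-second minimum and the no-cuts requirement come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClipMeta:
    # Hypothetical metadata record; these fields are NOT the LVD-2M schema.
    clip_id: str
    duration_s: float          # clip length in seconds
    num_cuts: int              # detected scene cuts inside the clip
    motion_score: float        # assumed motion measure, higher = more motion
    captions: List[str] = field(default_factory=list)  # time-ordered captions


def keep_clip(
    clip: ClipMeta,
    min_duration_s: float = 10.0,  # from the abstract: at least 10 seconds
    min_motion: float = 0.5,       # assumed threshold, not from the paper
    min_captions: int = 1,         # assumed threshold, not from the paper
) -> bool:
    """Return True if the clip meets all four criteria under these assumptions."""
    long_enough = clip.duration_s >= min_duration_s
    single_take = clip.num_cuts == 0          # long-take: no cuts
    enough_motion = clip.motion_score >= min_motion
    has_captions = len(clip.captions) >= min_captions
    return long_enough and single_take and enough_motion and has_captions


def filter_clips(clips: List[ClipMeta]) -> List[ClipMeta]:
    """Keep only clips that pass keep_clip with the default thresholds."""
    return [c for c in clips if keep_clip(c)]


if __name__ == "__main__":
    candidates = [
        ClipMeta("a", 12.0, 0, 0.8, ["a dog runs", "the dog jumps"]),
        ClipMeta("b", 6.0, 0, 0.9, ["a short clip"]),     # too short
        ClipMeta("c", 30.0, 2, 0.9, ["clip with cuts"]),  # contains cuts
        ClipMeta("d", 15.0, 0, 0.1, ["static scene"]),    # too little motion
        ClipMeta("e", 20.0, 0, 0.7),                      # no captions
    ]
    print([c.clip_id for c in filter_clips(candidates)])
```

In practice each field would come from a separate tool (a cut detector, a motion estimator, a captioning model); the sketch assumes those values have already been computed.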
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging
Methods: Sparse Evolutionary Training
