OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

Kepan Nan; Rui Xie; Penghao Zhou; Tiehan Fan; Zhenheng Yang; Zhijie Chen; Xiang Li; Jian Yang; Ying Tai

arXiv:2407.02371·cs.CV·February 14, 2025·2 cites

Open Access · 1 Model · 5 Datasets

TL;DR

This paper introduces OpenVid-1M, a high-quality, large-scale dataset for text-to-video generation, along with a novel Multi-modal Video Diffusion Transformer that effectively utilizes textual and visual information.

Contribution

The paper presents a new high-quality dataset for T2V generation and a multi-modal transformer model that better exploits semantic textual information.

Findings

01. OpenVid-1M outperforms previous datasets in T2V tasks.

02. MVDiT effectively captures semantic and structural information.

03. High-definition videos generated using OpenVidHD-0.4M demonstrate improved quality.

Abstract

Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) The lack of a precise, open-source, high-quality dataset. Previous popular video datasets, e.g. WebVid-10M and Panda-70M, are either of low quality or too large for most research institutions. It is therefore challenging but crucial to collect precise, high-quality text-video pairs for T2V generation. 2) The failure to fully utilize textual information. Recent T2V methods have focused on vision transformers, using a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million…
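
For readers unfamiliar with the "simple cross attention module" that the abstract refers to, the following is a minimal sketch of generic single-head cross-attention, in which visual tokens act as queries over text tokens. It is not the paper's MVDiT architecture. The function names, the toy vectors, and the use of identity projections in place of learned query/key/value matrices are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(video_tokens, text_tokens):
    """Single-head cross-attention with identity projections (illustrative only).

    video_tokens: list of query vectors, one per visual token.
    text_tokens:  list of key/value vectors, one per text token.
    Returns one vector per visual token, each a softmax-weighted
    average of the text token vectors.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product factor
    outputs = []
    for q in video_tokens:
        # Similarity of this visual token to every text token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        weights = softmax(scores)
        # Mix the text token vectors according to the attention weights.
        out = [sum(w * v[d] for w, v in zip(weights, text_tokens)) for d in range(dim)]
        outputs.append(out)
    return outputs


# Hypothetical toy inputs: 2 visual tokens attending over 3 text tokens.
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
conditioned = cross_attention(video, text)
```

In this sketch text only ever serves as keys and values, so each visual token receives a blend of text vectors and nothing flows back from video to text. A production module would add learned projections, multiple heads, and batching.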

Code & Models

Models

🤗 nkp37/OpenVid-1M · model · ♡ 11

Datasets

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Human Motion and Animation

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Diffusion · Position-Wise Feed-Forward Layer · Adam