MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions
Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

TL;DR
MiraData is a large, high-quality video dataset with long durations and detailed structured captions, designed to improve video generation and evaluation, especially for high-motion, long-duration videos.
Contribution
The paper introduces MiraData, a novel dataset with longer videos and detailed captions, and MiraBench, an enhanced benchmark with new metrics for assessing motion and temporal consistency.
Findings
MiraData outperforms existing datasets in video duration and caption detail.
MiraBench provides comprehensive metrics including 3D consistency and motion strength.
Experiments show MiraDiT benefits from MiraData, especially in motion quality.
Abstract
Sora's high-motion intensity and long consistent videos have significantly impacted the field of video generation, attracting unprecedented attention. However, existing publicly available datasets are inadequate for generating Sora-like videos, as they mainly contain short videos with low motion intensity and brief captions. To address these issues, we propose MiraData, a high-quality video dataset that surpasses previous ones in video duration, caption detail, motion strength, and visual quality. We curate MiraData from diverse, manually selected sources and meticulously process the data to obtain semantically consistent clips. GPT-4V is employed to annotate structured captions, providing detailed descriptions from four different perspectives along with a summarized dense caption. To better assess temporal consistency and motion intensity in video generation, we introduce MiraBench,…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
