ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access
Timing Yang, Sucheng Ren, Alan Yuille, and Feng Wang

TL;DR
ViMix-14M is a crawl-free video-text dataset of around 14 million pairs with long-form, high-quality captions, built to ease the data bottleneck for open-source text-to-video generation and related tasks.
Contribution
This work introduces ViMix-14M, a curated, multi-source, high-quality video-text dataset with long-form captions and crawl-free access, filling a critical gap in open-source video-text data resources.
Findings
Improves multimodal retrieval performance
Enhances text-to-video generation quality
Boosts video question answering accuracy
Abstract
Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing ViMix-14M, a curated multi-source video-text dataset of around 14 million pairs that provides crawl-free, download-ready access and long-form, high-quality captions tightly aligned to video. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering, and a multi-granularity, ground-truth-guided re-captioning pipeline that refines descriptions to better match actions, scenes, and temporal structure. We evaluate the dataset by multimodal…
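The merge, de-duplication, and quality-filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. The `Clip` record, the exact-match SHA-256 fingerprint, and the duration and caption-length thresholds are all assumptions made for illustration. The sketch also checks quality before de-duplicating so that a low-quality copy does not shadow a usable one, which is a choice of this example rather than something the abstract specifies.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one video-text pair (not the dataset's schema)."""
    source: str         # name of the originating dataset
    video_bytes: bytes  # stand-in for the video content
    caption: str
    duration_s: float


def fingerprint(clip: Clip) -> str:
    # Assumption: exact-duplicate detection via a content hash.
    # A real pipeline would more likely use perceptual or embedding similarity.
    return hashlib.sha256(clip.video_bytes).hexdigest()


def merge_dedup_filter(sources, min_duration_s=2.0, min_caption_words=5):
    """Merge several clip lists, keeping the first usable copy of each video.

    The thresholds are illustrative defaults, not values from the paper.
    """
    seen = set()
    kept = []
    for clips in sources:
        for clip in clips:
            # Quality filter: drop very short clips and very short captions.
            if clip.duration_s < min_duration_s:
                continue
            if len(clip.caption.split()) < min_caption_words:
                continue
            # Unified de-duplication across all sources.
            fp = fingerprint(clip)
            if fp in seen:
                continue
            seen.add(fp)
            kept.append(clip)
    return kept
```

Calling `merge_dedup_filter([source_a, source_b])` on two lists of `Clip` records returns a single list in which each distinct video appears once and every entry passes both thresholds.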
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
