ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access

Timing Yang, Sucheng Ren, Alan Yuille, and Feng Wang

arXiv:2511.18382 · cs.CV · November 25, 2025

TL;DR

ViMix-14M is a crawl-free, download-ready video-text dataset of around 14 million pairs with long-form captions. It targets the data bottleneck in open-source text-to-video generation and related tasks, where existing public datasets are hard to obtain and uneven in quality.

Contribution

This work introduces ViMix-14M, a curated, multi-source, high-quality video-text dataset with long-form captions and crawl-free access, filling a critical gap in open-source video-text data resources.

Findings

01 Improves multimodal retrieval performance

02 Enhances text-to-video generation quality

03 Boosts video question answering accuracy

Abstract

Text-to-video generation has surged in interest since Sora, yet open-source models still face a data bottleneck: there is no large, high-quality, easily obtainable video-text corpus. Existing public datasets typically require manual YouTube crawling, which yields low usable volume due to link rot and access limits, and raises licensing uncertainty. This work addresses this challenge by introducing ViMix-14M, a curated multi-source video-text dataset of around 14 million pairs that provides crawl-free, download-ready access and long-form, high-quality captions tightly aligned to video. ViMix-14M is built by merging diverse open video sources, followed by unified de-duplication and quality filtering, and a multi-granularity, ground-truth-guided re-captioning pipeline that refines descriptions to better match actions, scenes, and temporal structure. We evaluate the dataset by multimodal…
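
The merge, de-duplication, and quality-filtering stages named in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Clip` record, the `video_hash` fingerprint, the `quality` score, and both thresholds are hypothetical, and the re-captioning stage is not covered.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Set


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one video-text pair (not the dataset's real schema)."""
    source: str       # name of the originating open dataset
    video_hash: str   # content fingerprint; how it is computed is an assumption
    caption: str
    quality: float    # assumed quality score in [0, 1]


def merge_sources(*sources: Iterable[Clip]) -> Iterator[Clip]:
    """Concatenate clips from several sources, preserving source order."""
    for src in sources:
        yield from src


def deduplicate(clips: Iterable[Clip]) -> Iterator[Clip]:
    """Keep only the first clip seen for each fingerprint, across all sources."""
    seen: Set[str] = set()
    for clip in clips:
        if clip.video_hash not in seen:
            seen.add(clip.video_hash)
            yield clip


def quality_filter(clips: Iterable[Clip], min_quality: float,
                   min_caption_words: int) -> Iterator[Clip]:
    """Drop clips below the assumed quality score or with very short captions."""
    for clip in clips:
        if clip.quality >= min_quality and len(clip.caption.split()) >= min_caption_words:
            yield clip


def build_corpus(*sources: Iterable[Clip], min_quality: float = 0.5,
                 min_caption_words: int = 5) -> List[Clip]:
    """Merge, then de-duplicate, then filter (thresholds are illustrative)."""
    unique = deduplicate(merge_sources(*sources))
    return list(quality_filter(unique, min_quality, min_caption_words))


# Toy data: two sources that share one video (fingerprint "h1").
source_a = [
    Clip("a", "h1", "a dog runs across a grassy field", 0.9),
    Clip("a", "h2", "blurry clip", 0.2),
]
source_b = [
    Clip("b", "h1", "a dog running on grass in the park", 0.8),
    Clip("b", "h3", "a chef slices onions then adds them to a pan", 0.7),
]
corpus = build_corpus(source_a, source_b)
print([clip.video_hash for clip in corpus])
```

Running de-duplication over the merged stream, rather than per source, is what makes it "unified" in this sketch: a video that appears in two sources is kept only once, from whichever source is listed first.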

Code & Models

Datasets

TimingYang/ViMix-14M
dataset · 36 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling