A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang

TL;DR
This paper introduces TF-T2V, a text-to-video generation framework that can learn directly from text-free (unlabeled) videos. Adding text-free videos to the training set lowers FID from 9.67 to 8.19, and reintroducing text labels lowers it further to 7.64. The approach is effective for both native and compositional video synthesis.
Contribution
The paper proposes a text-to-video framework that learns from text-free videos by separating text decoding from temporal modeling. This lets training scale with unlabeled video data and improves generation quality.
Findings
Performance improved with larger video datasets (FID from 9.67 to 8.19)
Reintroducing text labels further enhances quality (FID from 8.19 to 7.64)
Effective on both native and compositional video synthesis
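To put the two FID changes above on a common scale, the short sketch below computes each one as a relative reduction. The helper name `relative_reduction` and the step labels are illustrative choices, not terms from the paper; only the four FID values come from the findings.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (lower FID is better)."""
    return (before - after) / before


# FID values quoted in the findings above; the labels are illustrative.
steps = {
    "add text-free videos": (9.67, 8.19),
    "reintroduce text labels": (8.19, 7.64),
}

for name, (before, after) in steps.items():
    print(f"{name}: {relative_reduction(before, after):.1%}")
```

Under this measure, adding text-free videos accounts for a drop of roughly 15%, and reintroducing text labels for a further drop of roughly 7% relative to the already improved score.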
Abstract
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos. The rationale behind is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with weights shared. Following such a pipeline, we study the effect of doubling the scale of training set (i.e., video-only WebVid10M) with…
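The abstract's phrase "jointly optimized with weights shared" can be illustrated with a deliberately tiny toy. This is not the TF-T2V model: the real branches are diffusion networks, whereas here two squared-error losses stand in for the content and motion objectives, and every name (`squared_error`, `joint_step`, the targets) is hypothetical. The sketch only shows the mechanics of one parameter set receiving gradients from two objectives at once.

```python
def squared_error(weights, target):
    """Return (loss, gradient) of a squared-error loss.

    Hypothetical stand-in for a real branch objective.
    """
    loss = sum((w - t) ** 2 for w, t in zip(weights, target))
    grad = [2.0 * (w - t) for w, t in zip(weights, target)]
    return loss, grad


def joint_step(weights, content_target, motion_target, lr=0.1):
    """Take one gradient step on the sum of both losses over the same weights."""
    content_loss, content_grad = squared_error(weights, content_target)
    motion_loss, motion_grad = squared_error(weights, motion_target)
    new_weights = [
        w - lr * (gc + gm)
        for w, gc, gm in zip(weights, content_grad, motion_grad)
    ]
    return new_weights, content_loss + motion_loss


# Both toy objectives update the same shared parameter list.
shared = [0.0, 0.0]
for _ in range(100):
    shared, total_loss = joint_step(shared, [1.0, 0.0], [0.0, 1.0])
print(shared, total_loss)
```

Because both losses pull on the same weights, the toy settles at a compromise between the two targets rather than satisfying either one alone. That is the general behavior of joint optimization with shared parameters, not a claim about how the paper balances its branches.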
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
