ModelScope Text-to-Video Technical Report
Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang

TL;DR
ModelScopeT2V is a novel text-to-video synthesis model that extends text-to-image techniques with spatio-temporal blocks, enabling consistent and smooth video generation adaptable to various datasets.
Contribution
It introduces a text-to-video model that extends Stable Diffusion with spatio-temporal blocks, totaling 1.7 billion parameters (0.5 billion of them temporal), and reports better results than prior methods on three evaluation metrics.
Findings
Outperforms state-of-the-art methods on three evaluation metrics
Supports variable frame numbers during training and inference
Demonstrates high-quality, smooth video synthesis
Abstract
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
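The abstract's claim that one network handles both images and videos can be illustrated with shape bookkeeping alone. The sketch below is not the authors' implementation. It assumes a latent tensor laid out as (batch, frames, channels, height, width), and a folding scheme common in video diffusion models: spatial layers see frames folded into the batch axis, while temporal layers see each pixel location as a sequence over frames. All function names and the example sizes are hypothetical.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def fold_for_spatial(batch: int, frames: int, channels: int,
                     height: int, width: int) -> Shape:
    """Shape seen by per-frame (spatial) layers: frames are folded into the batch."""
    return (batch * frames, channels, height, width)


def fold_for_temporal(batch: int, frames: int, channels: int,
                      height: int, width: int) -> Shape:
    """Shape seen by temporal layers: every pixel location is a sequence of frames."""
    return (batch * height * width, frames, channels)


def element_count(shape: Shape) -> int:
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def temporal_parameter_share(total_billion: float, temporal_billion: float) -> float:
    """Fraction of parameters devoted to temporal layers."""
    return temporal_billion / total_billion


if __name__ == "__main__":
    # Hypothetical sizes: 2 clips, 16 frames, 4 latent channels, 32x32 latents.
    video = (2, 16, 4, 32, 32)
    print("spatial view :", fold_for_spatial(*video))
    print("temporal view:", fold_for_temporal(*video))

    # A still image is the single-frame case; the temporal sequence has length 1.
    image = (2, 1, 4, 32, 32)
    print("image, spatial view :", fold_for_spatial(*image))
    print("image, temporal view:", fold_for_temporal(*image))

    # Parameter counts quoted in the abstract: 1.7B total, 0.5B temporal.
    print("temporal share:", round(temporal_parameter_share(1.7, 0.5), 3))
```

Because neither view fixes the frame count, the same weights apply to any number of frames, and an image-text sample is simply the single-frame case. With the counts from the abstract, the temporal layers account for roughly 29% of the parameters.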
Code & Models
- 🤗 ali-vilab/text-to-video-ms-1.7b (model) · 9.9k downloads · ♡ 653
- 🤗 ali-vilab/i2vgen-xl (model) · 1.6k downloads · ♡ 183
- 🤗 vdo/i2vgen-xl (model) · ♡ 2
- 🤗 longlian/text-to-video-lvd-ms (model) · 29 downloads · ♡ 3
- 🤗 longlian/text-to-video-lvd-zs (model) · 21 downloads · ♡ 3
- 🤗 kylielee505/myttvlns (model)
- 🤗 mailongwu/text-to-video-ms-1-7b-fnv (model) · 2 downloads
- 🤗 Suleman201/i2vgen-xl (model) · 5 downloads
- 🤗 isfs/i2vgen-xl-fp16 (model) · 23 downloads
- 🤗 aaftabazad612/text-to-video-ms-1.7b (model) · 12 downloads
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Motion and Animation
