ModelScope Text-to-Video Technical Report

Jiuniu Wang; Hangjie Yuan; Dayou Chen; Yingya Zhang; Xiang Wang; Shiwei Zhang

arXiv:2308.06571 · cs.CV · August 15, 2023 · 47 citations

TL;DR

ModelScopeT2V is a text-to-video synthesis model that extends a text-to-image model (Stable Diffusion) with spatio-temporal blocks, producing consistent frames and smooth motion while training on both image-text and video-text datasets.

Contribution

It introduces a text-to-video model with integrated spatio-temporal blocks and 1.7 billion parameters, 0.5 billion of which are dedicated to temporal capabilities, and reports improvements over existing methods in quality and flexibility.

Findings

01 Outperforms state-of-the-art methods across three evaluation metrics

02 Supports variable frame numbers during training and inference

03 Demonstrates high-quality, smooth video synthesis
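One way to see why a variable frame count lets the same network consume both images and videos is to track tensor shapes. The sketch below is only shape bookkeeping under an assumed factorized layout: spatial layers fold frames into the batch, temporal layers fold pixels into the batch. The function names, the (batch, frames, channels, height, width) ordering, and the example sizes are hypothetical and are not taken from the paper or its code.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def spatial_view(shape: Shape) -> Shape:
    """Fold frames into the batch axis.

    A 2D (image) layer then sees batch * frames independent images,
    so it behaves exactly like a text-to-image layer.
    """
    b, f, c, h, w = shape
    return (b * f, c, h, w)


def temporal_view(shape: Shape) -> Shape:
    """Fold pixel positions into the batch axis.

    A 1D (temporal) layer then sees batch * height * width sequences,
    each of length `frames`.
    """
    b, f, c, h, w = shape
    return (b * h * w, c, f)


# Hypothetical latent sizes chosen only for illustration.
video_latent = (2, 16, 4, 32, 32)  # 16-frame clip
image_latent = (2, 1, 4, 32, 32)   # an image treated as a 1-frame video

for name, shape in [("video", video_latent), ("image", image_latent)]:
    print(name, "spatial:", spatial_view(shape), "temporal:", temporal_view(shape))
```

With a single frame, the temporal view has sequence length 1, so a temporal layer has nothing to mix across time and the input is effectively processed as an image. Under this assumed layout, image-text and video-text batches can share one code path.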

Abstract

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
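The parameter figures in the abstract imply a simple budget, worked through in the short sketch below. The two constants come from the abstract; the helper name `parameter_budget` is hypothetical. The abstract does not say how the remaining parameters are split among the VQGAN, the text encoder, and the non-temporal parts of the UNet, so the sketch leaves that remainder as a single number.

```python
# Figures stated in the abstract, expressed in millions of parameters.
TOTAL_PARAMS_M = 1700     # 1.7 billion in total
TEMPORAL_PARAMS_M = 500   # 0.5 billion dedicated to temporal capabilities


def parameter_budget(total_m: int, temporal_m: int):
    """Return (non-temporal parameters in millions, temporal share of the total)."""
    non_temporal_m = total_m - temporal_m
    temporal_share = temporal_m / total_m
    return non_temporal_m, temporal_share


non_temporal_m, temporal_share = parameter_budget(TOTAL_PARAMS_M, TEMPORAL_PARAMS_M)
print(f"non-temporal: {non_temporal_m} M, temporal share: {temporal_share:.1%}")
```

Roughly 29% of the parameters are therefore devoted to temporal modeling, with the remaining 1.2 billion spread across the other components.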

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Human Motion and Animation