CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong; Ming Ding; Wendi Zheng; Xinghan Liu; Jie Tang

arXiv:2205.15868 · cs.CV · June 1, 2022 · 116 cites

TL;DR

CogVideo is a large-scale pretrained transformer model for text-to-video generation, leveraging transfer learning from text-to-image models and a hierarchical training strategy to effectively generate videos aligned with textual descriptions.

Contribution

This work introduces CogVideo, the first open-source large-scale pretrained text-to-video model, inheriting knowledge from a text-to-image model and employing a multi-frame-rate training approach.

Findings

01

Outperforms all publicly available models by a large margin in both machine and human evaluations.

02

Successfully leverages pretrained text-to-image models for video generation.

03

Demonstrates effective training strategies for complex video-text alignment.

Abstract

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
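
The abstract does not say how multi-frame-rate training is implemented. One plausible reading, sketched here purely as an illustration, is that each training clip holds a fixed number of frames sampled at a chosen frame rate, and that the rate is given to the model together with the text. The function names, the `<fps=...>` tag format, and the clip length of 5 are hypothetical and are not taken from the CogVideo code.

```python
from typing import List


def sample_frame_indices(
    num_video_frames: int,
    native_fps: float,
    target_fps: float,
    clip_len: int = 5,  # assumed clip length, for illustration only
    start: int = 0,
) -> List[int]:
    """Pick `clip_len` frame indices spaced to approximate `target_fps`."""
    if target_fps <= 0 or target_fps > native_fps:
        raise ValueError("target_fps must be in (0, native_fps]")
    # Number of native frames to skip between two sampled frames.
    stride = native_fps / target_fps
    indices = [start + round(i * stride) for i in range(clip_len)]
    if indices[-1] >= num_video_frames:
        raise ValueError("video too short for this frame rate and clip length")
    return indices


def build_conditioning_text(caption: str, target_fps: int) -> str:
    """Prepend a hypothetical frame-rate tag to the caption."""
    return f"<fps={target_fps}> {caption}"


if __name__ == "__main__":
    # The same 24 fps video yields clips covering different time spans.
    for fps in (1, 2, 8):
        print(fps, sample_frame_indices(100, 24, fps))
    print(build_conditioning_text("a lion drinking water", 8))
```

Under these assumptions, a low frame rate spreads the same number of frames over a longer stretch of video, while a high frame rate captures a short, dense segment, so the tag tells the model what temporal spacing to expect.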

Code & Models

Repositories

thudm/cogvideo · PyTorch · Official

Videos

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: ALIGN