A Simple Text to Video Model via Transformer

Gang Chen

arXiv:2309.14683 · cs.CV · September 27, 2023

TL;DR

This paper introduces a straightforward Transformer-based model for text-to-video generation. It encodes text and images into a shared hidden space, uses a U-Net to reconstruct images from their noised versions, and reports promising results on UCF101.

Contribution

The paper proposes a simple, unified Transformer framework for text-to-video synthesis that uses a U-Net for image reconstruction and adds a constraint promoting motion between generated frames.

Findings

01 Effective text-to-video generation demonstrated on UCF101

02 U-Net enhances image reconstruction in long sequences

03 Model captures temporal consistency in generated videos

Abstract

We present a general and simple text-to-video model based on the Transformer. Since both text and video are sequential data, we encode text and images into the same hidden space; these encodings are fed into a Transformer to capture temporal consistency and then into a decoder to generate either text or images. Because the image signal may become weak in a long sequence, we introduce a U-Net to reconstruct each image from its noised version. Specifically, we increase the noise level applied to the original images along the long sequence, then use the down module of the U-Net to encode the noised images, which are in turn input to the Transformer to predict the next clear images. We also add a constraint to promote motion between any generated image pair in the video. We use GPT2 and test our approach on the UCF101 dataset, showing that it can generate promising videos.
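The two training-side ideas in the abstract, noise that grows along the sequence and a constraint that promotes motion between frame pairs, can be illustrated with a minimal sketch. This is not the paper's implementation: the linear schedule, the hinge-style penalty, and the names `noise_schedule`, `add_noise`, `motion_penalty`, `sigma_min`, `sigma_max`, and `margin` are all assumptions chosen for illustration, and frames are represented as flat lists of floats rather than image tensors.

```python
import random
from itertools import combinations


def noise_schedule(num_frames, sigma_min=0.0, sigma_max=1.0):
    """Return one noise level per frame, rising linearly along the sequence.

    The linear shape and the default endpoints are assumptions; the abstract
    only says the noise level increases in the long sequence.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        return [sigma_min]
    step = (sigma_max - sigma_min) / (num_frames - 1)
    return [sigma_min + i * step for i in range(num_frames)]


def add_noise(frame, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to a flat frame."""
    return [pixel + rng.gauss(0.0, sigma) for pixel in frame]


def motion_penalty(frames, margin=0.1):
    """Hypothetical motion constraint: a hinge on pairwise mean squared difference.

    Each pair of frames contributes max(0, margin - msd), so the penalty is
    positive when frames are nearly identical and zero once they differ enough.
    """
    pairs = list(combinations(frames, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        msd = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        total += max(0.0, margin - msd)
    return total / len(pairs)


rng = random.Random(0)
clean = [[0.5] * 16 for _ in range(4)]   # four identical flat "frames"
levels = noise_schedule(len(clean))      # later frames get more noise
noised = [add_noise(f, s, rng) for f, s in zip(clean, levels)]
static_penalty = motion_penalty(clean)   # identical frames are penalised
```

In a full model, the noised frames would pass through the U-Net down module and the Transformer, and a penalty of this kind would be added to the reconstruction loss on the predicted frames. Those steps are omitted here because the abstract does not specify them.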

Code & Models

Repositories

vividitytech/text2videogpt
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Computational Physics and Python Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Concatenated Skip Connection · Layer Normalization · Label Smoothing · Convolution