A Simple Text to Video Model via Transformer
Gang Chen

TL;DR
This paper introduces a straightforward Transformer-based model for text-to-video generation. It encodes text and images into a shared hidden space, uses a U-Net to reconstruct images from their noised versions, and shows promising results on UCF101.
Contribution
The paper proposes a simple, unified Transformer framework for text-to-video synthesis that incorporates a U-Net for image reconstruction and adds a constraint that promotes motion between generated frames.
Findings
Effective text-to-video generation demonstrated on UCF101
U-Net enhances image reconstruction in long sequences
Model captures temporal consistency in generated videos
Abstract
We present a simple, general text-to-video model based on the Transformer. Since both text and video are sequential data, we encode text and images into the same hidden space; the encodings are fed into a Transformer to capture temporal consistency and then into a decoder to generate either text or images. Because the image signal may become weak over a long sequence, we introduce a U-Net to reconstruct each image from its noised version. Specifically, we increase the noise level applied to the original images along the sequence, use a module from the U-Net to encode the noised images, and feed the encodings to the Transformer to predict the next clean images. We also add a constraint to promote motion between any pair of generated images in the video. We use GPT-2 and test our approach on the UCF101 dataset, showing that it can generate promising videos.
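The two ideas in the abstract that are easiest to make concrete are the growing noise level along the sequence and the motion constraint. The following is a minimal NumPy sketch of both, not the paper's implementation: the linear schedule, the Gaussian noise, the hinge-style penalty, and the names `noise_levels`, `add_sequence_noise`, `motion_penalty`, `max_sigma`, and `margin` are all assumptions made for illustration, since the abstract does not give the exact formulas.

```python
import numpy as np


def noise_levels(num_frames, max_sigma=0.5):
    """Assumed schedule: noise std grows linearly from 0 to max_sigma."""
    return np.linspace(0.0, max_sigma, num_frames)


def add_sequence_noise(frames, max_sigma=0.5, seed=0):
    """Add Gaussian noise whose strength increases with frame index.

    frames: float array of shape (T, H, W, C).
    Returns a noised copy with the same shape.
    """
    rng = np.random.default_rng(seed)
    sigmas = noise_levels(frames.shape[0], max_sigma)
    noise = rng.standard_normal(frames.shape) * sigmas[:, None, None, None]
    return frames + noise


def motion_penalty(frames, margin=0.05):
    """Assumed hinge penalty over all frame pairs.

    A pair contributes max(0, margin - mean absolute difference), so the
    penalty is zero once every pair of frames differs by at least `margin`.
    """
    num_frames = frames.shape[0]
    if num_frames < 2:
        return 0.0
    flat = frames.reshape(num_frames, -1)
    total, pairs = 0.0, 0
    for i in range(num_frames):
        for j in range(i + 1, num_frames):
            diff = np.mean(np.abs(flat[i] - flat[j]))
            total += max(0.0, margin - diff)
            pairs += 1
    return total / pairs
```

In a training loop, the noised frames would be what the U-Net module encodes, and a penalty of this kind would be added to the reconstruction loss on generated frames. A static clip (all frames identical) receives the full `margin` as penalty, while a clip whose frames differ enough receives none.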
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Computational Physics and Python Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Concatenated Skip Connection · Layer Normalization · Label Smoothing · Convolution
