xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Can Qin; Congying Xia; Krithika Ramakrishnan; Michael Ryoo; Lifu Tu; Yihao Feng; Manli Shu; Honglu Zhou; Anas Awadalla; Jun Wang; Senthil Purushwalkam; Le Xue; Yingbo Zhou; Huan Wang; Silvio Savarese; Juan Carlos Niebles; Zeyuan Chen; Ran Xu; Caiming Xiong

arXiv:2408.12590·cs.CV·September 4, 2024

Open Access

TL;DR

xGen-VideoSyn-1 is a novel text-to-video synthesis model that uses compressed video representations and a divide-and-merge strategy to generate high-quality, long-duration videos efficiently from textual descriptions.

Contribution

The paper introduces VidVAE for video compression, a divide-and-merge approach for temporal consistency, and a Diffusion Transformer for robust video generation, advancing T2V capabilities.
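The divide-and-merge idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a long clip is cut into overlapping temporal segments, each segment is handled independently, and overlapping frames are averaged when the pieces are put back together. The function names (`split_segments`, `divide_and_merge`), the `process` callback, the segment length, and the averaging rule are all hypothetical; the summary here does not say how segments are merged.

```python
import numpy as np


def split_segments(num_frames, seg_len, overlap):
    """Return (start, end) frame ranges that cover the clip.

    Consecutive segments share `overlap` frames (an assumed scheme).
    """
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    stride = seg_len - overlap
    bounds = []
    start = 0
    while True:
        end = min(start + seg_len, num_frames)
        bounds.append((start, end))
        if end == num_frames:
            break
        start += stride
    return bounds


def divide_and_merge(video, process, seg_len, overlap):
    """Apply `process` to each segment and average the overlapping frames.

    `video` has shape (frames, ...); `process` is a hypothetical stand-in
    for a per-segment encoder or decoder and must preserve the shape.
    """
    out = np.zeros_like(video, dtype=np.float64)
    weight = np.zeros(video.shape[0], dtype=np.float64)
    for start, end in split_segments(video.shape[0], seg_len, overlap):
        out[start:end] += process(video[start:end])
        weight[start:end] += 1.0
    # Reshape the per-frame weights so they broadcast over the other axes.
    return out / weight.reshape(-1, *([1] * (video.ndim - 1)))


# Toy usage: 10 frames of 2x2 "pixels", identity processing.
clip = np.arange(10 * 2 * 2, dtype=np.float64).reshape(10, 2, 2)
merged = divide_and_merge(clip, lambda seg: seg, seg_len=4, overlap=1)
```

Processing short segments keeps peak memory tied to the segment length rather than the full clip, and the shared frames give neighbouring segments common content to agree on at the seams.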

Findings

01. Supports 720p video generation longer than 14 seconds

02. Achieves competitive performance against state-of-the-art models

03. Reduces computational costs with VidVAE and the divide-and-merge strategy

Abstract

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M…
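To see why compressing along time as well as space shortens the token sequence, a rough count helps. The sketch below is illustrative arithmetic only: the 24 fps frame rate, the 8x per-axis spatial factor, and the 4x temporal factor are assumptions chosen for the example, not figures taken from this page, and `latent_token_count` is a hypothetical helper.

```python
from math import ceil


def latent_token_count(frames, height, width, t_factor, s_factor):
    """Count latent positions after downsampling time by `t_factor`
    and each spatial axis by `s_factor` (one token per position)."""
    return (
        ceil(frames / t_factor)
        * ceil(height / s_factor)
        * ceil(width / s_factor)
    )


# Assumed clip: 14 seconds at 24 fps, 720p (720 x 1280).
frames = 14 * 24

# Spatial-only compression (assumed 8x per axis).
spatial_only = latent_token_count(frames, 720, 1280, t_factor=1, s_factor=8)

# Spatial plus temporal compression (assumed 4x in time).
spatio_temporal = latent_token_count(frames, 720, 1280, t_factor=4, s_factor=8)

print(spatial_only, spatio_temporal, spatial_only / spatio_temporal)
```

Because self-attention cost grows faster than linearly with sequence length, a shorter token sequence reduces compute by more than the compression ratio alone suggests.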

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings