xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

TL;DR
xGen-VideoSyn-1 is a novel text-to-video synthesis model that uses compressed video representations and a divide-and-merge strategy to generate high-quality, long-duration videos efficiently from textual descriptions.
Contribution
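The paper introduces three components: VidVAE, a video variational autoencoder that compresses video both spatially and temporally; a divide-and-merge strategy that maintains temporal consistency across video segments; and a Diffusion Transformer (DiT) with spatial and temporal self-attention layers that generalizes across different timeframes and aspect ratios.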
Findings
Supports over 14-second 720p video generation
Achieves competitive performance against state-of-the-art models
Reduces computational costs with VidVAE and divide-and-merge strategy
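To make the cost claim concrete, the sketch below counts the visual tokens a transformer would see for a 14-second 720p clip, with and without temporal compression. The frame rate (24 fps), the 4x temporal and 8x spatial strides, and the 2x2 patch size are illustrative assumptions, not figures stated on this page, and `latent_token_count` is a hypothetical helper.

```python
from math import ceil


def latent_token_count(frames, height, width, t_stride, s_stride, patch=1):
    """Count transformer tokens after VAE compression and patchification.

    All strides and the patch size are assumed values for illustration.
    """
    lat_t = ceil(frames / t_stride)   # latent frames after temporal compression
    lat_h = ceil(height / s_stride)   # latent height after spatial compression
    lat_w = ceil(width / s_stride)    # latent width after spatial compression
    return lat_t * ceil(lat_h / patch) * ceil(lat_w / patch)


# Assumed clip: 14 s at 24 fps, 1280x720 pixels.
frames = 14 * 24

# Spatial-only compression (an image-style VAE applied frame by frame).
spatial_only = latent_token_count(frames, 720, 1280, t_stride=1, s_stride=8, patch=2)

# Spatial plus temporal compression (a video VAE in the spirit of VidVAE).
spatio_temporal = latent_token_count(frames, 720, 1280, t_stride=4, s_stride=8, patch=2)

print(spatial_only, spatio_temporal, spatial_only / spatio_temporal)
```

Under these assumptions the token sequence shrinks by the temporal stride, and because self-attention cost grows roughly quadratically with sequence length, the saving in attention compute is larger than the saving in token count.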
Abstract
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M…
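The divide-and-merge idea can be sketched as splitting a long frame sequence into overlapping segments, processing each segment independently, and merging the results. The abstract does not specify how segments are merged, so the averaging of overlapping frames below is an illustrative stand-in rather than the authors' method. `split_segments` and `merge_segments` are hypothetical names, and plain floats stand in for frames.

```python
def split_segments(num_frames, segment_len, overlap):
    """Return (start, end) index pairs covering num_frames with overlap."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    step = segment_len - overlap
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += step
    return segments


def merge_segments(segments, outputs, num_frames):
    """Merge per-segment outputs, averaging frames that segments share."""
    total = [0.0] * num_frames
    count = [0] * num_frames
    for (start, end), values in zip(segments, outputs):
        for i, value in zip(range(start, end), values):
            total[i] += value
            count[i] += 1
    return [t / c for t, c in zip(total, count)]


# Ten frames, segments of four frames, one shared frame between neighbours.
segs = split_segments(10, segment_len=4, overlap=1)

# Stand-in for per-segment encoding: each "frame" is just its own index.
outs = [[float(i) for i in range(s, e)] for s, e in segs]

merged = merge_segments(segs, outs, 10)
print(segs)
print(merged)
```

Processing fixed-length segments keeps peak memory bounded by the segment length rather than the full video length, while the shared frames give neighbouring segments common context at their boundaries.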
Taxonomy
Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Adam · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Absolute Position Encodings
