CascadeV: An Implementation of Wurstchen Architecture for Video Generation

Wenfeng Lin; Jiangchuan Wei; Boyuan Liu; Yichen Zhang; Shiyue Yan; Mingyu Guo

arXiv:2501.16612·cs.CV·January 29, 2025

TL;DR

CascadeV introduces a cascaded latent diffusion model for high-resolution, high-quality video generation, effectively reducing computational costs and enhancing spatial-temporal consistency, with potential for resolution and frame rate scaling.

Contribution

The paper presents CascadeV, a novel cascaded latent diffusion framework with a spatiotemporal attention mechanism, enabling efficient 2K video generation and scalable resolution and frame rate improvements.

Findings

01. Achieves state-of-the-art 2K video quality.

02. Reduces computational demands for high-resolution videos.

03. Enables 4x increase in resolution or frame rate through cascading.

Abstract

Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames.…
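
To illustrate why a higher compression ratio matters for computational cost, the sketch below counts latent tokens for a video at two spatial compression factors and compares the cost of full self-attention, which grows with the square of the token count. The clip size (8 frames at 2048x1024) and the compression factors (8 and 32) are hypothetical values chosen for the arithmetic; they are not taken from the paper, and the helper names are illustrative only.

```python
def latent_tokens(frames, height, width, spatial_factor, temporal_factor=1):
    """Count latent positions after compressing a video.

    All arguments are illustrative assumptions, not CascadeV settings.
    Ceiling division covers sizes that are not exact multiples.
    """
    t = -(-frames // temporal_factor)
    h = -(-height // spatial_factor)
    w = -(-width // spatial_factor)
    return t * h * w


def relative_attention_cost(tokens_a, tokens_b):
    """Ratio of pairwise attention cost, which is quadratic in token count."""
    return (tokens_a / tokens_b) ** 2


# Hypothetical clip: 8 frames at 2048x1024 pixels.
frames, height, width = 8, 1024, 2048

low_compression = latent_tokens(frames, height, width, spatial_factor=8)
high_compression = latent_tokens(frames, height, width, spatial_factor=32)

print(low_compression, high_compression)
print(relative_attention_cost(low_compression, high_compression))
```

Under these assumed numbers, moving from 8x to 32x spatial compression cuts the token count by a factor of 16 and the cost of full self-attention by a factor of 256, which is the general effect the abstract attributes to a higher compression ratio.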

Code & Models

Repositories

bytedance/cascadev
PyTorch · Official

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology

Methods: Softmax · Attention Is All You Need · Diffusion · Latent Diffusion Model