CascadeV: An Implementation of Wurstchen Architecture for Video Generation
Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo

TL;DR
CascadeV introduces a cascaded latent diffusion model for high-resolution, high-quality video generation, effectively reducing computational costs and enhancing spatial-temporal consistency, with potential for resolution and frame rate scaling.
Contribution
The paper presents CascadeV, a novel cascaded latent diffusion framework with a spatiotemporal attention mechanism, enabling efficient 2K video generation and scalable resolution and frame rate improvements.
Findings
Achieves state-of-the-art 2K video quality.
Reduces computational demands for high-resolution videos.
Enables 4x increase in resolution or frame rate through cascading.
Abstract
Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames.…
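The abstract names a spatiotemporal alternating attention mechanism but does not spell out its grid layout. As a rough illustration only, the sketch below shows one common way to alternate attention over space and time on a video latent, and why this is cheaper than attending over all tokens jointly. Everything here is an assumption made for illustration, not the authors' implementation: the function names (`self_attention`, `alternating_block`, `pair_counts`), the latent shape `(T, N, D)`, the residual connections, and the omission of learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # tokens: (batch, length, dim). Learned Q/K/V projections are
    # omitted (assumption) to keep the sketch short.
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ tokens

def alternating_block(latent):
    # latent: (T frames, N spatial tokens per frame, D channels).
    # Spatial pass: each frame attends over its own N tokens.
    spatial = latent + self_attention(latent)
    # Temporal pass: each spatial location attends over the T frames.
    swapped = spatial.transpose(1, 0, 2)          # (N, T, D)
    temporal = swapped + self_attention(swapped)
    return temporal.transpose(1, 0, 2)            # back to (T, N, D)

def pair_counts(T, N):
    # Query-key pairs scored by joint attention over all T*N tokens
    # versus one spatial pass plus one temporal pass.
    full = (T * N) ** 2
    factorized = T * N**2 + N * T**2
    return full, factorized

rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 16, 4))
out = alternating_block(latent)
full, factorized = pair_counts(8, 16)
```

With 8 frames of 16 tokens each, joint attention scores 16384 query-key pairs while the alternating scheme scores 3072, and the gap widens as resolution and frame count grow.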
Taxonomy
Topics: Video Analysis and Summarization · Multimedia Communication and Technology
Methods: Softmax · Attention Is All You Need · Diffusion · Latent Diffusion Model
