Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang

TL;DR
HiGen introduces a hierarchical decoupling approach for text-to-video generation, separating spatial and temporal factors at structure and content levels to improve realism, diversity, and stability of generated videos.
Contribution
The paper proposes a novel hierarchical decoupling framework that separates spatial and temporal reasoning, enhancing the quality and stability of text-to-video synthesis.
Findings
Outperforms state-of-the-art T2V methods in quality and stability
Reduces the complexity of the video generation task by decoupling spatial and temporal factors
Generates semantically accurate and temporally stable videos
Abstract
Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the…
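The structure-level decomposition described in the abstract (spatial reasoning first, then temporal reasoning, both through one shared denoiser) can be illustrated with a minimal toy sketch. This is not the authors' implementation: `encode_text`, `toy_denoiser`, `spatial_reasoning`, `temporal_reasoning`, and `generate` are hypothetical names, frames are flat lists of floats, and a simple hand-written update rule stands in for a learned diffusion denoiser. Only the control flow is meant to mirror the text.

```python
import random


def encode_text(prompt, size):
    """Hypothetical text encoder: a deterministic pseudo-embedding of the prompt."""
    rng = random.Random(prompt)  # a str seed is hashed deterministically
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]


def toy_denoiser(frames, cond, temporal):
    """Stand-in for the shared denoiser (an assumption, not HiGen's network).

    frames:   list of frames, each a list of floats
    cond:     conditioning signal with the same length as a frame
    temporal: False -> frames are refined independently (spatial mode)
              True  -> frames are also pulled toward the mean frame
    """
    n = len(frames)
    mean_frame = [sum(f[i] for f in frames) / n for i in range(len(cond))]
    out = []
    for frame in frames:
        new_frame = []
        for i, v in enumerate(frame):
            v = v + 0.5 * (cond[i] - v)  # move toward the conditioning signal
            if temporal:
                v = v + 0.5 * (mean_frame[i] - v)  # encourage cross-frame coherence
            new_frame.append(v)
        out.append(new_frame)
    return out


def spatial_reasoning(prompt, size, steps, rng):
    """Step 1: build a single-frame spatial prior from the text."""
    text = encode_text(prompt, size)
    x = [[rng.gauss(0.0, 1.0) for _ in range(size)]]  # one noisy frame
    for _ in range(steps):
        x = toy_denoiser(x, text, temporal=False)
    return x[0]


def temporal_reasoning(prior, num_frames, steps, rng):
    """Step 2: generate a frame sequence conditioned on the spatial prior."""
    x = [[rng.gauss(0.0, 1.0) for _ in prior] for _ in range(num_frames)]
    for _ in range(steps):
        x = toy_denoiser(x, prior, temporal=True)
    return x


def generate(prompt, num_frames=4, size=8, steps=20, seed=0):
    """Run both steps with the same denoiser function."""
    rng = random.Random(seed)
    prior = spatial_reasoning(prompt, size, steps, rng)
    video = temporal_reasoning(prior, num_frames, steps, rng)
    return prior, video


if __name__ == "__main__":
    prior, video = generate("a cat running on grass")
    print(len(video), "frames of size", len(video[0]))
```

The point of the sketch is the design choice the abstract names: the same denoiser function is called in both steps, and only the conditioning signal (text first, then the spatial prior) and the temporal flag change between them.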
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Video Analysis and Summarization
Methods: Diffusion
