Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Zhiwu Qing; Shiwei Zhang; Jiayu Wang; Xiang Wang; Yujie Wei; Yingya Zhang; Changxin Gao; Nong Sang

arXiv:2312.04483 · cs.CV · December 8, 2023 · 1 citation

TL;DR

HiGen introduces a hierarchical decoupling approach for text-to-video generation, separating spatial and temporal factors at structure and content levels to improve realism, diversity, and stability of generated videos.

Contribution

The paper proposes a novel hierarchical decoupling framework that separates spatial and temporal reasoning, enhancing the quality and stability of text-to-video synthesis.

Findings

01. Outperforms state-of-the-art T2V methods in quality and stability
02. Effectively reduces the complexity of the video generation task
03. Generates semantically accurate and temporally stable videos

Abstract

Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the…
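The structure-level split described in the abstract (spatial reasoning first, then temporal reasoning, both through one denoiser) can be illustrated with a toy sketch. This is not HiGen's model or the VGen API: `toy_denoiser`, `two_step_generate`, the `mode` switch, and the simple pull-toward-target update are all hypothetical stand-ins for a real diffusion sampler, and the "text embedding" here is just an array the size of one frame.

```python
import numpy as np


def toy_denoiser(x, cond, mode):
    """Hypothetical unified denoiser: one function serving both steps.

    The update rule (move halfway toward the condition) is a toy
    replacement for a learned diffusion denoising step.
    """
    if mode == "spatial":
        # x: (H, W) noisy image, cond: (H, W) stand-in for the text condition
        return x + 0.5 * (cond - x)
    if mode == "temporal":
        # x: (T, H, W) noisy video, cond: (H, W) spatial prior from step 1
        return x + 0.5 * (cond[None] - x)
    raise ValueError(f"unknown mode: {mode}")


def two_step_generate(text_embedding, num_frames, steps=8, seed=0):
    """Toy two-step generation: spatial prior first, then frames from it."""
    rng = np.random.default_rng(seed)

    # Step 1 (spatial reasoning): denoise a single image using the text.
    prior = rng.standard_normal(text_embedding.shape)
    for _ in range(steps):
        prior = toy_denoiser(prior, text_embedding, mode="spatial")

    # Step 2 (temporal reasoning): denoise all frames using the prior.
    video = rng.standard_normal((num_frames, *text_embedding.shape))
    for _ in range(steps):
        video = toy_denoiser(video, prior, mode="temporal")

    return prior, video


if __name__ == "__main__":
    prior, video = two_step_generate(np.ones((4, 4)), num_frames=3)
    print(prior.shape, video.shape)
```

The point of the sketch is only the control flow: the second loop never sees the text directly, it conditions on the output of the first, which is how the abstract describes temporal reasoning building on spatial priors.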

Code & Models

Repositories

ali-vilab/VGen (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Video Analysis and Summarization

Methods: Diffusion