LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu

TL;DR
LongVie is an autoregressive framework for controllable ultra-long video generation. It targets temporal consistency and visual quality through multi-modal control, and it is accompanied by a new benchmark, LongVGenBench.
Contribution
It introduces unified noise initialization, global control signal normalization, multi-modal guidance, and a degradation-aware training strategy for long video synthesis.
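The difference between separate and unified noise initialization can be illustrated with a small sketch. This is not the authors' implementation: it assumes NumPy, a hypothetical latent shape, and the simplest reading of "unified", namely that every clip starts from one shared noise tensor instead of an independently sampled one. The function names are made up for illustration.

```python
import numpy as np


def separate_noise(num_clips, latent_shape, seed=0):
    """Baseline: each clip draws its own independent Gaussian noise."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal(latent_shape) for _ in range(num_clips)]


def unified_noise(num_clips, latent_shape, seed=0):
    """Assumed reading of 'unified': sample once, reuse for every clip."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(latent_shape)
    # Copy so that later in-place edits to one clip do not affect the others.
    return [shared.copy() for _ in range(num_clips)]


if __name__ == "__main__":
    shape = (2, 4, 4)  # hypothetical (frames, height, width) latent shape
    sep = separate_noise(3, shape)
    uni = unified_noise(3, shape)
    print("separate clips share noise:", np.array_equal(sep[0], sep[1]))
    print("unified clips share noise:", np.array_equal(uni[0], uni[1]))
```

Under this assumption, consecutive clips in the autoregressive rollout begin denoising from the same starting point, which is one plausible way to reduce clip-to-clip drift.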
Findings
Achieves state-of-the-art controllability and consistency
Maintains high visual quality over long videos
Introduces LongVGenBench benchmark with 100 videos
Abstract
Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control…
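To make the contrast between independent and global control signal normalization concrete, here is a minimal sketch. It assumes NumPy, a depth-like control signal stored as one array per clip, and plain min-max scaling; the abstract does not state which statistics LongVie actually uses, so the scaling rule and the function names are illustrative assumptions only.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero on constant signals


def normalize_per_clip(clips):
    """Independent normalization: each clip is scaled by its own min and max."""
    out = []
    for clip in clips:
        lo, hi = clip.min(), clip.max()
        out.append((clip - lo) / max(hi - lo, EPS))
    return out


def normalize_global(clips):
    """Global normalization: one min and max shared across the whole video."""
    lo = min(clip.min() for clip in clips)
    hi = max(clip.max() for clip in clips)
    scale = max(hi - lo, EPS)
    return [(clip - lo) / scale for clip in clips]


if __name__ == "__main__":
    # Two toy clips; the raw value 2.0 appears at the end of the first clip
    # and at the start of the second.
    clips = [np.array([1.0, 2.0]), np.array([2.0, 4.0])]
    print(normalize_per_clip(clips))  # 2.0 maps to different values per clip
    print(normalize_global(clips))    # 2.0 maps to the same value in both
```

With per-clip scaling, the same raw control value lands at opposite ends of the normalized range on either side of a clip boundary, which is the kind of misalignment in control space that the abstract attributes to independent normalization. Sharing the statistics removes that jump.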
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Vision and Imaging
