LongVie 2: Multimodal Controllable Ultra-Long Video World Model

Jianxiong Gao; Zhaoxi Chen; Xian Liu; Junhao Zhuang; Chengming Xu; Jianfeng Feng; Yu Qiao; Yanwei Fu; Chenyang Si; Ziwei Liu

arXiv:2512.13604·cs.CV·December 16, 2025

TL;DR

LongVie 2 is a multimodal autoregressive framework for controllable, long-term, high-quality video generation that advances the state of the art in video world modeling.

Contribution

It introduces a three-stage training process incorporating multi-modal guidance, degradation-aware training, and history-context guidance for improved controllability and temporal consistency.
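As a rough illustration of the ideas named above, the following toy sketch shows clip-by-clip autoregressive generation in which each clip starts from the last frame of the previous one and receives a short window of earlier frames as history. It is not the LongVie 2 implementation: `toy_generator`, `degrade`, `rollout`, the `history_len` parameter, and the use of Gaussian noise as the degradation are all hypothetical stand-ins chosen for this example, and frames are plain lists of floats rather than images.

```python
import random
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image: pixel values in [0, 1]
Clip = List[Frame]


def clamp(value: float) -> float:
    """Keep a pixel value inside [0, 1]."""
    return min(1.0, max(0.0, value))


def degrade(frame: Frame, noise_std: float, rng: random.Random) -> Frame:
    """Hypothetical degradation: perturb the input frame with Gaussian noise.

    The assumed purpose is to make training inputs resemble the imperfect
    frames a model would feed back to itself during long rollouts.
    """
    return [clamp(p + rng.gauss(0.0, noise_std)) for p in frame]


def toy_generator(first_frame: Frame, controls: Sequence[float],
                  history: Sequence[Frame]) -> Clip:
    """Placeholder for a video model (assumption, not the paper's network).

    Each control value shifts the brightness of the running frame.
    The history argument is accepted but unused in this toy version.
    """
    clip: Clip = []
    current = first_frame
    for c in controls:
        current = [clamp(p + c) for p in current]
        clip.append(current)
    return clip


def rollout(first_frame: Frame, control_chunks: Sequence[Sequence[float]],
            generator: Callable[[Frame, Sequence[float], Sequence[Frame]], Clip] = toy_generator,
            history_len: int = 2) -> Clip:
    """Generate a long video one clip at a time."""
    video: Clip = []
    start = first_frame
    for controls in control_chunks:
        history = video[-history_len:]  # frames from earlier clips
        clip = generator(start, controls, history)
        if not clip:                    # skip empty control chunks
            continue
        video.extend(clip)
        start = clip[-1]                # the next clip continues from here
    return video


if __name__ == "__main__":
    rng = random.Random(0)
    noisy_start = degrade([0.2, 0.8], noise_std=0.05, rng=rng)
    frames = rollout(noisy_start, [[0.1, 0.1], [0.1], [-0.2, -0.2]])
    print(len(frames), "frames generated")
```

In this sketch the degradation is applied only to the starting frame; how and where the actual training pipeline degrades its inputs is not specified in the text above.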

Findings

01

Achieves state-of-the-art long-range controllability and coherence.

02

Supports continuous video generation up to five minutes.

03

Introduces LongVGenBench, a benchmark of diverse high-resolution videos.

Abstract

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability and then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, bridging the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to…

Code & Models

Models

🤗 Vchitect/LongVie2 · model · ♡ 26

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning