What Happens Next? Next Scene Prediction with a Unified Video Model
Xinjie Li, Zhimin Chen, Rui Zhao, Florian Schiffers, Zhenyu Liao, Vimal Bhat

TL;DR
This paper introduces Next Scene Prediction (NSP), a new task for unified video models: given the preceding context, the model must predict a plausible future scene, which demands temporal and causal reasoning rather than text-conditioned synthesis alone.
Contribution
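The paper proposes the NSP task and a unified model that pairs Qwen-VL (comprehension) with LTX (synthesis), bridged by a latent query embedding and a connector module. The model is trained on a newly curated, large-scale NSP dataset with a new reward to improve future scene prediction.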
Findings
Achieves state-of-the-art performance on the NSP benchmark
Demonstrates improved temporal and causal reasoning in video understanding
Advances generalist multimodal systems' ability to predict future scenes
Abstract
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement…
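The bridging idea in the abstract (a latent query embedding plus a connector between the comprehension and synthesis models) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify dimensions, the number of queries, or the connector's form, so `toy_backbone`, `connector`, `build_conditioning`, and all sizes below are hypothetical stand-ins. The sketch only assumes a common pattern in which learnable query vectors are appended to the context, passed through the understanding model, and their output states are projected into the generator's conditioning space.

```python
import random

def init_matrix(rows, cols, rng):
    """Random matrix standing in for learned parameters or embeddings."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def toy_backbone(tokens):
    """Hypothetical stand-in for the comprehension model (NOT Qwen-VL).
    Each output mixes its token with the sequence mean, so query outputs
    depend on the preceding context."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + mean[d] for d in range(dim)] for t in tokens]

def connector(hidden, weight, bias):
    """Assumed connector: one linear map from the understanding dimension
    to the generator's conditioning dimension. weight is (gen_dim, und_dim)."""
    return [
        [sum(h * w for h, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in hidden
    ]

def build_conditioning(context, queries, weight, bias):
    """Append latent queries to the context, run the backbone, keep only
    the query positions, and project them for the video generator."""
    sequence = context + queries
    hidden = toy_backbone(sequence)
    query_hidden = hidden[-len(queries):]
    return connector(query_hidden, weight, bias)

# Illustrative sizes only; none of these come from the paper.
UND_DIM, GEN_DIM, NUM_QUERIES = 8, 4, 3
rng = random.Random(0)
queries = init_matrix(NUM_QUERIES, UND_DIM, rng)   # learnable latent queries
weight = init_matrix(GEN_DIM, UND_DIM, rng)        # connector weights
bias = [0.0] * GEN_DIM
context = init_matrix(5, UND_DIM, rng)             # embedded preceding context

cond = build_conditioning(context, queries, weight, bias)
print(len(cond), len(cond[0]))  # number of queries, generator conditioning dim
```

In a real system the conditioning vectors would be consumed by the video generator in place of, or alongside, its usual text conditioning; how the paper does this is not stated in the abstract.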
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
