What Happens Next? Next Scene Prediction with a Unified Video Model

Xinjie Li; Zhimin Chen; Rui Zhao; Florian Schiffers; Zhenyu Liao; Vimal Bhat

arXiv:2512.13015 · cs.CV · December 16, 2025

Open Access

TL;DR

This paper introduces Next Scene Prediction (NSP), a new task in which a unified video model must predict a plausible future scene from the preceding context, a setting that demands temporal and causal reasoning rather than text-to-video generation alone.

Contribution

The paper proposes a novel NSP task and a unified model combining Qwen-VL and LTX, trained on a large-scale dataset with a new reward, to enhance future scene prediction capabilities.

Findings

01. Achieves state-of-the-art performance on the NSP benchmark

02. Demonstrates improved temporal and causal reasoning in video understanding

03. Advances the ability of generalist multimodal systems to predict future scenes

Abstract

Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement…
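As a rough illustration of the bridging idea in the abstract (a latent query embedding plus a connector module between the comprehension and synthesis models), here is a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the function names, the dimensions, the toy stand-in for the understanding model, and the choice of a single linear layer as the connector. It does not reproduce the paper's architecture or call Qwen-VL or LTX.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def toy_understanding_model(tokens):
    """Hypothetical stand-in for the comprehension model.

    Each token is mixed with the sequence mean, so the hidden states at
    the query positions depend on the preceding context.
    """
    dim = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [[0.5 * t[i] + 0.5 * mean[i] for i in range(dim)] for t in tokens]


def build_condition(context_tokens, latent_queries, weight, bias):
    """Append latent queries to the context, read back their hidden
    states, and project them with the connector (assumed linear here)."""
    sequence = context_tokens + latent_queries
    hidden = toy_understanding_model(sequence)
    query_hidden = hidden[len(context_tokens):]  # keep query positions only
    return [linear(h, weight, bias) for h in query_hidden]


# Illustrative sizes only; none of these come from the paper.
HIDDEN_DIM, COND_DIM, NUM_QUERIES, NUM_CONTEXT = 8, 4, 3, 5

rng = random.Random(0)
context = init_matrix(NUM_CONTEXT, HIDDEN_DIM, rng)   # embedded preceding scene
queries = init_matrix(NUM_QUERIES, HIDDEN_DIM, rng)   # latent query embeddings
connector_w = init_matrix(COND_DIM, HIDDEN_DIM, rng)
connector_b = [0.0] * COND_DIM

condition = build_condition(context, queries, connector_w, connector_b)
print(len(condition), len(condition[0]))  # 3 4
```

In this sketch the generator would receive `condition` (one vector per latent query) as its conditioning signal in place of a plain text embedding; how the actual model wires this up is described in the paper itself.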

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning