Fostering Video Reasoning via Next-Event Prediction
Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang

TL;DR
This paper introduces next-event prediction (NEP), a self-supervised learning task for multimodal large language models (MLLMs) to improve their temporal reasoning over videos, supported by a new dataset and evaluation benchmark.
Contribution
The paper proposes NEP as a novel self-supervised task for training MLLMs on video reasoning, along with a new dataset V1-33K and the FutureBench benchmark for evaluation.
Findings
NEP effectively enhances temporal reasoning in MLLMs.
The curated dataset V1-33K supports diverse real-world scenarios.
FutureBench provides a reliable metric for future event prediction coherence.
Abstract
Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000…
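The past/future split described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the names `NEPSample`, `build_nep_sample`, `summarize_future`, and the `split_ratio` parameter are hypothetical, and the summarizer is a stand-in for whatever captioning step produces the event summary from the future frames.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")  # any frame representation (index, array, file path, ...)


@dataclass
class NEPSample:
    """One hypothetical training pair for next-event prediction."""
    past_frames: List  # what the model sees as input
    target_summary: str  # text the model is trained to predict


def build_nep_sample(
    frames: Sequence[Frame],
    summarize_future: Callable[[List[Frame]], str],
    split_ratio: float = 0.5,  # assumed value; the source does not state a ratio
) -> NEPSample:
    """Split a video into past and future frames and build one NEP pair.

    The future frames are never given to the model; they are only used to
    derive the target summary, which is what makes the signal self-supervised.
    """
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must be strictly between 0 and 1")
    if len(frames) < 2:
        raise ValueError("need at least two frames to form past and future")

    # Clamp the cut so both segments contain at least one frame.
    cut = max(1, min(len(frames) - 1, int(len(frames) * split_ratio)))
    past, future = list(frames[:cut]), list(frames[cut:])
    return NEPSample(past_frames=past, target_summary=summarize_future(future))


if __name__ == "__main__":
    # Toy usage: frames are just indices, and the "summary" counts future frames.
    sample = build_nep_sample(list(range(10)), lambda f: f"{len(f)} future frames")
    print(sample)
```

In this sketch the model input is `past_frames` and the supervision target is `target_summary`; how the summary is actually derived from the future frames is left to the hypothetical `summarize_future` callable.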
Peer Reviews
Decision · ICLR 2026 Poster
1. The motivation of the paper is clear and the paper is well written. 2. Different tuning strategies have been explored and results are reported on several existing benchmarks and the proposed benchmark. 3. It shows improvement on temporal reasoning tasks when training with NEP.
1. The generated future captions may reflect language priors (“after running, people usually jump”) rather than visual inference, so the causal reasoning claim remains unsubstantiated to me. It would be helpful to see an analysis showing whether NEP-trained models actually attend to temporal cues or merely leverage textual priors. 2. The authors highlight that "SFT remains a simple yet efficient approach for training on NEP", but according to Table 2, SFT does not appear to be the best strategy. 3. The evaluation
- Originality: It introduces a distinct and underexplored formulation, NEP, bridging video understanding and autoregressive reasoning. - Quality: Demonstrates consistent improvement across multiple reasoning tasks (Tables 1–3). - Clarity: The paper is well structured and clearly motivated. - Significance: The paper shows potential to influence future temporal reasoning research and dataset design, contingent on improved empirical robustness.
- No qualitative visualization of predicted events or linguistic coherence. - No error analysis or failure case study contrasting NEP and baseline models. - No ablation on architecture or training objectives to isolate contribution of NEP loss components.
- The core idea of NEP is sound and innovative. It effectively enhances both visual perception and temporal reasoning through a self-supervised learning paradigm, which also provides a clear advantage in scalability. - Results demonstrate that NEP yields notable improvements on temporal reasoning benchmarks, outperforming standard video QA and captioning. - The comparisons against existing training tasks are well-designed and carefully controlled. - The investigation of multiple training strateg
- Despite NEP’s theoretical scalability, the reported results show that downstream performance saturates with only 5k training samples. This suggests that the current dataset lacks sufficient diversity or scale to fully reveal the benefits of data scaling for NEP, thereby limiting its contribution to the community. - As the authors acknowledge, predicting future events from a past video segment is inherently ambiguous, and the automatically constructed NEP dataset may therefore contain samples t
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
