Video-CoE: Reinforcing Video Event Prediction via Chain of Events

Qile Su; Jing Tang; Rui Chen; Lei Sun; Xiangxiang Chu

arXiv:2603.14935 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces the Chain of Events (CoE) paradigm to enhance video event prediction by constructing temporal event chains, improving reasoning and visual understanding in multimodal large language models, and achieving state-of-the-art results.

Contribution

The paper proposes a novel CoE paradigm that constructs event chains to improve logical reasoning and visual content utilization in video event prediction models.

Findings

01. Outperforms existing models on public benchmarks

02. Establishes new state-of-the-art in VEP

03. Enhances reasoning capabilities of MLLMs

Abstract

Despite advances in applying MLLMs to various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and to establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly force the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing model's…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization