Video-CoE: Reinforcing Video Event Prediction via Chain of Events
Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu

TL;DR
This paper introduces the Chain of Events (CoE) paradigm to enhance video event prediction by constructing temporal event chains, improving reasoning and visual understanding in multimodal large language models, and achieving state-of-the-art results.
Contribution
The paper proposes a novel CoE paradigm that constructs event chains to improve logical reasoning and visual content utilization in video event prediction models.
Findings
Outperforms existing models on public benchmarks
Establishes a new state of the art in video event prediction (VEP)
Enhances reasoning capabilities of MLLMs
Abstract
Despite advances in the application of MLLMs to various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly force the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing model's…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
