What Happens When: Learning Temporal Orders of Events in Videos

Daechul Ahn; Yura Choi; Hyeonbeom Choi; Seongwon Cho; San Kim; Jonghyun Choi

arXiv:2512.08979 · cs.CV · December 11, 2025

TL;DR

This paper evaluates the temporal understanding of Video Large Multimodal Models (VLMMs), introduces VECTOR, a benchmark that assesses their ability to recognize event order, and proposes MECOT to improve their temporal reasoning.

Contribution

The paper shows that current VLMMs struggle to understand event order, introduces VECTOR to assess temporal order explicitly, and proposes MECOT to strengthen temporal reasoning.

Findings

01

VLMMs perform well on existing benchmarks even when video frames are scrambled, suggesting reliance on prior knowledge rather than on frame order.

02

Existing models often fail to understand event sequences in videos.

03

MECOT improves temporal understanding on VECTOR and other benchmarks.

Abstract

Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Interestingly, our comprehensive experiments show that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not rely on accurate sequential processing of visual events, and may instead depend on prior knowledge of typical scenarios to answer questions. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which…
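The scrambled-frame observation in the abstract can be illustrated with a small sanity check. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: `answer_fn` is a hypothetical stand-in for any video question-answering model, and each example is assumed to be a `(frames, question, gold_answer)` tuple. The names `scramble_frames`, `accuracy`, and `order_sensitivity` are illustrative and do not come from the paper.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical example format: (frames, question, gold_answer).
Example = Tuple[Sequence[Any], str, str]


def scramble_frames(frames: Sequence[Any], seed: int = 0) -> List[Any]:
    """Return a shuffled copy of the frames; the input is left untouched."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def accuracy(
    answer_fn: Callable[[List[Any], str], str],
    examples: Sequence[Example],
    scramble: bool = False,
    seed: int = 0,
) -> float:
    """Fraction of examples answered correctly, optionally on scrambled frames."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for frames, question, gold in examples:
        inputs = scramble_frames(frames, seed) if scramble else list(frames)
        if answer_fn(inputs, question) == gold:
            correct += 1
    return correct / len(examples)


def order_sensitivity(
    answer_fn: Callable[[List[Any], str], str],
    examples: Sequence[Example],
    seed: int = 0,
) -> float:
    """Accuracy on ordered frames minus accuracy on scrambled frames.

    A gap near zero suggests the answers do not depend on frame order.
    """
    ordered = accuracy(answer_fn, examples)
    scrambled = accuracy(answer_fn, examples, scramble=True, seed=seed)
    return ordered - scrambled
```

A benchmark on which strong models show a gap near zero is one that can be answered without tracking event order, which is the concern that motivates an order-focused benchmark such as VECTOR.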

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis