What Happens When: Learning Temporal Orders of Events in Videos
Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim, Jonghyun Choi

TL;DR
This paper evaluates the temporal understanding of Video Large Multimodal Models (VLMMs), introduces a new benchmark, VECTOR, to assess their ability to recognize event order, and proposes MECOT to improve temporal reasoning in these models.
Contribution
The paper shows that current VLMMs struggle to understand event order, introduces the VECTOR benchmark for explicit temporal assessment, and proposes MECOT to enhance temporal reasoning.
Findings
VLMMs perform well on existing benchmarks even with scrambled frames, suggesting reliance on prior knowledge of typical scenarios rather than on frame order.
On VECTOR, various VLMMs often fail to identify the order of events in videos.
MECOT improves temporal understanding on VECTOR and other benchmarks.
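The scrambled-frame finding can be illustrated with a small probe. The following is a minimal sketch, not the authors' evaluation code: `scramble_frames`, `accuracy`, and the `answer_fn` callable (standing in for any video question-answering model) are hypothetical names introduced here, and the sample format `(frames, question, gold_answer)` is an assumption.

```python
from __future__ import annotations

import random
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")


def scramble_frames(frames: Sequence[Frame], seed: int = 0) -> list[Frame]:
    """Return a copy of `frames` in a random order (seeded for repeatability)."""
    rng = random.Random(seed)
    shuffled = list(frames)  # copy so the caller's sequence is left untouched
    rng.shuffle(shuffled)
    return shuffled


def accuracy(
    answer_fn: Callable[[Sequence[Frame], str], str],
    samples: Sequence[tuple[Sequence[Frame], str, str]],
    scramble: bool = False,
    seed: int = 0,
) -> float:
    """Fraction of (frames, question, gold_answer) samples answered correctly.

    `answer_fn` is a hypothetical stand-in for a model: it takes the frames
    and a question and returns an answer string.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    correct = 0
    for i, (frames, question, gold) in enumerate(samples):
        # Offset the seed per sample so each video gets its own permutation.
        inputs = scramble_frames(frames, seed + i) if scramble else frames
        correct += answer_fn(inputs, question) == gold
    return correct / len(samples)
```

Comparing `accuracy(answer_fn, samples)` with `accuracy(answer_fn, samples, scramble=True)` gives a rough signal: if the two numbers are close, the questions can likely be answered without using frame order at all.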
Abstract
Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models perform very well on existing benchmarks even when video frames are scrambled. This implies that VLMMs may not rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer the question. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
