Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt

TL;DR
This paper presents TempVS, a benchmark that evaluates whether multimodal large language models understand the temporal order of events in image sequences, and finds that current models fall well short of human performance.
Contribution
Introduction of TempVS, a comprehensive benchmark for testing temporal reasoning in multimodal models, along with an evaluation of 38 models showing significant performance gaps.
Findings
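The 38 evaluated MLLMs struggle on TempVS tasks, with a substantial performance gap compared to humans.
TempVS exposes limitations in how current MLLMs ground and reason about the temporal order of events across visual and linguistic inputs.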
Benchmark data and code are publicly available for future research.
Abstract
This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering, and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
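To make the ordering tests concrete, the following is a minimal sketch of how a sentence-ordering or image-ordering prediction could be scored by exact match. It is an illustration under stated assumptions, not the benchmark's actual evaluation code: the function names, the representation of an order as a list of integer indices, and the choice of exact-match accuracy as the metric are all hypothetical.

```python
from typing import Sequence


def ordering_exact_match(predicted: Sequence[int], gold: Sequence[int]) -> bool:
    """Return True only if the predicted order equals the gold order exactly.

    Hypothetical helper: orders are assumed to be lists of item indices.
    """
    return list(predicted) == list(gold)


def ordering_accuracy(
    predictions: Sequence[Sequence[int]], golds: Sequence[Sequence[int]]
) -> float:
    """Fraction of examples whose predicted order matches the gold order.

    Assumed metric for illustration; the benchmark may score differently.
    """
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(ordering_exact_match(p, g) for p, g in zip(predictions, golds))
    return correct / len(golds)


# Toy data (made up): two examples, each ordering three shuffled items.
golds = [[0, 1, 2], [2, 0, 1]]
preds = [[0, 1, 2], [0, 2, 1]]
score = ordering_accuracy(preds, golds)
```

Exact match is a strict criterion: an order with a single swapped pair counts as wrong, which is one reason ordering tasks can be hard to pass by chance.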
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
