Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!

Mohamed Fazli Imam; Chenyang Lyu; Alham Fikri Aji

arXiv:2501.10674 · cs.CV · February 19, 2025

TL;DR

This paper evaluates the ability of multimodal large language models to understand and reason about visual temporal information, revealing significant limitations and the need for further development in this area.

Contribution

The paper introduces the TemporalVQA benchmark to assess MLLMs' visual temporal understanding and demonstrates their current shortcomings in temporal reasoning tasks.

Findings

01. GPT-4o achieved 49.1% accuracy in temporal order understanding.

02. GPT-4o achieved 70% accuracy in time-lapse estimation.

03. Open-source models performed poorly on both tasks.

Abstract

Multimodal Large Language Models (MLLMs) have achieved significant advancements in tasks like Visual Question Answering (VQA) by leveraging foundational Large Language Models (LLMs). However, their abilities in specific areas such as visual temporal understanding, which is crucial for comprehending real-world dynamics, remain underexplored. To address this, we propose a challenging evaluation benchmark named TemporalVQA, consisting of two parts: 1) Temporal Order Understanding and 2) Time-lapse Estimation. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and…
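Since both parts of the benchmark are reported as accuracy, a small scoring sketch may help make the numbers concrete. The code below is not the authors' evaluation script; it is a minimal illustration, assuming each question is stored with its answer options, a gold answer, and a model prediction. The `Item` class, the field names, and the option labels in the toy data are all hypothetical. It also computes the accuracy expected from uniform random guessing, which is a useful reference point when reading a multiple-choice score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the dataset's)."""
    options: List[str]   # answer choices shown to the model
    answer: str          # gold label
    prediction: str      # label returned by the model


def accuracy(items: List[Item]) -> float:
    """Fraction of items whose prediction exactly matches the gold answer."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(item.prediction == item.answer for item in items)
    return correct / len(items)


def chance_accuracy(items: List[Item]) -> float:
    """Expected accuracy of uniform random guessing over each item's options."""
    if not items:
        raise ValueError("no items to score")
    return sum(1 / len(item.options) for item in items) / len(items)


if __name__ == "__main__":
    # Toy data only; the option labels are illustrative placeholders.
    toy = [
        Item(["seconds", "hours", "days", "years"], "hours", "hours"),
        Item(["seconds", "hours", "days", "years"], "years", "days"),
    ]
    print(f"accuracy: {accuracy(toy):.3f}")
    print(f"chance:   {chance_accuracy(toy):.3f}")
```

Comparing a model's accuracy with the chance value for the same set of questions shows how much of the score reflects actual temporal reasoning rather than guessing.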

Code & Models

Datasets

fazliimam/temporal-vqa
dataset · 30 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems