TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
Zeqing Wang, Shiyuan Zhang, Chengpei Tang, Keze Wang

TL;DR
This paper introduces TimeCausality, a benchmark to evaluate vision-language models' ability to understand and reason about temporal causality, revealing current models' limitations in this aspect.
Contribution
The paper presents a new benchmark, TimeCausality, specifically designed to assess temporal causal reasoning in vision-language models, highlighting the gap in current models' capabilities.
Findings
Current SOTA open-source VLMs perform poorly on TimeCausality.
GPT-4o shows a performance drop on TimeCausality compared to other tasks.
There is a significant gap between open-source and closed-source models in temporal causal reasoning.
Abstract
Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although the current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce TimeCausality, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. Based on our TimeCausality, we find that while the current SOTA open-source VLMs have achieved performance levels comparable to closed-source…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
