What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

Zhaotian Weng; Haoxuan Li; Xin Eric Wang; Kuan-Hao Huang; Jieyu Zhao

arXiv:2506.00869 · cs.CL · February 5, 2026

TL;DR

This paper evaluates how well vision-language models understand causal relationships in visual inputs. It finds significant gaps, introduces two benchmarks that measure them, and shows that targeted fine-tuning improves causal reasoning.

Contribution

Introduces the VQA-Causal and VCR-Causal benchmarks, which specifically assess causal reasoning in VLMs, and analyzes how limitations in training data affect causal understanding.

Findings

01 VLMs perform poorly on causal reasoning tasks

02 Training datasets lack explicit causal expressions

03 Targeted fine-tuning improves causal reasoning

Abstract

Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs' causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further…

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Topic Modeling