Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities

Zhiyuan Li; Heng Wang; Dongnan Liu; Chaoyi Zhang; Ao Ma; Jieting Long; Weidong Cai

arXiv:2408.08105·cs.CV·May 28, 2025·3 cites

TL;DR

This paper introduces MuCR, a benchmark for testing causal reasoning in multimodal large language models (MLLMs). It reveals their current limitations and proposes a strategy that helps them make better use of visual cues.

Contribution

The paper presents MuCR, a new benchmark for multimodal causal reasoning, and introduces VcCoT, a strategy to enhance visual cue recognition in multimodal large language models.

Findings

01. Current MLLMs underperform in multimodal causal reasoning compared to textual tasks.

02. Identifying visual cues is crucial for cross-modal generalization.

03. VcCoT improves the models' ability to leverage visual information for causal inference.

Abstract

Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks, including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Can we effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these questions, we introduce MuCR, a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal…

Code & Models

Repositories

zhiyuan-li-john/mucr (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques