Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Discern Causal Links Across Modalities
Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai

TL;DR
This paper introduces MuCR, a benchmark for testing multimodal causal reasoning in large language models, revealing current limitations and proposing a strategy to improve their understanding of visual cues across modalities.
Contribution
The paper presents MuCR, a new benchmark for multimodal causal reasoning, and introduces VcCoT, a strategy to enhance visual cue recognition in large language models.
Findings
Current MLLMs underperform in multimodal causal reasoning compared to textual tasks.
Identifying visual cues is crucial for cross-modal generalization.
VcCoT improves the models' ability to leverage visual information for causal inference.
Abstract
Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Whether we can effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these, we introduce MuCR - a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
