M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, Wanxiang Che

TL;DR
This paper introduces M$^3$CoT, a benchmark for multi-domain, multi-step, multi-modal Chain-of-Thought reasoning. It addresses limitations of existing MCoT benchmarks (absent visual reasoning, single-step visual reasoning, and missing domains) and evaluates the reasoning capabilities of Vision Large Language Models (VLLMs).
Contribution
It presents the first benchmark covering multi-domain, multi-step, and multi-modal reasoning in Chain-of-Thought, and evaluates VLLMs, highlighting existing gaps and challenges.
Findings
Current VLLMs struggle with M$^3$CoT reasoning.
A significant gap remains between VLLMs and human performance.
M$^3$CoT serves as a new resource for research.
Abstract
Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, and it has gained increasing attention. Nevertheless, current MCoT benchmarks still face several challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) missing domains, all of which hinder the development of MCoT. Motivated by this, we introduce a novel benchmark (M$^3$CoT) to address the above challenges, advancing multi-domain, multi-step, and multi-modal CoT. Additionally, we conduct a thorough evaluation of a wide range of MCoT approaches on Vision Large Language Models (VLLMs). We also highlight that current VLLMs still struggle to reason correctly in M$^3$CoT and that a large gap remains between existing VLLMs and human performance in M$^3$CoT, despite their superior results on previous…
Taxonomy
Topics: Advanced Text Analysis Techniques · Complex Network Analysis Techniques
