Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework
Nitya Tiwari, Parv Maheshwari, Vidisha Agarwal

TL;DR
This paper evaluates the generalizability of multimodal chain-of-thought reasoning across diverse datasets, analyzing how vision features and rationale quality influence reasoning accuracy and hallucination reduction.
Contribution
It systematically assesses the effectiveness of a two-stage multimodal CoT framework across multiple datasets, highlighting domain-specific challenges and insights for future improvements.
Findings
Vision features reduce hallucinations in rationale generation
Effectiveness of CoT varies across question types
Commonsense reasoning remains challenging
Abstract
While recent work has extended CoT to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA and ChartQA datasets, which require broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration…
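The abstract names a gated fusion mechanism but does not spell it out. The following is a minimal sketch of one common form of gated fusion, assuming text tokens attend over image patch features and a sigmoid gate then mixes the two streams. The function names, the single-head attention, and the gate projections `w_l` and `w_v` are illustrative assumptions, not details taken from this paper. It is written in plain Python so it runs without dependencies.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(h_lang, h_vis, w_l, w_v):
    """Fuse text states (n x d) with image patch features (m x d).

    w_l and w_v are hypothetical d x d gate projections (assumed, not
    taken from the paper). Returns an n x d fused representation.
    """
    d = len(h_lang[0])
    # Single-head attention: each text token queries the image patches.
    h_vis_t = [list(col) for col in zip(*h_vis)]
    scores = matmul(h_lang, h_vis_t)  # n x m
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    h_attn = matmul(attn, h_vis)  # n x d, vision features aligned to tokens
    # Element-wise gate computed from both streams.
    gate_lang = matmul(h_lang, w_l)
    gate_vis = matmul(h_attn, w_v)
    fused = []
    for i in range(len(h_lang)):
        row = []
        for j in range(d):
            lam = sigmoid(gate_lang[i][j] + gate_vis[i][j])
            # Convex mix: lam near 0 keeps text, lam near 1 keeps vision.
            row.append((1.0 - lam) * h_lang[i][j] + lam * h_attn[i][j])
        fused.append(row)
    return fused


if __name__ == "__main__":
    text_states = [[1.0, 0.0], [0.0, 1.0]]    # two tokens, d = 2
    image_patches = [[2.0, 2.0]]               # one patch, d = 2
    zero_gate = [[0.0, 0.0], [0.0, 0.0]]       # gate fixed at 0.5
    print(gated_fusion(text_states, image_patches, zero_gate, zero_gate))
```

With the gate projections set to zero the gate is 0.5 everywhere, so the output is the plain average of the text states and the attended vision features. In a trained model the gate would instead be learned, which lets each token decide how much visual evidence to admit.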
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
