MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning

Chengfei Wu; Ronald Seoh; Bingxuan Li; Liqiang Zhang; Fengrong Han; Dan Goldwasser

arXiv:2507.07297·cs.CV·July 11, 2025


Open Access · 3 Reviews

TL;DR

MagiC is a new benchmark for evaluating multimodal cognition in vision-language models, focusing on reasoning quality, visual grounding, and robustness, revealing current limitations and guiding future improvements.

Contribution

The paper introduces MagiC, a comprehensive benchmark with new metrics for assessing grounded visual reasoning and model robustness in vision-language models.

Findings

01. Models often rely on superficial patterns rather than true reasoning.

02. Grounding fidelity varies significantly across models.

03. Benchmark reveals key limitations in current multimodal reasoning approaches.

Abstract

Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity,…
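The abstract mentions bounding box groundings but, in the portion shown here, does not say how predicted boxes are compared with annotated ones. As an illustration only, the sketch below uses intersection-over-union (IoU), a common overlap measure for boxes. The choice of IoU, the `(x_min, y_min, x_max, y_max)` box layout, and the function name `box_iou` are assumptions made for this example, not details taken from the paper.

```python
from typing import Tuple

# Assumed box layout (hypothetical, not from the paper):
# (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give zero intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 2.0, 2.0)
    annotated = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {box_iou(predicted, annotated):.3f}")
```

A score of this kind could be thresholded (for example, counting a grounding as correct above some IoU cutoff), but any such cutoff would be a further assumption.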

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* The dataset covers multiple reasoning types with detailed human annotations and saliency-based grounding boxes. The dual-source design (weakly supervised + curated) balances scalability and annotation quality.
* Testing 15 diverse models, including both open and proprietary ones, provides a convincing comparative landscape.
* The addition of StepSense and Self-Heal metrics is creative and useful.
* The paper's main strength lies in its multi-dimensional evaluation. The explicit assessment of (

Weaknesses

* The evaluation reports results from a single run, without standard deviations.
* The dataset is derived from GQA, which limits its diversity; the results may not generalize to broader tasks.
* Many metrics rely on an LLM-as-a-judge. Without a human agreement study or cross-model validation, it’s unclear how reliable or unbiased these judgements are.
* The formulation of MagiScore seems a bit weak and ambiguous. No concrete details are given on how the various components of this score are compu
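The human agreement study this reviewer asks for could, for instance, compare an LLM judge's verdicts with human verdicts on the same items using Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal illustration of that statistic; the function name `cohens_kappa` and the toy labels are hypothetical and are not data or methods from the paper or the review.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    # If both raters always use the same single label, kappa is undefined;
    # treat it as perfect agreement here (an assumption of this sketch).
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy labels (hypothetical): 1 = reasoning step judged valid, 0 = invalid.
    human_labels = [1, 1, 0, 0]
    judge_labels = [1, 0, 0, 0]
    print(f"kappa = {cohens_kappa(human_labels, judge_labels):.2f}")
```

Kappa near 1 indicates close agreement with human raters, while values near 0 indicate agreement no better than chance.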

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* It makes sense to evaluate grounding. So collecting reasoning traces makes sense.
* The idea of explicit human corrections to evaluate the self-correction capabilities of a model is fun in theory. However, since it measures correcting off-policy behaviour, it is unclear if it really measures ‘self-healing’ behavior. Models typically tend to be very certain of the reasoning steps which they just produced.

Weaknesses

* There are existing papers like LLM-as-a-judge [Judging LLM-as-a-judge with MT-bench and Chatbot Arena, NeurIPS’23] which analyze reasoning steps fully automatically. While I do believe that using ground truth should be better, we at least want to know how good such an approach is.
* Gemini 2.5 Pro and GPT-5 yield >85% accuracy on GQA, while GQA is noisy. This suggests that GQA is nearly saturated, hence I do not believe that this dataset will be relevant for research much longer. It would be mu

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The benchmark dataset is well designed with broad coverage on different dimensions related to grounded vision cognition.
- The proposed metrics align well with the design target.
- The analysis of model failure modes is informative.

Weaknesses

- There are already many benchmark datasets for VLMs. This one seems to be object-centric with heavy use of bounding boxes. Discussion on the limitations and how it fits into the broader zoo of benchmarks would be helpful.
- The use of LLM judges could introduce biases.


Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning