Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
Isha Chaturvedi, Anjana Nair, Yushen Li, Adhitya Rajendra Kumar, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma

TL;DR
This paper presents Contrastive Region Masking (CRM), a novel diagnostic tool for analyzing how multimodal large language models rely on visual regions during reasoning, revealing model dependencies and failure modes at each reasoning step.
Contribution
CRM is a training-free, step-level attribution method that systematically masks visual regions to diagnose model reasoning dependencies and failure modes.
Findings
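Models differ in how heavily they rely on visual cues and in how robust they are to perturbations.
Some models preserve reasoning structure but hallucinate when evidence is missing; others ground tightly to visual cues yet collapse under perturbations.
CRM reframes evaluation from answer correctness to reasoning faithfulness.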
Abstract
We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting the evaluation from correctness of answers to faithfulness of reasoning, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just…
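The mask-and-contrast loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy list-of-lists image, the `toy_model` stand-in for an MLLM, the function names, and the word-overlap (Jaccard) distance used to compare reasoning steps are all assumptions made for illustration. The source does not say how CRM masks pixels or scores the difference between traces.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of `image` with the pixels inside `box` set to `fill`."""
    x0, y0, x1, y1 = box
    masked = [row[:] for row in image]  # copy so the baseline stays intact
    for y in range(max(y0, 0), min(y1, len(masked))):
        for x in range(max(x0, 0), min(x1, len(masked[y]))):
            masked[y][x] = fill
    return masked


def step_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (assumed stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def contrast_region_masking(
    image: Image,
    regions: Dict[str, Box],
    model: Callable[[Image], List[str]],
) -> Dict[str, List[float]]:
    """For each region, report how far each baseline step shifts when masked."""
    baseline = model(image)
    report: Dict[str, List[float]] = {}
    for name, box in regions.items():
        masked_steps = model(mask_region(image, box))
        shifts = []
        for i, base_step in enumerate(baseline):
            # A step missing from the masked trace counts as fully changed.
            other = masked_steps[i] if i < len(masked_steps) else ""
            shifts.append(1.0 - step_similarity(base_step, other))
        report[name] = shifts
    return report


def toy_model(image: Image) -> List[str]:
    """Hypothetical two-step 'reasoner' that only looks at the top-left pixels."""
    has_logo = any(image[0][x] for x in range(2))
    return [
        "the image shows a scene",
        "a logo is visible" if has_logo else "no logo is visible",
    ]


image = [[9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
regions = {"logo": (0, 0, 2, 1), "background": (2, 2, 4, 4)}
report = contrast_region_masking(image, regions, toy_model)
print(report)
```

In this toy run, masking the "logo" region changes only the second reasoning step, while masking the "background" region changes nothing, which is the kind of step-level, per-region dependence the method is meant to expose. A real setup would replace `toy_model` with calls to an MLLM and would likely need a more robust comparison of reasoning traces.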
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
