Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

TL;DR
This paper shows that Chain-of-Thought prompting impairs visual spatial reasoning in multimodal models. The models also hallucinate visual details and rely on shortcuts, which points to the need for vision-centric reasoning approaches.
Contribution
The paper evaluates seventeen models across thirteen spatial benchmarks, showing that CoT degrades spatial reasoning, and introduces the No-Image++ ablation to analyze hallucination and shortcut learning in multimodal models.
Findings
CoT prompting consistently reduces performance in visual spatial reasoning tasks.
MRMs and CoT-prompted MLMs hallucinate visual details from textual priors even when the image is absent.
The No-Image++ ablation exposes severe shortcut learning in these models.
Abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT)-based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
