The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

TL;DR
This paper shows that current multimodal large language models rely on a Cartesian shortcut in visual reasoning tasks. Reformulating the same tasks in Polar coordinates breaks the shortcut and exposes the models' lack of topology-invariant reasoning.
Contribution
The authors introduce Polaris-Bench, a benchmark reformulating visual reasoning tasks in Polar coordinates to evaluate and expose the reliance of models on Cartesian shortcuts.
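To make the idea of a Polar reformulation concrete, here is a minimal sketch of how a cell in an orthogonal grid could be re-laid-out as a cell in a ring-and-sector layout. It is an illustration only, not the construction used by Polaris-Bench: the function names, the choice of mapping rows to rings and columns to sectors, the full-circle sector span, and the `r_inner` and `ring_width` parameters are all assumptions made for this example.

```python
import math


def grid_to_polar_cell(row, col, n_rows, n_cols, r_inner=1.0, ring_width=1.0):
    """Return (radius, theta) for the center of the polar cell that
    corresponds to grid cell (row, col).

    Assumed mapping (hypothetical, for illustration):
      - each grid row becomes one concentric ring, innermost first
      - each grid column becomes one angular sector of a full circle
    """
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError("cell index out of range")
    # Ring centers sit half a ring width past the ring's inner edge.
    radius = r_inner + (row + 0.5) * ring_width
    # Sector centers sit half a sector past the sector's starting angle.
    theta = 2.0 * math.pi * (col + 0.5) / n_cols
    return radius, theta


def polar_to_xy(radius, theta):
    """Convert a polar position to image-plane (x, y) for rendering."""
    return radius * math.cos(theta), radius * math.sin(theta)


if __name__ == "__main__":
    # Cell (0, 0) of a 4x8 grid lands in the innermost ring, first sector.
    r, t = grid_to_polar_cell(0, 0, n_rows=4, n_cols=8)
    print(r, t, polar_to_xy(r, t))
```

Under this assumed mapping the logical indices (row, column) are unchanged, so a task's rules can stay the same while the rendered positions no longer fall on an orthogonal grid that is easy to transcribe as text coordinates.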
Findings
Model performance drops from 70-83% on Cartesian layouts to 31-39% on Polar layouts.
Reasoning improvements on Cartesian layouts do not transfer to Polar equivalents.
Current models lack topology-invariant visual reasoning capabilities.
Abstract
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive…
