TL;DR
This paper explores the limitations of large language models in spatial reasoning tasks and finds that current models lack essential visual-spatial primitives, even when augmented with external imagery modules.
Contribution
It demonstrates that equipping LLMs with an external imagery module does not significantly improve spatial reasoning, highlighting fundamental perceptual and reasoning gaps.
Findings
Accuracy on 3D rotation tasks peaked at 62.5%.
External imagery modules do not overcome core spatial reasoning limitations.
Current models lack low-level spatial signals and dynamic visual reasoning capabilities.
Abstract
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery.…
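To make the idea of outsourcing 3D state concrete, here is a minimal sketch of what an imagery module could look like. It is not the paper's implementation: it assumes shapes are sets of integer voxel coordinates and supports only 90-degree rotations, whereas the abstract describes a tool that renders and rotates 3D models. The names `ImageryModule`, `rotate90`, `normalize`, `all_orientations`, and `same_object` are hypothetical. The `same_object` helper is a brute-force ground-truth check for a mental rotation item, not a stand-in for the MLLM reasoning module.

```python
def rotate90(shape, axis):
    """Rotate a set of (x, y, z) voxels by 90 degrees about one coordinate axis."""
    if axis == "x":
        return frozenset((x, -z, y) for x, y, z in shape)
    if axis == "y":
        return frozenset((z, y, -x) for x, y, z in shape)
    if axis == "z":
        return frozenset((-y, x, z) for x, y, z in shape)
    raise ValueError(f"unknown axis: {axis!r}")


def normalize(shape):
    """Translate the shape so its minimum corner sits at the origin."""
    min_x = min(x for x, _, _ in shape)
    min_y = min(y for _, y, _ in shape)
    min_z = min(z for _, _, z in shape)
    return frozenset((x - min_x, y - min_y, z - min_z) for x, y, z in shape)


class ImageryModule:
    """Hypothetical imagery module: holds the 3D state so the reasoner need not."""

    def __init__(self, shape):
        self.state = normalize(frozenset(shape))

    def rotate(self, axis):
        """Apply one rotation command and return the updated state."""
        self.state = normalize(rotate90(self.state, axis))
        return self.state


def all_orientations(shape):
    """Enumerate every orientation reachable through 90-degree rotations."""
    start = normalize(frozenset(shape))
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for axis in "xyz":
            candidate = normalize(rotate90(current, axis))
            if candidate not in seen:
                seen.add(candidate)
                frontier.append(candidate)
    return seen


def same_object(a, b):
    """True if b is a rotated copy of a; mirror images are rejected."""
    return normalize(frozenset(b)) in all_orientations(a)


# A chiral four-voxel "screw" shape and two probes derived from it.
screw = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}
rotated = rotate90(rotate90(screw, "x"), "y")
mirrored = {(-x, y, z) for x, y, z in screw}
```

In this toy setting, `same_object(screw, rotated)` holds while `same_object(screw, mirrored)` does not, which mirrors the same-versus-mirror judgment of a classic mental rotation item. A reasoning module would only issue `rotate` commands and compare the returned states; the abstract's finding is that failures persist even with this kind of state-keeping outsourced.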
