pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
Zhanpeng Luo, Ce Zhang, Silong Yong, Cunxi Dai, Qianwei Wang, Haoxi Ran, Guanya Shi, Katia Sycara, Yaqi Xie

TL;DR
pySpatial enables large language models to perform zero-shot 3D spatial reasoning by generating Python code that interfaces with spatial tools, transforming 2D inputs into 3D scenes for explicit reasoning.
Contribution
We introduce pySpatial, a zero-shot visual programming framework that enhances MLLMs with 3D spatial reasoning capabilities without fine-tuning.
Findings
Outperforms baseline MLLMs on MindCube and Omni3D-Bench benchmarks.
Achieves 12.94% higher accuracy than GPT-4.1-mini on MindCube.
Enables real-world indoor navigation using generated route plans.
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks…
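To make the idea of a generated visual program concrete, here is a minimal sketch of what such a program could look like. It is not the pySpatial API: every name below (`reconstruct`, `rotate_view`, `relative_direction`, `program`, `Scene`, `CameraPose`) is hypothetical, and the "reconstruction" is a toy stand-in that reads ground-truth 2D positions from dictionaries instead of estimating geometry from pixels.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraPose:
    x: float
    y: float
    yaw_deg: float  # heading, counter-clockwise from the +x axis


@dataclass
class Scene:
    poses: list    # one CameraPose per input image
    objects: dict  # object name -> (x, y) position on the ground plane


def reconstruct(images):
    """Hypothetical stand-in for 3D reconstruction and camera-pose recovery.

    A real tool would estimate geometry from pixels; this toy version simply
    merges the ground-truth annotations carried by each 'image' dict.
    """
    poses = [CameraPose(*img["pose"]) for img in images]
    objects = {}
    for img in images:
        objects.update(img["objects"])
    return Scene(poses, objects)


def rotate_view(pose, degrees):
    """Hypothetical novel-view tool: turn the camera in place (positive = left)."""
    return CameraPose(pose.x, pose.y, (pose.yaw_deg + degrees) % 360)


def relative_direction(pose, point):
    """Classify where `point` lies relative to the camera heading."""
    dx, dy = point[0] - pose.x, point[1] - pose.y
    bearing = math.degrees(math.atan2(dy, dx)) - pose.yaw_deg
    bearing = (bearing + 180) % 360 - 180  # wrap to [-180, 180)
    if abs(bearing) <= 45:
        return "front"
    if abs(bearing) >= 135:
        return "behind"
    return "left" if bearing > 0 else "right"


def program(images):
    """The kind of short program an MLLM might emit for the query:
    'From the first view, if I turn left 90 degrees, where is the chair?'"""
    scene = reconstruct(images)
    view = rotate_view(scene.poses[0], 90)
    return relative_direction(view, scene.objects["chair"])


# Toy input: two "images" annotated with a camera pose and visible objects.
images = [
    {"pose": (0.0, 0.0, 0.0), "objects": {"chair": (0.0, 2.0)}},
    {"pose": (1.0, 0.0, 90.0), "objects": {"table": (3.0, 0.0)}},
]
print(program(images))  # front
```

The point of the sketch is only the structure: the answer comes from explicit geometric operations on a reconstructed scene rather than from the model's implicit guess about the layout.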
Peer Reviews
Decision: ICLR 2026 Poster
1. Novel and Well-Motivated Problem Formulation The paper addresses a clearly identified limitation in current MLLMs regarding 3D spatial reasoning from limited views (Section 1, lines 54-65). While recent works like SpatialVLM and SpatialRGPT focus on single-view spatial understanding, this work tackles the more challenging multi-view setting where models must reason across perspectives. The visual programming paradigm is well-suited to this problem, allowing flexible composition of spatial to
1. Limited and Non-Rigorous Real-World Validation While Section 4.4 presents robot navigation as evidence of practical effectiveness, the evaluation is limited: (a) Qualitative only: No quantitative success rates, path efficiency metrics, or safety margins are reported; (b) Single environment: Testing appears confined to one 50m² two-room laboratory; (c) Manual intervention: High-level position commands are "manually converted into temporal velocity targets" (lines 375-377), reducing the autonom
In general, the paper is written quite clearly. The method is well described, and Figure 1 clearly shows the differences between most spatial mental models and pySpatial. The model is evaluated on several datasets (MindCube, Omni3D-Bench) to show its effectiveness. Spatial reasoning with MLLMs is also an important and relevant problem in the community.
I have some questions about the actual effectiveness of the visual programming setup. From my understanding, and from all the results shown in the paper, pySpatial lists the steps for calling external APIs. It does not perform additional complex actions (e.g., loops, if/else, etc.) beyond a sequence of API calls. Are there cases where answering the question requires more than a linear sequence of API calls, and if so, can we see several of these examples? If not, this makes me question whether explic
1. The paper clearly identifies a relevant gap between current MLLMs’ implicit, imagination-based spatial reasoning and the need for explicit geometric grounding. The motivation is reasonable and reflects an active research direction in improving spatial understanding for embodied and multi-view settings.
2. The proposed visual programming framework is well-structured and methodologically sound. Its modular API design offers a clear and interpretable mechanism for integrating 3D reasoning tools
1. While the integration of 3D spatial reasoning within a visual programming framework is well-executed, the core concept of using generated Python code as an intermediate reasoning layer is not entirely novel. Prior works such as VisProg, ViperGPT, and VADAR have explored similar paradigms for visual reasoning. The novelty here primarily lies in extending this paradigm to 3D tools rather than introducing a fundamentally new reasoning mechanism. Consequently, the conceptual contribution may be p
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
