SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL

Siyi Chen; Mikaela Angelina Uy; Chan Hee Song; Faisal Ladhak; Adithyavairavan Murali; Qing Qu; Stan Birchfield; Valts Blukis; Jonathan Tremblay

arXiv:2512.04069·cs.CV·December 4, 2025

Open Access

TL;DR

SpaceTools introduces a two-phase reinforcement learning framework that lets vision language models (VLMs) coordinate multiple spatial reasoning tools, significantly improving performance on spatial understanding benchmarks and real-world manipulation tasks.

Contribution

The paper presents Double Interactive Reinforcement Learning (DIRL), a new training method allowing VLMs to learn multi-tool coordination without fixed pipelines, enhancing spatial reasoning capabilities.

Findings

01 Achieves state-of-the-art results on spatial understanding benchmarks.

02 Demonstrates reliable real-world manipulation with a robot.

03 Delivers substantial performance improvements over baseline methods.

Abstract

Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can use a wide variety of tools that could augment these capabilities, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge how to realize this vision without solely relying on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning could overcome this gap, but has so far been limited to reasoning with a single visual tool due to the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics