InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning

Ji Zhang; Shihan Wu; Xu Luo; Hao Wu; Lianli Gao; Heng Tao Shen; Jingkuan Song

arXiv:2505.13888 · cs.RO · September 30, 2025

TL;DR

InSpire strengthens the spatial reasoning of vision-language-action models for robotics, reducing their reliance on task-irrelevant visual correlations and improving generalization without extra training data or interaction with other large models, as shown in simulation and real-world experiments.

Contribution

The paper introduces InSpire, a simple plugin that improves spatial reasoning in VLAs, addressing spurious correlations and enhancing generalization in robotic tasks.

Findings

01. InSpire significantly improves task success rates in both simulation and real-world tests.

02. It reduces reliance on task-irrelevant visual features.

03. The approach requires no extra training data and no interaction with other large models.

Abstract

Leveraging pretrained Vision-Language Models (VLMs) to map language instruction and visual observations to raw low-level actions, Vision-Language-Action models (VLAs) hold great promise for achieving general-purpose robotic systems. Despite their advancements, existing VLAs tend to spuriously correlate task-irrelevant visual features with actions, limiting their generalization capacity beyond the training data. To tackle this challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet effective approach that mitigates the adverse effects of spurious correlations by boosting the spatial reasoning ability of VLAs. Specifically, InSpire redirects the VLA's attention to task-relevant factors by prepending the question "In which direction is the [object] relative to the robot?" to the language instruction and aligning the answer "right/left/up/down/front/back/grasped" and…
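
The prompt construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the question template and the answer vocabulary come from the abstract, while the function names (`build_prompt`, `is_valid_answer`), the single-space separator between question and instruction, and the example instruction are assumptions made for illustration.

```python
# Question template and answer vocabulary quoted from the abstract.
QUESTION_TEMPLATE = "In which direction is the {obj} relative to the robot?"
DIRECTION_ANSWERS = ("right", "left", "up", "down", "front", "back", "grasped")


def build_prompt(instruction: str, obj: str) -> str:
    """Prepend the spatial question about `obj` to the task instruction.

    Hypothetical helper: joining with a single space is an assumption.
    """
    question = QUESTION_TEMPLATE.format(obj=obj)
    return f"{question} {instruction}"


def is_valid_answer(answer: str) -> bool:
    """Check that an answer belongs to the vocabulary listed in the abstract.

    Hypothetical helper: case and whitespace normalization is an assumption.
    """
    return answer.strip().lower() in DIRECTION_ANSWERS


# Example with a made-up instruction and object name.
prompt = build_prompt("pick up the red mug", "red mug")
print(prompt)
```

The sketch covers only the input side. How the answer is aligned with the model's outputs is not shown, because the abstract is truncated at that point.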

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Speech and dialogue systems