SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Ruisen Tu; Arth Shukla; Sohyun Yoo; Xuanlin Li; Junxi Li; Jianwen Xie; Hao Su; Zhuowen Tu

arXiv:2603.22760 · cs.RO · March 25, 2026

Open Access

TL;DR

This paper introduces SG-VLA, a spatially-grounded vision-language-action model for mobile manipulation in household environments. It strengthens perception and control through auxiliary task co-training and multi-modal inputs (multi-view RGB, depth cues, and short temporal history), which the authors report improves robotic task performance.

Contribution

The paper presents a novel framework that integrates auxiliary decoders and multi-view inputs to develop spatially-grounded, manipulation-aware representations for complex robotic tasks.

Findings

01. Significant improvements in household manipulation tasks.

02. Enhanced spatial understanding from multi-view and depth cues.

03. Robust control over a high-dimensional action space.

Abstract

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement. Our method addresses the challenge of controlling a 13-dimensional action space involving coordinated base motion, arm articulation, and gripper actuation. To enrich spatial understanding, the model incorporates multi-view RGB observations, depth cues, and short temporal history, providing perspectives of both global scene structure and local manipulation context. To improve representation…
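
The co-training idea described above can be sketched as a weighted sum of an imitation loss on the 13-dimensional action and one or more auxiliary losses. This is a minimal illustration, not the authors' implementation: the abstract is truncated before it names the auxiliary tasks, so the task names (`aux_a`), the weights, the use of mean squared error, and the function names `mse` and `co_training_loss` are all assumptions made for the example.

```python
# Illustrative sketch only; loss form, task names, and weights are assumptions.

ACTION_DIM = 13  # from the abstract: base motion, arm articulation, gripper actuation


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def co_training_loss(action_pred, action_target, aux_losses, aux_weights):
    """Combine the action imitation loss with weighted auxiliary losses.

    aux_losses:  dict mapping a (hypothetical) auxiliary task name to its scalar loss.
    aux_weights: dict mapping the same names to scalar weights.
    """
    if len(action_pred) != ACTION_DIM:
        raise ValueError(f"expected a {ACTION_DIM}-dimensional action")
    action_loss = mse(action_pred, action_target)
    aux_total = sum(aux_weights[name] * value for name, value in aux_losses.items())
    return action_loss + aux_total
```

In this sketch the auxiliary terms only shape the shared representation during training; setting every weight to zero recovers plain imitation learning on the action targets.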

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI