SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation
Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

TL;DR
This paper introduces SG-VLA, a spatially-grounded vision-language-action model for mobile manipulation. It strengthens perception and control in household environments through auxiliary task co-training and multi-modal inputs (multi-view RGB, depth cues, and short temporal history), improving performance on household manipulation tasks.
Contribution
The paper presents a novel framework that integrates auxiliary decoders and multi-view inputs to develop spatially-grounded, manipulation-aware representations for complex robotic tasks.
Findings
Significant improvements in household manipulation tasks.
Enhanced spatial understanding from multi-view and depth cues.
Robust control over high-dimensional action space.
Abstract
Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains sub-optimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement. Our method addresses the challenge of controlling a 13-dimensional action space involving coordinated base motion, arm articulation, and gripper actuation. To enrich spatial understanding, the model incorporates multi-view RGB observations, depth cues, and short temporal history, providing perspectives of both global scene structure and local manipulation context. To improve representation…
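The auxiliary task co-training mentioned in the abstract can be illustrated with a minimal sketch. Only the 13-dimensional action space comes from the abstract; the mean-squared-error imitation loss, the weighted-sum combination, and the names `cotraining_loss`, `aux_a`, and `aux_b` are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Sequence

ACTION_DIM = 13  # action dimensionality stated in the abstract


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cotraining_loss(
    action_pred: Sequence[float],
    action_target: Sequence[float],
    aux_losses: Dict[str, float],
    aux_weights: Dict[str, float],
) -> float:
    """Imitation loss plus a weighted sum of auxiliary-task losses.

    Assumption: MSE for the action term and a plain weighted sum for the
    auxiliary terms; the paper's actual objective may differ.
    """
    if len(action_pred) != ACTION_DIM:
        raise ValueError(f"expected a {ACTION_DIM}-dimensional action")
    action_loss = mse(action_pred, action_target)
    # Auxiliary tasks without an explicit weight contribute nothing.
    aux_term = sum(
        aux_weights.get(name, 0.0) * value for name, value in aux_losses.items()
    )
    return action_loss + aux_term


# Hypothetical auxiliary tasks and weights, for illustration only.
example_total = cotraining_loss(
    action_pred=[0.0] * ACTION_DIM,
    action_target=[1.0] * ACTION_DIM,
    aux_losses={"aux_a": 2.0, "aux_b": 4.0},
    aux_weights={"aux_a": 0.5, "aux_b": 0.25},
)
```

In a setup like this, the auxiliary terms would shape the shared representation during training, and the auxiliary outputs could be dropped at inference time so that only the action prediction is used.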
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
