TL;DR
GTA-VLA introduces an interactive framework for embodied reasoning in vision-language-action models, enabling human spatial guidance to improve robot task success, especially under out-of-domain conditions.
Contribution
It presents a novel interactive reasoning approach that incorporates human spatial guidance into the decision-making process of embodied agents.
Findings
GTA-VLA achieves an 81.2% success rate on the SimplerEnv WidowX benchmark.
A single visual interaction significantly improves task success under out-of-domain shifts.
The framework integrates external guidance with internal task planning, which improves recovery from failures.
Abstract
In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the…
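The three kinds of spatial prior named in the abstract (affordance points, boxes, and traces) can be pictured as a small optional input type. The sketch below is only an illustration under stated assumptions: the class names, the pixel-coordinate convention, and the `normalize` helper are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Point = Tuple[float, float]  # (x, y) in pixel coordinates (assumed convention)


@dataclass
class AffordancePoint:
    """A single clicked location, e.g. where to grasp (hypothetical type)."""
    xy: Point


@dataclass
class Box:
    """An axis-aligned box around a target object (hypothetical type)."""
    top_left: Point
    bottom_right: Point


@dataclass
class Trace:
    """An ordered list of waypoints drawn by the user (hypothetical type)."""
    waypoints: List[Point]


SpatialGuidance = Union[AffordancePoint, Box, Trace]


def guidance_points(g: SpatialGuidance) -> List[Point]:
    """Flatten any guidance type into an ordered list of points."""
    if isinstance(g, AffordancePoint):
        return [g.xy]
    if isinstance(g, Box):
        return [g.top_left, g.bottom_right]
    return list(g.waypoints)


def normalize(g: Optional[SpatialGuidance], width: int, height: int) -> List[Point]:
    """Scale pixel coordinates to [0, 1]; guidance is optional, so None gives []."""
    if g is None:
        return []
    return [(x / width, y / height) for x, y in guidance_points(g)]
```

Treating guidance as `Optional` mirrors the abstract's statement that users may optionally guide the policy; how the actual model consumes such cues is not specified in the text above.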
