Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

Yiran Ling; Qing Lian; Jinghang Li; Qing Jiang; Tianming Zhang; Xiaoke Jiang; Chuanxiu Liu; Jie Liu; Lei Zhang

arXiv:2605.13632 · cs.RO · May 14, 2026

TL;DR

GTA-VLA introduces an interactive framework for embodied reasoning in vision-language-action models, enabling human spatial guidance to improve robot task success, especially under out-of-domain conditions.

Contribution

The paper presents an interactive reasoning approach that incorporates human spatial guidance into the decision-making process of embodied agents.

Findings

01. Achieves an 81.2% success rate on the SimplerEnv WidowX benchmark.

02. A single visual interaction significantly improves task success under out-of-domain shifts.

03. Effectively integrates external guidance with internal task planning for better recovery.

Abstract

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the…
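To make the idea of optional spatial guidance concrete, here is a minimal sketch of how the three kinds of priors named in the abstract (affordance points, boxes, and traces) might be represented and attached to a language instruction. Everything in it is an illustrative assumption rather than the authors' implementation: the class names, the normalized-coordinate convention, the `build_prompt` helper, and the `[guidance: ...]` text format are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# Assumption: coordinates are (x, y) pairs normalized to [0, 1] in image space.
Point = Tuple[float, float]


@dataclass
class AffordancePoint:
    """Hypothetical container for a single user-clicked point."""
    xy: Point


@dataclass
class Box:
    """Hypothetical container for a user-drawn bounding box."""
    top_left: Point
    bottom_right: Point


@dataclass
class Trace:
    """Hypothetical container for a user-drawn path of waypoints."""
    waypoints: List[Point]


Guidance = Union[AffordancePoint, Box, Trace]


def _fmt(p: Point) -> str:
    # Render a point with two decimals so the text form is stable.
    return f"({p[0]:.2f}, {p[1]:.2f})"


def guidance_to_text(guidance: Optional[Guidance]) -> str:
    """Serialize a guidance object to text; empty string when none is given."""
    if guidance is None:
        return ""
    if isinstance(guidance, AffordancePoint):
        return f"point {_fmt(guidance.xy)}"
    if isinstance(guidance, Box):
        return f"box {_fmt(guidance.top_left)} {_fmt(guidance.bottom_right)}"
    if isinstance(guidance, Trace):
        return "trace " + " ".join(_fmt(p) for p in guidance.waypoints)
    raise TypeError(f"unsupported guidance type: {type(guidance).__name__}")


def build_prompt(instruction: str, guidance: Optional[Guidance] = None) -> str:
    """Attach the optional guidance to the instruction (hypothetical format)."""
    hint = guidance_to_text(guidance)
    return instruction if not hint else f"{instruction} [guidance: {hint}]"


if __name__ == "__main__":
    print(build_prompt("pick up the spoon"))
    print(build_prompt("pick up the spoon", AffordancePoint((0.5, 0.25))))
```

Because the guidance argument defaults to `None`, the instruction passes through unchanged when the user provides no cue, which mirrors the "optionally guide" wording of the abstract. How the actual model consumes such cues is not described in the portion of the abstract shown here.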

Code & Models

Repositories

https://signalispupupu.github.io/GTA-VLA_ProjPage
