TL;DR
VP-VLA introduces a structured visual prompting interface that decouples high-level reasoning from low-level control in vision-language-action models, improving spatial precision and robustness.
Contribution
VP-VLA proposes a dual-system framework in which a high-level planner passes visual prompts to a low-level controller, improving instruction decomposition and control and outperforming existing end-to-end models.
Findings
VP-VLA surpasses state-of-the-art baselines in simulation and real-world tasks.
The visual prompting approach improves spatial accuracy and robustness.
A novel auxiliary grounding objective enhances low-level control reliability.
Abstract
Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies relevant target objects and goal locations. These spatial anchors are rendered directly within the native RGB observation space as modality-consistent visual prompts, such as crosshairs and bounding boxes. This avoids the modality mismatch introduced…
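To make the idea of rendering spatial anchors "directly within the native RGB observation space" concrete, here is a minimal sketch of overlaying a crosshair and a bounding box on an image array. It is an illustration only, not the authors' implementation: the function names `draw_crosshair` and `draw_box`, the colors, the stroke sizes, and the (row, col) coordinate convention are all assumptions, and the planner that would supply the coordinates is not modeled.

```python
import numpy as np


def draw_crosshair(image, center, half_len=5, color=(255, 0, 0)):
    """Return a copy of `image` with a crosshair centred at (row, col).

    Hypothetical helper: marker size and colour are illustrative choices.
    """
    out = image.copy()
    h, w = out.shape[:2]
    r, c = center
    if not (0 <= r < h and 0 <= c < w):
        raise ValueError("center lies outside the image")
    # Clip the strokes so they never run past the image border.
    r0, r1 = max(r - half_len, 0), min(r + half_len + 1, h)
    c0, c1 = max(c - half_len, 0), min(c + half_len + 1, w)
    out[r, c0:c1] = color  # horizontal stroke
    out[r0:r1, c] = color  # vertical stroke
    return out


def draw_box(image, box, color=(0, 255, 0)):
    """Return a copy of `image` with a 1-pixel box outline.

    `box` is (top, left, bottom, right) in inclusive pixel coordinates
    (an assumed convention, not one taken from the paper).
    """
    out = image.copy()
    h, w = out.shape[:2]
    top, left, bottom, right = box
    # Clamp the box to the image bounds.
    top, bottom = max(top, 0), min(bottom, h - 1)
    left, right = max(left, 0), min(right, w - 1)
    if top > bottom or left > right:
        raise ValueError("box lies outside the image")
    out[top, left:right + 1] = color       # top edge
    out[bottom, left:right + 1] = color    # bottom edge
    out[top:bottom + 1, left] = color      # left edge
    out[top:bottom + 1, right] = color     # right edge
    return out


# Stand-in for a camera frame: a blank 64x64 RGB image.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
# Mark an assumed target object (crosshair) and goal region (box).
prompted = draw_box(draw_crosshair(frame, (20, 30)), (40, 10, 55, 25))
```

Because the prompts are written into the same pixel array the policy already consumes, the result keeps the shape and dtype of the original observation, which is the sense in which such a prompt is "modality-consistent".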
