VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

Zixuan Wang; Yuxin Chen; Yuqi Liu; Jinhui Ye; Pengguang Chen; Changsheng Lu; Shu Liu; Bei Yu; Jiaya Jia

arXiv:2603.22003 · cs.RO · May 12, 2026

TL;DR

VP-VLA introduces a structured visual prompting interface that decouples high-level reasoning from low-level control in vision-language-action models, improving spatial precision and robustness.

Contribution

The paper proposes a dual-system framework that uses visual prompts to improve instruction decomposition and control, and it reports outperforming existing end-to-end models.

Findings

01. VP-VLA surpasses state-of-the-art baselines in simulation and real-world tasks.

02. The visual prompting approach improves spatial accuracy and robustness.

03. A novel auxiliary grounding objective enhances low-level control reliability.

Abstract

Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level execution via a structured visual prompting interface. Specifically, a "System 2 Planner" decomposes complex instructions into sub-tasks and identifies relevant target objects and goal locations. These spatial anchors are rendered directly within the native RGB observation space as modality-consistent visual prompts, such as crosshairs and bounding boxes. This avoids the modality mismatch introduced…
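
To make the idea of rendering spatial anchors "directly within the native RGB observation space" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, colors, marker sizes, and the choice to mark the target object with a crosshair and the goal location with a bounding box are all assumptions made for illustration. The image is modeled as a nested list of RGB tuples so the example needs no external libraries.

```python
# Hypothetical sketch: overlay visual prompts on an RGB observation.
# All names, colors, and sizes here are illustrative assumptions,
# not details taken from the VP-VLA paper.

Image = list[list[tuple[int, int, int]]]  # rows of (R, G, B) pixels


def blank_image(height: int, width: int, color=(0, 0, 0)) -> Image:
    """Create a height x width image filled with a single color."""
    return [[color for _ in range(width)] for _ in range(height)]


def draw_crosshair(img: Image, cx: int, cy: int, arm: int = 3,
                   color=(255, 0, 0)) -> Image:
    """Return a copy of img with a crosshair centered at column cx, row cy."""
    out = [row[:] for row in img]
    h, w = len(out), len(out[0])
    for d in range(-arm, arm + 1):
        if 0 <= cy < h and 0 <= cx + d < w:
            out[cy][cx + d] = color  # horizontal arm
        if 0 <= cx < w and 0 <= cy + d < h:
            out[cy + d][cx] = color  # vertical arm
    return out


def draw_box(img: Image, x0: int, y0: int, x1: int, y1: int,
             color=(0, 255, 0)) -> Image:
    """Return a copy of img with a 1-pixel box outline (corners inclusive)."""
    out = [row[:] for row in img]
    h, w = len(out), len(out[0])
    for x in range(max(x0, 0), min(x1, w - 1) + 1):
        for y in (y0, y1):  # top and bottom edges
            if 0 <= y < h:
                out[y][x] = color
    for y in range(max(y0, 0), min(y1, h - 1) + 1):
        for x in (x0, x1):  # left and right edges
            if 0 <= x < w:
                out[y][x] = color
    return out


# Assumed planner output: a target-object point and a goal-location box.
obs = blank_image(16, 16)
prompted = draw_box(draw_crosshair(obs, cx=4, cy=5), x0=9, y0=9, x1=14, y1=13)
```

Because the markers are written into the pixels themselves, the annotated image keeps the same shape and modality as the raw observation, which is the property the abstract describes as "modality-consistent".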
