VPN: Visual Prompt Navigation

Shuo Feng; Zihan Wang; Yuchen Li; Rui Kong; Hengyi Cai; Shuaiqiang Wang; Gim Hee Lee; Piji Li; Shuqiang Jiang

arXiv:2508.01766·cs.CV·November 25, 2025

TL;DR

VPN introduces a visual prompt-based navigation paradigm that uses intuitive visual cues on top-view maps, reducing language ambiguity and improving navigation in complex environments, with new datasets and a specialized baseline network.

Contribution

The paper proposes a novel visual prompt navigation paradigm, creates new datasets, and develops a baseline network with data augmentation strategies for improved navigation performance.

Findings

1. Visual prompts improve navigation accuracy over language instructions.

2. Data augmentation strategies enhance the robustness of VPN models.

3. VPN outperforms baseline methods on the new datasets.

Abstract

While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a…
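To make the idea of a visual prompt concrete, the following is a minimal sketch of marking a trajectory on a 2D top-view grid. It is not the authors' pipeline: the grid representation, the function names `line_cells` and `draw_visual_prompt`, and the choice of Bresenham line rasterization are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col) cell on the top-view map; hypothetical convention


def line_cells(p0: Point, p1: Point) -> List[Point]:
    """Cells on the segment p0 -> p1 (endpoints included), via Bresenham's algorithm."""
    (r0, c0), (r1, c1) = p0, p1
    dx = abs(c1 - c0)
    dy = -abs(r1 - r0)
    step_c = 1 if c0 < c1 else -1
    step_r = 1 if r0 < r1 else -1
    err = dx + dy
    r, c = r0, c0
    cells = []
    while True:
        cells.append((r, c))
        if (r, c) == (r1, c1):
            break
        e2 = 2 * err
        if e2 >= dy:  # step along the column axis
            err += dy
            c += step_c
        if e2 <= dx:  # step along the row axis
            err += dx
            r += step_r
    return cells


def draw_visual_prompt(height: int, width: int, waypoints: List[Point]) -> List[List[int]]:
    """Return a height x width grid with 1 on every cell the trajectory passes through."""
    grid = [[0] * width for _ in range(height)]
    for start, end in zip(waypoints, waypoints[1:]):
        for r, c in line_cells(start, end):
            if 0 <= r < height and 0 <= c < width:  # ignore cells outside the map
                grid[r][c] = 1
    return grid


# Assumed example: an L-shaped path on a tiny 3 x 4 map.
prompt = draw_visual_prompt(3, 4, [(0, 0), (0, 3), (2, 3)])
for row in prompt:
    print(row)
```

In a real setting the marked cells would presumably be overlaid on a rendered top-down image of the scene rather than on an empty grid; the sketch only shows how a sequence of waypoints becomes a spatially grounded trajectory mask.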

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Social Robot Interaction and HRI