SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
Zhengcheng Wang, Zichuan Lin, Yijun Yang, Haobo Fu, Deheng Ye

TL;DR
This paper introduces SeeNav-Agent, a Vision-Language Navigation (VLN) framework that combines visual prompts with step-level reinforcement fine-tuning to improve navigation success rates and training stability on benchmark datasets.
Contribution
The paper proposes a dual-view visual prompt technique and a step-level reinforcement policy optimization method, advancing perception and planning in vision-language navigation models.
Findings
SeeNav-Agent achieved an 86.7% success rate with GPT-4.1 on EmbodiedBench, surpassing previous models by 20 percentage points.
Qwen2.5-VL-3B reached a 72.3% success rate, outperforming existing models by 5.6 percentage points.
Step Reward Group Policy Optimization (SRGPO) improves training stability, convergence speed, and model generalization.
Abstract
Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent's understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping…
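The abstract is cut off before it says how steps are grouped, so the following is only a rough illustration of what group-based step-level advantage estimation could look like, not the authors' SRGPO implementation. It assumes each step already carries a scalar process reward, partitions the steps into random groups, and normalizes each reward against its group's mean and standard deviation. The function name `step_group_advantages` and its parameters (`group_size`, `seed`, `eps`) are hypothetical.

```python
import random
from statistics import mean, pstdev


def step_group_advantages(step_rewards, group_size, seed=0, eps=1e-8):
    """Group-normalized advantages over randomly grouped steps.

    Assumption (not from the paper): steps are shuffled, split into
    consecutive chunks of `group_size`, and each reward is standardized
    within its chunk.
    """
    rng = random.Random(seed)
    indices = list(range(len(step_rewards)))
    rng.shuffle(indices)  # random assignment of steps to groups

    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        rewards = [step_rewards[i] for i in group]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # 0.0 when all rewards in the group match
        for i in group:
            # eps avoids division by zero for constant-reward groups
            advantages[i] = (step_rewards[i] - mu) / (sigma + eps)
    return advantages


# Hypothetical process rewards for four navigation steps
rewards = [1.0, 0.0, 0.5, -0.5]
advantages = step_group_advantages(rewards, group_size=4)
```

With a single group covering all steps, the advantages sum to approximately zero and the highest-reward step receives the largest advantage; smaller groups would make the baseline depend on which steps happen to be drawn together.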
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Reinforcement Learning in Robotics
