SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

Zhengcheng Wang; Zichuan Lin; Yijun Yang; Haobo Fu; Deheng Ye

arXiv:2512.02631·cs.LG·December 3, 2025

TL;DR

This paper introduces SeeNav-Agent, a Vision-Language Navigation (VLN) framework that combines dual-view visual prompts with step-level reinforcement fine-tuning to improve navigation success rates and training stability on EmbodiedBench.

Contribution

The paper proposes a dual-view Visual Prompt (VP) technique to reduce perception errors and a step-level reinforcement fine-tuning method, Step Reward Group Policy Optimization (SRGPO), to improve planning in vision-language navigation agents.

Findings

1. Achieved an 86.7% success rate with GPT-4.1 on EmbodiedBench, surpassing previous models by 20 percentage points.

2. Qwen2.5-VL-3B reached a 72.3% success rate, outperforming existing models by 5.6 percentage points.

3. SRGPO improves training stability, convergence speed, and model generalization.

Abstract

Existing Vision-Language Navigation (VLN) agents based on Large Vision-Language Models (LVLMs) often suffer from perception errors, reasoning errors, and planning errors, which significantly hinder their navigation performance. To address these limitations, a novel VLN agent framework, named SeeNav-Agent, is proposed in this work. First, to reduce perception hallucinations of the visual module of the VLN agent, a dual-view Visual Prompt (VP) technique is introduced in the input space, which can also improve the agent's understanding of current spatial states. Subsequently, a novel step-level Reinforcement Fine-Tuning (RFT) method, Step Reward Group Policy Optimization (SRGPO), is designed for the post-training of VLN agents. In SRGPO, we first define verifiable process rewards for the navigation task, and then perform efficient step-level advantage estimation by randomly grouping…
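
The abstract is cut off before it explains how SRGPO groups steps, so the following is only a minimal sketch of the general idea of group-relative, step-level advantage estimation. It assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation) applied to per-step process rewards, with steps assigned to groups by a random shuffle. The function name `step_level_advantages` and the parameters `group_size`, `seed`, and `eps` are hypothetical and do not come from the paper.

```python
import random
from statistics import mean, pstdev


def step_level_advantages(step_rewards, group_size, seed=0, eps=1e-8):
    """Return one advantage per step, normalized within a random group.

    Assumption (not from the paper): steps are shuffled, cut into
    consecutive groups of `group_size`, and each step's reward is
    standardized against the other rewards in its group.
    """
    rng = random.Random(seed)
    indices = list(range(len(step_rewards)))
    rng.shuffle(indices)  # random assignment of steps to groups

    advantages = [0.0] * len(step_rewards)
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        rewards = [step_rewards[i] for i in group]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std; 0.0 if all rewards match
        for i in group:
            # eps avoids division by zero when a group has identical rewards
            advantages[i] = (step_rewards[i] - mu) / (sigma + eps)
    return advantages
```

Calling `step_level_advantages([1.0, 0.0, 1.0, 0.0], group_size=4)` places all four steps in one group, so steps with reward 1.0 receive a positive advantage and steps with reward 0.0 a negative one. Because each step is scored against its group rather than against a learned value function, no critic network is needed in this sketch; whether SRGPO makes the same choice is not stated in the truncated abstract.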

Code & Models

Models

wangzc9865/SeeNav-Agent (Hugging Face model · 9 downloads · 1 like)

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Reinforcement Learning in Robotics