From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

Honglin He; Yukai Ma; Wayne Wu; Bolei Zhou

arXiv:2507.22028·cs.CV·July 30, 2025

3 Reviews

TL;DR

This paper introduces the Seeing-to-Experiencing framework that enhances navigation foundation models with reinforcement learning, improving their interactivity, safety, and generalization in real-world urban environments through novel training strategies and comprehensive benchmarking.

Contribution

It presents a framework that combines pretraining on videos with RL-based post-training, along with two training strategies (Anchor-Guided Distribution Matching and a Residual-Attention Module) and a benchmark for evaluating navigation models in photorealistic environments.

Findings

01

RL improves safety and interactivity over offline training alone.

02

The Anchor-Guided Distribution Matching stabilizes learning and models diverse motions.

03

RL-based post-training outperforms supervised fine-tuning in robot navigation tasks.

Abstract

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or to adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL. It maintains the generalizability acquired from large-scale real-world videos while enhancing its interactivity through RL in simulation environments. Specifically,…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

Comprehensive System: The paper presents an end-to-end system from data collection through real-world deployment, which requires significant engineering effort.

Cross-Embodiment Evaluation: Testing on wheeled, quadruped, and humanoid robots (Table 6) demonstrates some generality, though the humanoid results are quite poor.

Honest Limitations Discussion: Section 5 acknowledges the limitation of vision-only approaches and collision failures.

Detailed Implementation: The appendix provides extens

Weaknesses

1. Lack of Theoretical Insight: The paper doesn't explain why RL helps beyond showing empirical improvements. What specific failure modes of offline learning does RL address? What inductive biases make certain behaviors learnable only through interaction?

2. Modest Improvements: Many reported improvements are marginal (e.g., Table 2: 0.51 vs. 0.32 success rate for wheeled robot). Given the additional computational cost of RL (8 hours on L40S GPU), the cost-benefit tradeoff is unclear.

Cherry-Pi

Reviewer 02·Rating 8·Confidence 3

Strengths

An unusual combination of SFT and RL for learning-based navigation - while existing approaches focus on one or the other, this paper shows the value of doing both.

Weaknesses

The anchor description is not very clear. The anchors seem to be constant-curvature arcs; a clearer explanation is needed. It is also unclear how constant-curvature arcs approximate the full diversity of demonstration paths. Is the matching performed instantaneously, i.e., is an instantaneous (v_x, ω) command mapped to the corresponding curvature?
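
To make the reviewer's question concrete: under a standard unicycle model, holding a command (v, ω) constant traces an arc of curvature κ = ω / v. The sketch below rolls out such an arc in the robot frame. It illustrates the reviewer's reading only and is not confirmed to be the paper's anchor definition; the unicycle model, the function names `arc_points` and `curvature`, and the `horizon` and `steps` parameters are all assumptions made for illustration.

```python
import math


def curvature(v, omega):
    """Curvature of the arc traced by a constant (v, omega) command."""
    if abs(v) < 1e-9:
        raise ValueError("curvature is undefined for zero linear velocity")
    return omega / v


def arc_points(v, omega, horizon=2.0, steps=8):
    """Roll out a constant (v, omega) command under a unicycle model.

    The robot starts at the origin facing +x. Returns a list of
    (x, y, heading) samples at evenly spaced times up to `horizon` seconds.
    """
    points = []
    for i in range(1, steps + 1):
        t = horizon * i / steps
        theta = omega * t
        if abs(omega) < 1e-9:
            # Zero angular velocity: the arc degenerates to a straight line.
            x, y = v * t, 0.0
        else:
            radius = v / omega  # signed turning radius, 1 / curvature
            x = radius * math.sin(theta)
            y = radius * (1.0 - math.cos(theta))
        points.append((x, y, theta))
    return points
```

Sweeping ω over a small set of values at a fixed v would give a fan of candidate arcs, which is one plausible way to picture a fixed anchor set.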

Reviewer 03·Rating 4·Confidence 4

Strengths

1. This paper proposes Anchor-Guided Distribution Matching (AGDM), which uses an anchor-guided Gaussian Mixture Model to model multimodal navigation trajectories, capturing diverse valid actions under the same observation while ensuring training stability.

2. This paper designs the Residual-Attention Module (RAM), which freezes pretrained components and adds trainable residual branches to cross-attention layers, enabling the model to gain interactive skills via RL without losing pretrained gene
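
The anchor-guided Gaussian Mixture Model idea mentioned in the first strength can be pictured with a small sketch: each mixture component is centered on a fixed anchor plus a predicted offset, and the loss is the negative log-likelihood of the demonstrated target. This is a generic mixture-density formulation written under stated assumptions, not the paper's actual AGDM loss; the 2D targets, isotropic fixed `sigma`, and the names `gmm_nll` and `log_sum_exp` are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gmm_nll(target, anchors, offsets, logits, sigma=0.5):
    """Negative log-likelihood of a 2D target under an anchor-centered GMM.

    Component k has mean anchors[k] + offsets[k], an isotropic covariance
    sigma^2 * I (assumed fixed here), and weight softmax(logits)[k].
    """
    log_z = log_sum_exp(logits)  # normalizer for the mixture weights
    log_terms = []
    for (ax, ay), (dx, dy), logit in zip(anchors, offsets, logits):
        mean_x, mean_y = ax + dx, ay + dy
        sq_dist = (target[0] - mean_x) ** 2 + (target[1] - mean_y) ** 2
        # Log-density of a 2D isotropic Gaussian.
        log_pdf = -sq_dist / (2 * sigma**2) - math.log(2 * math.pi * sigma**2)
        log_terms.append((logit - log_z) + log_pdf)
    return -log_sum_exp(log_terms)
```

With one anchor veering left and one veering right, a target near either anchor receives a low loss, so two different valid maneuvers under the same observation are not averaged into a single unimodal prediction.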

Weaknesses

1. The model relies solely on visual input and lacks 3D perception capabilities, leading to occasional failure in obstacle avoidance in some scenarios, which is a persistent limitation for vision-only navigation approaches.

2. The real-world evaluation scenarios are relatively limited (only 25 scenarios), and the generalization performance of the S2E framework in more complex and diverse urban environments (e.g., extreme weather, complex traffic conditions) has not been verified.

3. The humano
