Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned
Maeva Guerrier, Karthik Soma, Jana Pavlasek, Giovanni Beltrame

TL;DR
This paper evaluates five state-of-the-art visual navigation models in real-world settings, revealing their limitations in collision avoidance, location discrimination, and robustness to environmental changes.
Contribution
It provides a real-world benchmark of Visual Navigation Models (VNMs) that goes beyond success rate, combining path-based metrics with vision-based goal-recognition scores, and highlights key limitations for future research.
Findings
Models frequently collide, showing limited geometric understanding.
Models struggle to distinguish similar locations, causing goal prediction errors.
Performance drops under environmental distribution shifts.
Abstract
Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sunflare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions,…
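To make the idea of a controlled image perturbation concrete, here is a minimal sketch of a horizontal motion blur applied to a camera frame. It is an illustration only: the function name `motion_blur`, the `kernel_size` parameter, and the choice of a uniform horizontal kernel are assumptions, not the perturbation pipeline used by the authors.

```python
import numpy as np


def motion_blur(image: np.ndarray, kernel_size: int = 9) -> np.ndarray:
    """Hypothetical helper: horizontal motion blur on an H x W x C uint8 image.

    Each output pixel is the mean of `kernel_size` horizontally adjacent
    input pixels, which approximates a camera moving sideways during exposure.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")

    pad = kernel_size // 2
    img = image.astype(np.float32)

    # Replicate edge pixels so the output keeps the input width.
    padded = np.pad(img, ((0, 0), (pad, pad), (0, 0)), mode="edge")

    # Sliding-window mean along the width axis.
    blurred = np.zeros_like(img)
    for offset in range(kernel_size):
        blurred += padded[:, offset:offset + img.shape[1], :]
    blurred /= kernel_size

    return np.clip(np.rint(blurred), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    # Synthetic frame standing in for a robot camera observation (assumption).
    rng = np.random.default_rng(0)
    frame = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
    perturbed = motion_blur(frame, kernel_size=9)
    print(frame.shape, perturbed.shape)
```

In a sketch like this, increasing `kernel_size` gives a stronger blur, so the same navigation episode could be replayed at several severity levels to see where a model's behavior starts to degrade.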
