Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?
Nitesh Subedi, Adam Haroon, Shreyan Ganguly, Samuel T.K. Tetteh, Prajwal Koirala, Cody Fleming, Soumik Sarkar

TL;DR
This paper investigates whether frozen pretrained vision-language embeddings alone can guide robot navigation without fine-tuning or specialized modules, revealing their strengths in language grounding but limitations in planning and spatial reasoning.
Contribution
It introduces a minimalist framework that trains a behavior cloning policy directly on frozen embeddings, providing an empirical baseline for foundation models in navigation tasks.
Findings
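Achieves a 74% success rate in navigation to language-specified targets, versus 100% for the state-aware expert, while requiring 3.2 times more steps on average
Pretrained embeddings support basic language grounding
Pretrained embeddings struggle with long-horizon planning and spatial reasoning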
Abstract
Foundation models have revolutionized robotics by providing rich semantic representations without task-specific training. While many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, the fundamental question remains: can these pretrained embeddings alone successfully guide navigation without additional fine-tuning or specialized modules? We present a minimalist framework that decouples this question by training a behavior cloning policy directly on frozen vision-language embeddings from demonstrations collected by a privileged expert. Our approach achieves a 74% success rate in navigation to language-specified targets, compared to 100% for the state-aware expert, though requiring 3.2 times more steps on average. This performance gap reveals that pretrained embeddings effectively support basic language grounding but struggle with…
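The core idea in the abstract, fitting a behavior cloning policy on top of frozen embeddings using expert demonstrations, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: random vectors stand in for the frozen vision-language embeddings, a hidden linear rule stands in for the privileged expert, the policy is a plain linear softmax classifier, and the names `train_bc_policy` and `act` are hypothetical.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_bc_policy(embeddings, expert_actions, num_actions, lr=0.5, epochs=300):
    """Fit a linear softmax policy on fixed embeddings (hypothetical helper).

    Behavior cloning here is supervised classification: minimize the
    cross-entropy between the policy's action distribution and the expert's
    action. The embeddings are never updated; only W and b are trained.
    """
    n, dim = embeddings.shape
    W = np.zeros((dim, num_actions))
    b = np.zeros(num_actions)
    onehot = np.eye(num_actions)[expert_actions]
    losses = []
    for _ in range(epochs):
        probs = softmax(embeddings @ W + b)
        picked = probs[np.arange(n), expert_actions]
        losses.append(float(-np.mean(np.log(picked + 1e-12))))
        grad_logits = (probs - onehot) / n  # gradient of mean cross-entropy
        W -= lr * (embeddings.T @ grad_logits)
        b -= lr * grad_logits.sum(axis=0)
    return W, b, losses


def act(W, b, embedding):
    """Greedy action for a single embedding vector (hypothetical helper)."""
    return int(np.argmax(embedding @ W + b))


rng = np.random.default_rng(0)

# ASSUMPTION: random vectors stand in for frozen vision-language embeddings
# of (image, instruction) pairs. A real pipeline would compute these once
# with a pretrained encoder and keep the encoder's weights fixed.
embeddings = rng.normal(size=(512, 16))

# ASSUMPTION: a hidden linear rule stands in for the privileged expert that
# labels each observation with one of 4 discrete actions.
hidden_rule = rng.normal(size=(16, 4))
expert_actions = np.argmax(embeddings @ hidden_rule, axis=1)

W, b, losses = train_bc_policy(embeddings, expert_actions, num_actions=4)
predicted = np.argmax(embeddings @ W + b, axis=1)
accuracy = float(np.mean(predicted == expert_actions))
print(f"imitation accuracy on the demonstrations: {accuracy:.2f}")
```

Agreement with the expert on the demonstration set is only a proxy: the success rate and step counts quoted in the abstract come from rolling the policy out in the environment, which this sketch does not attempt.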
Taxonomy
Topics: Multimodal Machine Learning Applications
