Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Nitesh Subedi; Adam Haroon; Shreyan Ganguly; Samuel T.K. Tetteh; Prajwal Koirala; Cody Fleming; Soumik Sarkar

arXiv:2506.14507·cs.RO·June 18, 2025


Open Access

TL;DR

This paper investigates whether frozen pretrained vision-language embeddings can guide robot navigation on their own, without fine-tuning or specialized navigation modules. It finds that they support basic language grounding but fall short on long-horizon planning and spatial reasoning.

Contribution

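The paper introduces a minimalist framework that trains a behavior cloning policy directly on frozen vision-language embeddings, using demonstrations collected by a privileged expert. This provides an empirical baseline for what foundation-model embeddings can do in navigation tasks.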
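As a rough illustration of behavior cloning on frozen features, the sketch below fits a policy by regressing expert actions from fixed input vectors. It is not the authors' implementation: the linear policy, the 8-dimensional inputs, the 2-dimensional continuous action, and all function names are assumptions made for this example, and random vectors stand in for the vision-language embeddings (no pretrained model is loaded).

```python
import random


def make_demos(n, dim, act_dim, seed=0):
    """Build toy (embedding, expert_action) pairs.

    Random vectors stand in for frozen vision-language embeddings, and a
    hidden linear map plays the role of the privileged expert. Both are
    assumptions for illustration only.
    """
    rng = random.Random(seed)
    w_expert = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(act_dim)]
    demos = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(dim)]
        a = [sum(w * x for w, x in zip(row, z)) for row in w_expert]
        demos.append((z, a))
    return demos


def predict(weights, z):
    """Linear policy: action = W @ embedding."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def mse(weights, demos):
    """Mean squared error between policy actions and expert actions."""
    total = 0.0
    for z, a in demos:
        total += sum((p - t) ** 2 for p, t in zip(predict(weights, z), a))
    return total / len(demos)


def train_bc(demos, dim, act_dim, lr=0.05, epochs=200):
    """Behavior cloning by full-batch gradient descent on the MSE loss.

    Only the policy weights are updated; the embeddings are treated as
    fixed inputs, which is what "frozen" means here.
    """
    weights = [[0.0] * dim for _ in range(act_dim)]
    n = len(demos)
    for _ in range(epochs):
        grads = [[0.0] * dim for _ in range(act_dim)]
        for z, a in demos:
            pred = predict(weights, z)
            for i in range(act_dim):
                err = pred[i] - a[i]
                for j in range(dim):
                    grads[i][j] += 2.0 * err * z[j] / n
        for i in range(act_dim):
            for j in range(dim):
                weights[i][j] -= lr * grads[i][j]
    return weights


demos = make_demos(n=200, dim=8, act_dim=2)
untrained = [[0.0] * 8 for _ in range(2)]
loss_before = mse(untrained, demos)
policy = train_bc(demos, dim=8, act_dim=2)
loss_after = mse(policy, demos)
print(f"imitation loss: {loss_before:.4f} -> {loss_after:.6f}")
```

In a real setup, `make_demos` would be replaced by embeddings computed once from a frozen encoder and paired with logged expert actions. A low imitation loss on the demonstrations does not by itself imply a high navigation success rate at rollout time.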

Findings

01. Achieves a 74% success rate in navigation to language-specified targets
02. Pretrained embeddings support basic language grounding
03. Pretrained embeddings struggle with long-horizon planning and spatial reasoning

Abstract

Foundation models have revolutionized robotics by providing rich semantic representations without task-specific training. While many approaches integrate pretrained vision-language models (VLMs) with specialized navigation architectures, the fundamental question remains: can these pretrained embeddings alone successfully guide navigation without additional fine-tuning or specialized modules? We present a minimalist framework that isolates this question by training a behavior cloning policy directly on frozen vision-language embeddings from demonstrations collected by a privileged expert. Our approach achieves a 74% success rate in navigation to language-specified targets, compared to 100% for the state-aware expert, though it requires 3.2 times more steps on average. This performance gap reveals that pretrained embeddings effectively support basic language grounding but struggle with…


Code & Models

Repositories

oadamharoon/text2nav (PyTorch, official)


Taxonomy

Topics: Multimodal Machine Learning Applications