VLD: Visual Language Goal Distance for Reinforcement Learning Navigation

Lazar Milikic; Manthan Patel; Jonas Frey

arXiv:2512.07976·cs.RO·March 17, 2026

TL;DR

This paper introduces VLD, a scalable framework for goal-conditioned robotic navigation that leverages a self-supervised distance predictor trained on internet video data, enabling effective sim-to-real transfer and semantic goal understanding.

Contribution

The paper proposes a decoupled learning framework that separates perception learning from policy learning: a self-supervised distance-to-goal predictor is trained on internet-scale video data, and a reinforcement learning policy is then trained in simulation to minimize that distance.
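
As a rough illustration of how a distance label can come from video without any action annotations, the sketch below samples frame pairs from one video and uses the frame gap as the target. This is the generic temporal-distance recipe, assumed here for illustration only; this summary does not specify VLD's actual labeling scheme. The names `sample_distance_pairs` and `max_gap` are hypothetical.

```python
import random


def sample_distance_pairs(num_frames, num_pairs, max_gap, seed=0):
    """Sample (current_idx, goal_idx, distance) triples from one video.

    Assumption (not from the paper): the distance label is the number of
    frames between the current frame and a later goal frame, at most max_gap.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Pick any frame except the last one as the current observation.
        current = rng.randrange(num_frames - 1)
        # The goal is a strictly later frame within max_gap steps.
        last_goal = min(num_frames - 1, current + max_gap)
        goal = rng.randint(current + 1, last_goal)
        pairs.append((current, goal, goal - current))
    return pairs


pairs = sample_distance_pairs(num_frames=100, num_pairs=5, max_gap=20)
```

Each triple could then serve as one regression example for a predictor that takes the current frame and a goal (image or text) and outputs a distance.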

Findings

01

VLD outperforms prior temporal distance methods such as ViNT and VIP.

02

The approach achieves strong sim-to-real transfer in robotic navigation tasks.

03

Decoupled training enables scalable and robust goal-conditioned policies.

Abstract

Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning. Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected…
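
To make "a distance signal that can be minimized by an RL policy" concrete, here is a minimal sketch of one common way to turn a distance-to-goal value into a per-step reward: reward the reduction in distance and add a bonus on arrival. The reward form, the function name `distance_reduction_reward`, and the parameters `goal_threshold` and `success_bonus` are assumptions for illustration, not the paper's formulation. The distance is treated as a plain float, whether it comes from a simulator's geometric distance or from a learned predictor, and the truncated part of the abstract is not reconstructed here.

```python
def distance_reduction_reward(prev_distance, curr_distance,
                              goal_threshold=0.5, success_bonus=10.0):
    """Return (reward, reached) for one environment step.

    Assumed shaping: positive reward when the distance to the goal shrinks,
    negative when it grows, plus a fixed bonus once within goal_threshold.
    """
    reward = prev_distance - curr_distance
    reached = curr_distance <= goal_threshold
    if reached:
        reward += success_bonus
    return reward, reached


# Hypothetical distance readings along one episode.
distances = [5.0, 4.0, 4.5, 2.0, 0.3]
total_reward = 0.0
reached_goal = False
for prev, curr in zip(distances, distances[1:]):
    step_reward, reached_goal = distance_reduction_reward(prev, curr)
    total_reward += step_reward
```

Because the policy only ever sees a scalar distance in this sketch, the source of that scalar can be swapped between training and deployment without changing the reward code, which is the practical appeal of decoupling perception from policy learning.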

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Advanced Neural Network Applications