Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments

Amirreza Payandeh, Anuj Pokhrel, Daeun Song, Marcos Zampieri, and Xuesu Xiao

arXiv:2506.14233·cs.RO·June 18, 2025

TL;DR

Narrate2Nav is a real-time visual navigation model that uses implicit language reasoning and social cues, trained with a novel self-supervised framework, to improve robot navigation in human-centric environments.

Contribution

The paper introduces a self-supervised learning framework that embeds natural language reasoning and social cues into a visual encoder for real-time robot navigation.

Findings

01 Achieved over 50% improvement in offline dataset navigation accuracy.

02 Demonstrated real-world navigation success with over 40% improvement.

03 Visual attention maps show enhanced focus on critical scene elements.

Abstract

Large Vision-Language Models (VLMs) have demonstrated potential in enhancing mobile robot navigation in human-centric environments by understanding contextual cues, human intentions, and social dynamics while exhibiting reasoning capabilities. However, their computational complexity and limited sensitivity to continuous numerical data impede real-time performance and precise motion control. To this end, we propose Narrate2Nav, a novel real-time vision-action model that leverages a novel self-supervised learning framework based on the Barlow Twins redundancy reduction loss to embed implicit natural language reasoning, social cues, and human intentions within a visual encoder, enabling reasoning in the model's latent space rather than token space. The model combines RGB inputs, motion commands, and textual signals of scene context during training to bridge from robot observations to…
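
The abstract names the Barlow Twins redundancy reduction loss as the basis of the training framework but does not spell it out. The sketch below shows the generic form of that loss in plain Python, as a rough illustration only and not the authors' implementation. It assumes two batches of embeddings of equal size and dimension; treating them as a visual embedding and a text embedding of the same scene (the parameter names `z_visual` and `z_text`) is an assumption made here for illustration, as is the off-diagonal weight of 5e-3.

```python
import math


def standardize(batch, eps=1e-9):
    """Normalize each embedding dimension to zero mean and unit std across the batch."""
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in batch) / n) + eps
        for d in range(dim)
    ]
    return [[(row[d] - means[d]) / stds[d] for d in range(dim)] for row in batch]


def barlow_twins_loss(z_visual, z_text, off_diag_weight=5e-3):
    """Generic Barlow Twins loss between two batches of embeddings.

    z_visual, z_text: lists of equal-length float lists (hypothetical names).
    off_diag_weight: trade-off between the two terms (assumed value).
    """
    n = len(z_visual)
    dim = len(z_visual[0])
    a = standardize(z_visual)
    b = standardize(z_text)

    # Cross-correlation matrix between the dimensions of the two embeddings.
    c = [
        [sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(dim)]
        for i in range(dim)
    ]

    # Invariance term: pull diagonal entries toward 1 so paired embeddings agree.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(dim))
    # Redundancy reduction term: push off-diagonal entries toward 0.
    off_diag = sum(
        c[i][j] ** 2 for i in range(dim) for j in range(dim) if i != j
    )
    return on_diag + off_diag_weight * off_diag


# Toy batch: four samples, two decorrelated embedding dimensions.
toy = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_matched = barlow_twins_loss(toy, toy)
loss_flipped = barlow_twins_loss(toy, [[-x for x in row] for row in toy])
```

In this toy case the loss is close to zero when both inputs are identical and decorrelated, and it grows when the second input is sign-flipped, because the diagonal of the cross-correlation matrix moves from 1 to -1.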

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques