Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities

Liuyi Wang; Xinyuan Xia; Hui Zhao; Hanqing Wang; Tai Wang; Yilun Chen; Chengju Liu; Qijun Chen; Jiangmiao Pang

arXiv:2507.13019 · cs.RO · September 29, 2025

TL;DR

This paper introduces VLN-PE, a physically realistic platform for vision-and-language navigation that supports humanoid, quadruped, and wheeled robots. Evaluating current models on it reveals significant performance degradation in physical settings, and the platform provides a foundation for more robust, adaptable VLN systems.

Contribution

We present VLN-PE, a comprehensive, physically realistic VLN platform that enables systematic evaluation of navigation models in physical robotic settings and highlights key physical and environmental challenges.

Findings

01 Performance drops due to limited observation space and lighting variations.

02 Physical challenges like collisions and falls significantly impact navigation success.

03 Legged robots face locomotion constraints in complex environments.

Abstract

Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion…

Code & Models

Models

InternRobotics/VLN-PE (Hugging Face model · 6 downloads · 5 likes)

Taxonomy

Topics: Language, Metaphor, and Cognition · Subtitles and Audiovisual Media · Categorization, perception, and language