NaVILA: Legged Robot Vision-Language-Action Model for Navigation
An-Chieh Cheng, Yandong Ji, Zhaojing Yang, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin, Sifei Liu, Xiaolong Wang

TL;DR
NaVILA introduces a hierarchical framework for legged robot navigation that translates human language commands into mid-level actions and low-level controls, enabling effective navigation in complex environments.
Contribution
The paper presents NaVILA, a novel two-level model unifying vision-language understanding with locomotion skills for legged robots, improving navigation in cluttered scenes.
Findings
Outperforms previous methods on existing benchmarks.
Demonstrates effectiveness in realistic scenes and real-world experiments.
Enables flexible human-robot communication for navigation.
Abstract
This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to command but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from the VLA, NaVILA first generates mid-level actions with spatial information in the form of language (e.g., "moving forward 75cm"), which serve as input for a visual locomotion RL policy for execution. NaVILA substantially improves on previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more…
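The mid-level interface described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows one plausible way to turn a language-form action such as "moving forward 75cm" into a timed velocity command that a locomotion policy could track. The names `VelocityCommand` and `parse_mid_level_action`, the turn and stop phrasings, and the nominal speeds are all assumptions made for illustration.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical nominal speeds; the source does not specify these values.
LINEAR_SPEED_M_S = 0.5
ANGULAR_SPEED_RAD_S = 0.5


@dataclass
class VelocityCommand:
    vx: float        # forward velocity in m/s
    wz: float        # yaw rate in rad/s (positive = turn left)
    duration: float  # how long to hold the command, in seconds


# Assumed phrasings; only "moving forward 75cm" appears in the source.
_FORWARD = re.compile(r"forward\s+(\d+(?:\.\d+)?)\s*(cm|m)\b", re.IGNORECASE)
_TURN = re.compile(
    r"turn(?:ing)?\s+(left|right)\s+(\d+(?:\.\d+)?)\s*degrees?", re.IGNORECASE
)


def parse_mid_level_action(text: str) -> VelocityCommand:
    """Map a language-form mid-level action to a timed velocity command."""
    match = _FORWARD.search(text)
    if match:
        value = float(match.group(1))
        distance_m = value / 100.0 if match.group(2).lower() == "cm" else value
        return VelocityCommand(LINEAR_SPEED_M_S, 0.0, distance_m / LINEAR_SPEED_M_S)

    match = _TURN.search(text)
    if match:
        sign = 1.0 if match.group(1).lower() == "left" else -1.0
        angle_rad = math.radians(float(match.group(2)))
        return VelocityCommand(
            0.0, sign * ANGULAR_SPEED_RAD_S, angle_rad / ANGULAR_SPEED_RAD_S
        )

    if "stop" in text.lower():
        return VelocityCommand(0.0, 0.0, 0.0)

    raise ValueError(f"Unrecognized mid-level action: {text!r}")


if __name__ == "__main__":
    print(parse_mid_level_action("moving forward 75cm"))
```

Keeping the interface in text form means the high-level model and the low-level policy only need to agree on a small vocabulary of actions, which is the separation the abstract argues for; the fixed-speed conversion here is just one simple choice among many.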
Peer Reviews
Decision·Submitted to ICLR 2025
- Real-World Deployment: The authors have successfully implemented their navigation policy on a real legged robot, proving its capability to execute complex navigational tasks in diverse real-world scenarios, including both indoor and outdoor settings.
- Good experimental results in simulated navigation environments: The proposed method outperforms previous approaches for navigation in continuous environments when using only a single RGB camera.
A significant issue with the paper is that, despite its emphasis on legged locomotion, the navigation skills demonstrated do not inherently require this form of mobility. All the navigation tasks demonstrated could likely be accomplished by wheeled robots without altering the proposed framework. In contrast, prior works on VLA models for legged robots have incorporated capabilities unique to legged robots, such as climbing and crawling under obstacles [1, 2, 3, 4]. The action space of the propos
1. This paper extends NaVid to legged robot scenarios and demonstrates strong performance in real-world applications.
2. The framework effectively integrates 3D scenes into the policy, enhancing robustness, particularly in real scenarios. This is a commendable attempt.
1. One issue that remains is the inference time, particularly in the demo where the quadruped robot must wait for the large model to complete inference before executing tasks. This waiting period for model inference can lead to instability in motion, which is a point worth considering for resolution.
2. It is unclear whether the model's performance has been tested in an open environment. Currently, the demos presented are tests conducted in a semi-open environment. The authors could consider in
- The paper contributes insights into how to adapt a VLM to predict actions conditioned on present and past visual context. The paper shows how to factor the navigation task into predicting mid-level actions described in text form, followed by learning a low-level policy. The paper describes the training strategy and the SFT data blend, as well as the inclusion of auxiliary tasks during training.
- Authors demonstrate their results on common benchmarks, extensively comparing their results to ot
A key motivation for NaVILA is the ability to generalize through the separation between high-level action prediction and a low-level policy. The experiments demonstrated are impressive in their sequentiality but do not fully highlight the generalization capacity. Specifically, they do not show cases where the low-level plan changes with high-level inputs and vice versa.
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization · Robotics and Automated Systems
