NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation

Weiye Zhu; Zekai Zhang; Xiangchen Wang; Hewei Pan; Teng Wang; Tiantian Geng; Rongtao Xu; Feng Zheng

arXiv:2601.18188 · cs.CV · March 17, 2026

TL;DR

NaVIDA introduces inverse dynamics supervision and hierarchical action chunking to improve vision-language navigation, resulting in more stable, generalizable, and efficient agent behaviors in complex environments.

Contribution

The paper proposes a novel VLN framework that embeds action-grounded visual dynamics through inverse dynamics supervision and hierarchical action chunking, enhancing navigation stability and planning.

Findings

1. NaVIDA outperforms state-of-the-art methods with fewer parameters.

2. It demonstrates improved stability and generalization in navigation tasks.

3. Real-world robot tests confirm practical effectiveness.

Abstract

Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly action-grounded visual dynamics modeling. Lacking awareness of how actions transform subsequent visual observations, agents cannot plan actions rationally, leading to unstable behaviors, weak generalization, and cumulative errors along the trajectory. To address these issues, we introduce NaVIDA (Navigation with Inverse Dynamics Augmentation), a lightweight VLN framework that incorporates inverse dynamics supervision (IDS) as an explicit objective to embed action-grounded visual dynamics into policy learning. By jointly optimizing this visual dynamics with instruction-conditioned action prediction in a shared…
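The abstract describes the inverse dynamics objective only at a high level, so the following is a generic sketch of how an inverse dynamics term is commonly combined with an action-prediction loss, not the paper's actual formulation. The function names, the discrete action list, and the weight `idm_weight` are all assumptions made for illustration. The idea shown: the policy head scores actions from the instruction and the current observation, while the inverse dynamics head must recover the same action from the pair of observations before and after it.

```python
import math

# Illustrative discrete action set (an assumption, not taken from the paper).
ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    return -math.log(softmax(logits)[target])

def joint_loss(policy_logits, idm_logits, action, idm_weight=0.5):
    """Action-prediction loss plus a weighted inverse dynamics loss (one step).

    policy_logits: action scores given (instruction, observation o_t)
    idm_logits:    action scores given the observation pair (o_t, o_{t+1})
    action:        index of the action taken between o_t and o_{t+1}
    idm_weight:    hypothetical trade-off coefficient (an assumption)
    """
    policy_term = cross_entropy(policy_logits, action)
    idm_term = cross_entropy(idm_logits, action)
    return policy_term + idm_weight * idm_term

# Toy step: both heads should assign high probability to "forward" (index 0).
loss = joint_loss(
    policy_logits=[2.0, 0.1, 0.1, -1.0],
    idm_logits=[3.0, 0.0, 0.0, 0.0],
    action=ACTIONS.index("forward"),
)
print(round(loss, 4))
```

In a real model both sets of logits would come from a shared backbone, so gradients from the inverse dynamics term also shape the representation used by the policy; here they are plain lists to keep the sketch self-contained.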

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning