Efficient Image-Goal Navigation with Representative Latent World Model
Zhiwei Zhang, Hui Zhang, Kaihong Huang, Chenghao Shi, Huimin Lu

TL;DR
This paper introduces ReL-NWM, a novel latent space world model for efficient image-goal navigation that bypasses pixel-level reconstruction, enabling fast planning and successful real-world deployment.
Contribution
The paper presents ReL-NWM, a high-level semantic latent space model that improves navigation efficiency and performance over traditional pixel-based world models.
Findings
Achieves state-of-the-art trajectory prediction accuracy.
Demonstrates effective image-goal navigation in benchmarks.
Successfully deployed on a real humanoid robot.
Abstract
World models enable robots to conduct counterfactual reasoning in physical environments by predicting future world states. While conventional approaches often prioritize pixel-level reconstruction of future scenes, such detailed rendering is computationally intensive and unnecessary for planning tasks like navigation. We therefore propose that prediction and planning can be efficiently performed directly within a latent space of high-level semantic representations. To realize this, we introduce the Representative Latent space Navigation World Model (ReL-NWM). Rather than relying on reconstruction-oriented latent embeddings, our method leverages a pre-trained representation encoder, DINOv3, and incorporates specialized mechanisms to effectively integrate action signals and historical context within this representation space. By operating entirely in the latent domain, our model bypasses…
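The abstract's central idea, planning in a latent space instead of rendering future frames, can be illustrated with a minimal sketch. This is not the ReL-NWM implementation. The linear dynamics `A` and `B`, the random-shooting planner, and the function names `rollout` and `plan` are all hypothetical stand-ins. In the paper's setting the latents would come from a pre-trained encoder such as DINOv3 and the dynamics would be a learned predictor. Here both are plain NumPy arrays.

```python
import numpy as np


def rollout(z0, actions, A, B):
    """Predict the final latent under a toy linear model z' = A z + B a.

    Assumption: a linear map stands in for a learned latent predictor.
    """
    z = z0
    for a in actions:
        z = A @ z + B @ a
    return z


def plan(z0, z_goal, A, B, horizon=5, n_samples=256, seed=0):
    """Random-shooting planner in latent space (illustrative only).

    Samples action sequences, rolls each one out in latent space, and
    returns the sequence whose predicted final latent is closest to the
    goal latent, together with that distance. No pixels are generated.
    """
    rng = np.random.default_rng(seed)
    action_dim = B.shape[1]
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    # Keep a "stand still" baseline so the result is never worse than no motion.
    candidates[0] = 0.0
    costs = np.array(
        [np.linalg.norm(rollout(z0, seq, A, B) - z_goal) for seq in candidates]
    )
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])


# Toy usage: latent_dim = 8, action_dim = 2, goal reachable by a known sequence.
_rng = np.random.default_rng(42)
A_toy = np.eye(8)
B_toy = _rng.normal(size=(8, 2))
z_start = np.zeros(8)
z_target = rollout(z_start, np.full((5, 2), 0.5), A_toy, B_toy)
best_actions, best_cost = plan(z_start, z_target, A_toy, B_toy)
print(best_actions.shape, best_cost)
```

The cost here is a Euclidean distance between the predicted and goal latents. The measure used in the paper may differ, and a sampling-based optimizer such as CEM could replace the single round of random shooting.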
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
