NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction

Fei Liu; Shichao Xie; Minghua Luo; Zedong Chu; Junjun Hu; Xiaolong Wu; Mu Xu

arXiv:2512.01550 · cs.RO · March 16, 2026

Open Access

TL;DR

NavForesee introduces a unified vision-language model that combines high-level planning and environmental prediction to improve long-horizon embodied navigation guided by natural language instructions.

Contribution

The paper presents a single VLM that performs planning and world prediction simultaneously within one framework, with the aim of better navigation in unseen environments.

Findings

01 Achieves competitive performance on R2R-CE and RxR-CE benchmarks.

02 Effectively decomposes tasks and tracks progress using a unified model.

03 Demonstrates the benefit of combining language planning with environment prediction.

Abstract

Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning in unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single framework. Our approach empowers a single VLM to perform planning and predictive foresight concurrently. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting…
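The planning loop described in the abstract (decompose the instruction, track progress, formulate the next sub-goal) can be pictured with a small toy sketch. Everything here is an assumption made for illustration: `PlanState`, `decompose`, and the regex-based splitting are hypothetical stand-ins and not the authors' method, which uses a trained VLM conditioned on the instruction and past observations. The world-model prediction side is not sketched.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlanState:
    """Hypothetical planner state: ordered sub-instructions plus a progress counter."""
    sub_instructions: List[str]
    completed: int = 0  # how many sub-instructions are already finished

    def current_subgoal(self) -> Optional[str]:
        """Return the next unfinished sub-instruction, or None when all are done."""
        if self.completed < len(self.sub_instructions):
            return self.sub_instructions[self.completed]
        return None

    def advance(self) -> None:
        """Mark the current sub-goal as finished (no-op once the plan is complete)."""
        if self.completed < len(self.sub_instructions):
            self.completed += 1


def decompose(instruction: str) -> PlanState:
    """Toy stand-in for the learned decomposition: split on periods and on 'then'."""
    parts = re.split(r"\.|,?\s+then\s+", instruction)
    steps = [p.strip() for p in parts if p.strip()]
    return PlanState(steps)


state = decompose("Walk past the sofa, then turn left at the door. Stop by the table.")
first_subgoal = state.current_subgoal()
```

In this sketch the progress counter is advanced by the caller; in a learned system the model itself would have to judge from observations when a sub-goal has been reached.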

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Language, Metaphor, and Cognition