Policy-Guided World Model Planning for Language-Conditioned Visual Navigation
Amirhosein Chahe, Lifeng Zhou

TL;DR
PiJEPA is a two-stage framework combining learned policies and latent world model planning, significantly improving instruction-conditioned visual navigation in real-world tasks.
Contribution
The paper introduces PiJEPA, integrating a finetuned policy with a latent world model for faster, more accurate navigation based on natural language instructions.
Findings
PiJEPA outperforms standalone policies in goal-reaching accuracy.
Warm-starting MPPI planning from the policy-derived action distribution improves world model planning convergence.
The choice of frozen vision encoder (DINOv2 or V-JEPA-2) affects performance.
Abstract
Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a…
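To make the second stage concrete, here is a minimal sketch of warm-started MPPI, assuming a toy 2D setting. It is not the authors' implementation: the abstract is truncated before it describes the planning cost, so the distance-to-goal cost, the additive `toy_step` dynamics (a stand-in for the learned latent world model), and all names and parameter values below are hypothetical. The only idea taken from the abstract is that the sampling distribution is initialized from a policy-derived mean rather than from an uninformed prior.

```python
import math
import random


def toy_step(state, action):
    # Hypothetical stand-in for a learned latent world model:
    # the next state is simply the current state plus the action.
    return tuple(s + a for s, a in zip(state, action))


def rollout_cost(start, actions, goal, step_fn):
    # Roll the dynamics forward and score the final state by its
    # Euclidean distance to the goal (an assumed cost, not the paper's).
    state = start
    for action in actions:
        state = step_fn(state, action)
    return math.dist(state, goal)


def mppi_plan(start, goal, step_fn, init_mean, init_std,
              n_samples=64, n_iters=5, temperature=0.1, rng=None):
    # init_mean is the warm start: one action per planning step,
    # e.g. the mean of a policy's predicted action distribution.
    rng = rng or random.Random(0)
    mean = [list(a) for a in init_mean]
    horizon, dim = len(mean), len(mean[0])
    for _ in range(n_iters):
        # Sample action sequences around the current mean.
        samples = [
            [[rng.gauss(mean[t][d], init_std) for d in range(dim)]
             for t in range(horizon)]
            for _ in range(n_samples)
        ]
        costs = [rollout_cost(start, s, goal, step_fn) for s in samples]
        # Exponentially weight samples by cost (lower cost, higher weight).
        min_cost = min(costs)
        weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
        total = sum(weights)
        # The new mean is the weighted average of the sampled sequences.
        mean = [
            [sum(w * s[t][d] for w, s in zip(weights, samples)) / total
             for d in range(dim)]
            for t in range(horizon)
        ]
    return mean, rollout_cost(start, mean, goal, step_fn)


# Hypothetical policy output: four steps heading roughly toward the goal.
policy_mean = [(0.25, 0.0)] * 4
plan, cost = mppi_plan((0.0, 0.0), (1.0, 0.0), toy_step, policy_mean, init_std=0.05)
print(len(plan), cost)
```

In this sketch the warm start enters only through `init_mean` and `init_std`; replacing `policy_mean` with all-zero actions and a larger `init_std` gives the uninformed baseline that the warm start is meant to improve on.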
