Policy-Guided World Model Planning for Language-Conditioned Visual Navigation

Amirhosein Chahe; Lifeng Zhou

arXiv:2603.25981 · cs.RO · March 30, 2026

TL;DR

PiJEPA is a two-stage framework that uses a finetuned navigation policy to warm-start planning in a latent world model, improving goal-reaching accuracy over standalone policies in real-world, instruction-conditioned visual navigation.

Contribution

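The paper introduces PiJEPA, which integrates a finetuned Octo-based policy with latent world model planning: the policy's action distribution initializes the planner, so navigation from natural language instructions converges faster and reaches goals more accurately.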

Findings

01. PiJEPA outperforms standalone policies in goal-reaching accuracy.

02. Warm-starting the planner from the policy-derived action distribution improves the convergence of world model planning.

03. The choice of frozen vision encoder (DINOv2 or V-JEPA-2) affects navigation performance.

Abstract

Navigating to a visually specified goal given natural language instructions remains a fundamental challenge in embodied AI. Existing approaches either rely on reactive policies that struggle with long-horizon planning, or employ world models that suffer from poor action initialization in high-dimensional spaces. We present PiJEPA, a two-stage framework that combines the strengths of learned navigation policies with latent world model planning for instruction-conditioned visual navigation. In the first stage, we finetune an Octo-based generalist policy, augmented with a frozen pretrained vision encoder (DINOv2 or V-JEPA-2), on the CAST navigation dataset to produce an informed action distribution conditioned on the current observation and language instruction. In the second stage, we use this policy-derived distribution to warm-start Model Predictive Path Integral (MPPI) planning over a…
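
The warm-start idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract is truncated here, so the additive dynamics in `rollout_cost`, the squared-distance cost, the `mppi` function, and every parameter value below are assumptions chosen only to show how a policy-derived mean action sequence can seed MPPI sampling.

```python
import math
import random

def rollout_cost(z0, actions, goal):
    """Toy stand-in for a latent world model: z_{t+1} = z_t + a_t (assumed).
    Returns the squared distance between the final latent and the goal latent."""
    z = list(z0)
    for a in actions:
        z = [zi + ai for zi, ai in zip(z, a)]
    return sum((zi - gi) ** 2 for zi, gi in zip(z, goal))

def mppi(z0, goal, init_mean, std, n_samples=256, n_iters=5,
         temperature=0.1, seed=0):
    """Minimal MPPI loop: sample action sequences around the current mean,
    score them with the world model, and re-average with softmax weights."""
    rng = random.Random(seed)
    mean = [list(a) for a in init_mean]  # copy so the caller's prior is untouched
    for _ in range(n_iters):
        # Sample candidate action sequences from N(mean, std^2).
        samples = [[[rng.gauss(m, std) for m in a] for a in mean]
                   for _ in range(n_samples)]
        costs = [rollout_cost(z0, s, goal) for s in samples]
        # Subtract the minimum cost for numerical stability before exponentiating.
        c_min = min(costs)
        weights = [math.exp(-(c - c_min) / temperature) for c in costs]
        total = sum(weights)
        # New mean is the weight-averaged action at each step and dimension.
        mean = [[sum(w * s[t][d] for w, s in zip(weights, samples)) / total
                 for d in range(len(mean[t]))]
                for t in range(len(mean))]
    return mean

z0, goal = [0.0, 0.0], [1.0, 0.5]
# Hypothetical policy output: a mean action sequence that points roughly at the goal.
policy_mean = [[0.2, 0.1]] * 4
plan = mppi(z0, goal, policy_mean, std=0.1)
print("cost of policy prior:", rollout_cost(z0, policy_mean, goal))
print("cost after MPPI:     ", rollout_cost(z0, plan, goal))
```

In this toy setting, seeding `init_mean` with zeros instead of `policy_mean` leaves the sampler far from the goal after the same number of iterations, which is the intuition behind using the policy as an initializer rather than planning from scratch.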
