Language-Conditioned World Modeling for Visual Navigation

Yifei Dong; Fengyi Wu; Yilong Dai; Lingdong Kong; Guangyu Chen; Xu Zhu; Qiyu Hu; Tianyu Wang; Johnalbert Garnica; Feng Liu; Siyu Huang; Qi Dai; Zhi-Qi Cheng

arXiv:2603.26741 · cs.CV · March 31, 2026

TL;DR

This paper introduces a new benchmark and models for language-conditioned visual navigation, enabling agents to follow natural language instructions without goal images by predicting future states and actions.

Contribution

It presents the LCVN dataset and two novel model families that integrate language grounding, world modeling, and action prediction for visual navigation tasks.

Findings

01. The diffusion-based world model offers temporally coherent rollouts.

02. The autoregressive architecture generalizes better to unseen environments.

03. Joint study of language grounding, imagination, and policy learning is valuable.

Abstract

We study language-conditioned visual navigation (LCVN), in which an embodied agent is asked to follow a natural language instruction based only on an initial egocentric observation. Without access to goal images, the agent must rely on language to shape its perception and continuous control, making the grounding problem particularly challenging. We formulate this problem as open-loop trajectory prediction conditioned on linguistic instructions and introduce the LCVN Dataset, a benchmark of 39,016 trajectories and 117,048 human-verified instructions that supports reproducible research across a range of environments and instruction styles. Using this dataset, we develop LCVN frameworks that link language grounding, future-state prediction, and action generation through two complementary model families. The first family combines LCVN-WM, a diffusion-based world model, with LCVN-AC, an…
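
The open-loop formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `open_loop_rollout`, `toy_policy`, and `toy_world_model`, the 2-D state, and the keyword-matching policy are all hypothetical stand-ins chosen for illustration. The only point it shows is that, after the initial observation, every later state comes from the model's own prediction rather than from a new camera frame.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real system would use image latents and learned networks.
State = Tuple[float, float]    # toy 2-D position standing in for an encoded observation
Action = Tuple[float, float]   # toy continuous control (dx, dy)


def open_loop_rollout(
    initial_state: State,
    instruction: str,
    policy: Callable[[State, str], Action],
    world_model: Callable[[State, Action, str], State],
    horizon: int,
) -> Tuple[List[State], List[Action]]:
    """Predict a trajectory from one observation and one instruction.

    Open-loop: no new observation is read inside the loop, so each step
    is conditioned on the previously *predicted* state.
    """
    states: List[State] = [initial_state]
    actions: List[Action] = []
    state = initial_state
    for _ in range(horizon):
        action = policy(state, instruction)               # language-conditioned action
        state = world_model(state, action, instruction)   # imagined next state
        actions.append(action)
        states.append(state)
    return states, actions


def toy_policy(state: State, instruction: str) -> Action:
    # Assumption: a trivial keyword rule in place of learned language grounding.
    return (1.0, 0.0) if "forward" in instruction else (0.0, 1.0)


def toy_world_model(state: State, action: Action, instruction: str) -> State:
    # Assumption: deterministic additive dynamics in place of a learned model.
    return (state[0] + action[0], state[1] + action[1])


states, actions = open_loop_rollout(
    (0.0, 0.0), "move forward", toy_policy, toy_world_model, horizon=3
)
print(states[-1])  # final imagined state after three steps
```

Because the loop never re-observes the environment, prediction errors in a learned world model would accumulate over the horizon, which is one reason temporal coherence of rollouts matters in this setting.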

Code & Models

Repositories

F1y1113/LCVN (GitHub)
