WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

Baining Zhao; Jiacheng Xu; Weicheng Feng; Xin Zhang; Zhaolu Wang; Haoyang Wang; Shilong Ji; Ziyou Wang; Jianjie Fang; Zhiheng Zheng; Weichen Zhang; Yu Shang; Wei Wu; Chen Gao; Xinlei Chen; Yong Li

arXiv:2605.15964·cs.RO·May 18, 2026

TL;DR

WorldVLN introduces an autoregressive world action model for aerial vision-language navigation, enabling agents to predict world states and improve navigation performance in 3D environments.

Contribution

WorldVLN is the first work to formulate aerial VLN as a prediction-driven world-action problem, pairing an autoregressive world action model with a two-stage training framework.

Findings

01. Outperforms existing baselines, with success-rate gains of 12% or more.

02. Transfers zero-shot to deployment on a real drone.

03. Demonstrates effectiveness on both outdoor and indoor benchmarks.

Abstract

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training…
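The closed loop described in the abstract (predict a short-horizon transition, decode it into a waypoint, execute, then feed the new observation back into the context) can be illustrated with a toy sketch. This is not the authors' implementation: `ClosedLoopAgent`, `ToyEnv`, `toy_predict`, `toy_decode`, and `GOAL` are all hypothetical names, the "latent" is just a 3D position, and the fixed goal stands in for the language instruction.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Vec = Tuple[float, ...]  # toy stand-in for both latents and waypoints


@dataclass
class ClosedLoopAgent:
    """Hypothetical closed-loop world-action agent (illustration only)."""
    encode: Callable[[Vec], Vec]              # observation -> latent
    predict_next: Callable[[List[Vec]], Vec]  # context -> predicted next latent
    decode_action: Callable[[Vec, Vec], Vec]  # (current, predicted) -> waypoint
    context: List[Vec] = field(default_factory=list)

    def step(self, observation: Vec) -> Vec:
        # Encode the newest observation back into the running context.
        self.context.append(self.encode(observation))
        # Predict a short-horizon transition from the whole context.
        predicted = self.predict_next(self.context)
        # Decode the predicted transition into an executable waypoint.
        return self.decode_action(self.context[-1], predicted)


class ToyEnv:
    """Assumed environment: the observation is simply the drone position."""
    def __init__(self, start: Vec) -> None:
        self.position = list(start)

    def execute(self, waypoint: Vec) -> Vec:
        # Apply the waypoint as a relative displacement.
        self.position = [p + d for p, d in zip(self.position, waypoint)]
        return tuple(self.position)


GOAL = (4.0, 0.0, 2.0)  # stands in for the language instruction


def toy_predict(context: List[Vec]) -> Vec:
    # Toy "world model": the next state moves halfway toward the goal.
    current = context[-1]
    return tuple(c + 0.5 * (g - c) for c, g in zip(current, GOAL))


def toy_decode(current: Vec, predicted: Vec) -> Vec:
    # Toy action head: the waypoint is the predicted displacement.
    return tuple(p - c for p, c in zip(predicted, current))


agent = ClosedLoopAgent(encode=tuple, predict_next=toy_predict,
                        decode_action=toy_decode)
env = ToyEnv((0.0, 0.0, 0.0))
observation = tuple(env.position)
waypoints = []
for _ in range(3):  # three action segments
    waypoint = agent.step(observation)
    waypoints.append(waypoint)
    observation = env.execute(waypoint)  # closed loop: observe after acting
```

The point of the sketch is the control flow rather than the models: prediction happens over a growing context, and each executed segment contributes a real observation instead of a generated one, which is what distinguishes this loop from generating a full clip up front.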

Code & Models

Repositories

https://embodiedcity.github.io/WorldVLN
github

Models

EmbodiedCity/WorldVLN (Hugging Face model)
