AirScape: An Aerial Generative World Model with Motion Controllability

Baining Zhao; Rongze Tang; Mingyuan Jia; Ziyou Wang; Fanghang Man; Xin Zhang; Yu Shang; Weichen Zhang; Wei Wu; Chen Gao; Xinlei Chen; Yong Li

arXiv:2507.08885·cs.RO·October 13, 2025

TL;DR

AirScape is a novel world model for aerial agents that predicts future observations from visual inputs and motion intentions, enabling controllable 3D spatial imagination in complex environments.

Contribution

We introduce the first aerial world model with motion controllability, trained on a new dataset, enabling agents to predict outcomes of their own 6-DoF movements.

Findings

01. Outperforms existing models in 3D spatial imagination

02. Achieves over 50% improvement in motion alignment metrics

03. Demonstrates effective control of aerial agents in diverse scenarios

Abstract

How to enable agents to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase schedule to train a foundation model--initially devoid of embodied spatial knowledge--into a world model that is controllable by motion intentions and…
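The input/output contract described in the abstract (current visual inputs plus a motion intention in, a future observation sequence out) can be sketched as a small Python interface. This is a minimal illustration under stated assumptions, not the authors' code: the names `VideoIntentionPair`, `WorldModel`, `RepeatLastFrame`, and `split_pair` are hypothetical, the intention is assumed to be stored as a text string, and frames are treated as opaque objects.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol, Sequence, Tuple

Frame = Any  # stand-in for an image; in practice this would be an H x W x 3 array


@dataclass(frozen=True)
class VideoIntentionPair:
    """One sample: a first-person-view clip and its annotated motion intention.

    Hypothetical layout: the field names and the text form of `intention`
    are assumptions made for illustration.
    """
    frames: Tuple[Frame, ...]
    intention: str


class WorldModel(Protocol):
    """Interface implied by the abstract: (observations, intention) -> future observations."""

    def predict(self, context: Sequence[Frame], intention: str, horizon: int) -> List[Frame]:
        ...


class RepeatLastFrame:
    """Placeholder baseline, NOT AirScape: ignores the intention and repeats the last frame."""

    def predict(self, context: Sequence[Frame], intention: str, horizon: int) -> List[Frame]:
        if not context:
            raise ValueError("context must contain at least one frame")
        if horizon < 0:
            raise ValueError("horizon must be non-negative")
        return [context[-1]] * horizon


def split_pair(
    pair: VideoIntentionPair, n_context: int
) -> Tuple[Tuple[Frame, ...], Tuple[Frame, ...]]:
    """Split a clip into observed context frames and the future frames to predict."""
    if not 0 < n_context <= len(pair.frames):
        raise ValueError("n_context must be between 1 and the clip length")
    return pair.frames[:n_context], pair.frames[n_context:]


if __name__ == "__main__":
    # Strings stand in for image frames to keep the sketch dependency-free.
    pair = VideoIntentionPair(frames=("f0", "f1", "f2", "f3"), intention="ascend and yaw left")
    context, future = split_pair(pair, n_context=2)
    model: WorldModel = RepeatLastFrame()
    predicted = model.predict(context, pair.intention, horizon=len(future))
    print(len(predicted) == len(future))
```

A baseline that ignores the intention is useful mainly as a sanity check: a model that is actually controllable by motion intentions should produce different futures for different intentions given the same context, which this placeholder cannot do.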

Code & Models

Models

🤗 EmbodiedCity/Airscape (model)
