AirScape: An Aerial Generative World Model with Motion Controllability
Baining Zhao, Rongze Tang, Mingyuan Jia, Ziyou Wang, Fanghang Man, Xin Zhang, Yu Shang, Weichen Zhang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

TL;DR
AirScape is a novel world model for aerial agents that predicts future observations from visual inputs and motion intentions, enabling controllable 3D spatial imagination in complex environments.
Contribution
We introduce the first aerial world model with motion controllability, trained on a new dataset, enabling agents to predict outcomes of their own 6-DoF movements.
Findings
Outperforms existing models in 3D spatial imagination
Achieves over 50% improvement in motion alignment metrics
Demonstrates effective control of aerial agents in diverse scenarios
Abstract
How to enable agents to predict the outcomes of their own motion intentions in three-dimensional space has been a fundamental problem in embodied intelligence. To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. AirScape predicts future observation sequences based on current visual inputs and motion intentions. Specifically, we construct a dataset for aerial world model training and testing, which consists of 11k video-intention pairs. This dataset includes first-person-view videos capturing diverse drone actions across a wide range of scenarios, with over 1,000 hours spent annotating the corresponding motion intentions. Then we develop a two-phase schedule to train a foundation model--initially devoid of embodied spatial knowledge--into a world model that is controllable by motion intentions and…
