Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

TL;DR
Aero-World converts pretrained image-to-video diffusion models into controllable aerial video generators by integrating inertial control signals, enabling more accurate and stable drone motion simulation.
Contribution
The paper introduces Aero-World, a novel method for action-conditioned aerial video generation using a frozen physics probe for supervision during fine-tuning.
Findings
Aero-World improves the Action Alignment Score from 57.7 to 63.6.
Aero-World achieves lower FVD and higher SSIM compared to AirScape.
Aero-World demonstrates better physical and motion consistency in generated videos.
Abstract
Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an…
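The abstract's remark that small ego-motion errors can produce large trajectory drift can be illustrated with a short numerical sketch. This is not taken from the paper: it assumes a single axis, a constant acceleration bias, and simple Euler integration, and the function name `position_drift` and the parameter values are hypothetical. Under those assumptions the position error grows roughly as 0.5 * bias * t^2, so doubling the flight duration roughly quadruples the drift.

```python
def position_drift(bias: float, dt: float, steps: int) -> float:
    """Position error (m) from double-integrating a constant acceleration bias.

    Illustrative assumption: one axis, constant bias (m/s^2), fixed time
    step dt (s), semi-implicit Euler integration. Not the paper's method.
    """
    velocity_error = 0.0
    position_error = 0.0
    for _ in range(steps):
        velocity_error += bias * dt            # integrate acceleration error
        position_error += velocity_error * dt  # integrate velocity error
    return position_error


if __name__ == "__main__":
    bias = 0.05  # hypothetical acceleration error in m/s^2
    dt = 0.01    # hypothetical 100 Hz inertial sampling
    for seconds in (5, 10, 20):
        steps = int(seconds / dt)
        drift = position_drift(bias, dt, steps)
        print(f"{seconds:>2d} s of flight: {drift:.2f} m of position drift")
```

The quadratic growth is why an action-conditioned generator that is only slightly off in its per-frame motion can still end up far from the intended trajectory over a long clip.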
