AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation
Jianli Sun, Bin Tian, Qiyao Zhang, Chengxiang Li, Zihan Song, Zhiyong Cui, Yisheng Lv, Yonglin Tian

TL;DR
This paper introduces AIR-VLA, a benchmark and dataset for vision-language-action modeling in aerial manipulation. It targets the challenges specific to UAV dynamics and multi-step tasks, and evaluates how well current models handle them.
Contribution
It presents the first VLA benchmark tailored for aerial manipulation, including a simulation environment, a multimodal dataset, and systematic evaluation of existing models.
Findings
Current VLA models show limited performance on aerial tasks.
The benchmark reveals specific challenges in UAV mobility and manipulator control.
AIR-VLA provides a foundation for future aerial robotics research.
Abstract
While Vision-Language-Action (VLA) models have achieved remarkable success in ground-based embodied intelligence, their application to Aerial Manipulation Systems (AMS) remains a largely unexplored frontier. The inherent characteristics of AMS, including floating-base dynamics, strong coupling between the UAV and the manipulator, and the multi-step, long-horizon nature of operational tasks, pose severe challenges to existing VLA paradigms designed for static or 2D mobile bases. To bridge this gap, we propose AIR-VLA, the first VLA benchmark specifically tailored for aerial manipulation. We construct a physics-based simulation environment and release a high-quality multimodal dataset comprising 3000 manually teleoperated demonstrations, covering base manipulation, object & spatial understanding, semantic reasoning, and long-horizon planning. Leveraging this platform, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robot Manipulation and Learning
