AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation

Jianli Sun; Bin Tian; Qiyao Zhang; Chengxiang Li; Zihan Song; Zhiyong Cui; Yisheng Lv; Yonglin Tian

arXiv:2601.21602·cs.RO·February 4, 2026

TL;DR

This paper introduces AIR-VLA, a comprehensive benchmark and dataset for vision-language-action modeling in aerial manipulation, addressing unique challenges of UAV dynamics and multi-step tasks, and evaluating current models' capabilities.

Contribution

It presents the first VLA benchmark tailored for aerial manipulation, including a simulation environment, a multimodal dataset, and systematic evaluation of existing models.

Findings

01. Current VLA models show limited performance on aerial tasks.

02. The benchmark reveals specific challenges in UAV mobility and manipulator control.

03. AIR-VLA provides a foundation for future aerial robotics research.

Abstract

While Vision-Language-Action (VLA) models have achieved remarkable success in ground-based embodied intelligence, their application to Aerial Manipulation Systems (AMS) remains a largely unexplored frontier. The inherent characteristics of AMS, including floating-base dynamics, strong coupling between the UAV and the manipulator, and the multi-step, long-horizon nature of operational tasks, pose severe challenges to existing VLA paradigms designed for static or 2D mobile bases. To bridge this gap, we propose AIR-VLA, the first VLA benchmark specifically tailored for aerial manipulation. We construct a physics-based simulation environment and release a high-quality multimodal dataset comprising 3000 manually teleoperated demonstrations, covering base manipulation, object & spatial understanding, semantic reasoning, and long-horizon planning. Leveraging this platform, we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robot Manipulation and Learning