AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Peng Xu; Zhengnan Deng; Jiayan Deng; Zonghua Gu; Shaohua Wan

arXiv:2603.14363 · cs.CV · March 17, 2026

Open Access

TL;DR

AerialVLA introduces a minimalist, end-to-end vision-language-action framework for UAV navigation that eliminates reliance on dense guidance and object detectors, achieving state-of-the-art results and superior generalization.

Contribution

The paper presents a novel end-to-end UAV navigation model that uses a simplified perception strategy and fuzzy directional prompts, enhancing autonomy and transferability.

Findings

01 Achieves state-of-the-art performance on the TravelUAV benchmark.

02 Achieves a nearly threefold higher success rate in unseen environments.

03 Demonstrates robustness with a minimalist design and intrinsic control signals.

Abstract

Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on…

Taxonomy

Topics: Robotics and Sensor-Based Localization · Robotic Path Planning Algorithms · Multimodal Machine Learning Applications