AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control
Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan

TL;DR
AerialVLA introduces a minimalist, end-to-end vision-language-action framework for UAV navigation that eliminates reliance on dense guidance and object detectors, achieving state-of-the-art results and superior generalization.
Contribution
The paper presents a novel end-to-end UAV navigation model that uses a simplified perception strategy and fuzzy directional prompts, enhancing autonomy and transferability.
Findings
Achieves state-of-the-art performance on the TravelUAV benchmark.
Reports a nearly threefold higher success rate in unseen environments.
Demonstrates robustness with minimalist design and intrinsic control signals.
Abstract
Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework mapping raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. To reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, completely eliminating the dependency on…
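The abstract does not spell out how the fuzzy directional prompt is built, so the following is only a minimal sketch of one way such a prompt could be derived from onboard pose estimates. The eight-way vocabulary, the function names `fuzzy_direction` and `build_prompt`, the coordinate convention, and the prompt wording are all assumptions made for illustration, not the authors' implementation.

```python
import math

# Hypothetical 8-way vocabulary; the paper's actual wording is not given in this summary.
DIRECTIONS = [
    "ahead", "ahead-right", "right", "behind-right",
    "behind", "behind-left", "left", "ahead-left",
]


def fuzzy_direction(position, heading_deg, target):
    """Quantize the bearing to a target, relative to the current heading, into a coarse label.

    Assumed conventions: position and target are (x, y) with +x east and +y north;
    heading_deg is a compass heading (0 = north, clockwise positive).
    """
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    # atan2(dx, dy) gives the compass bearing (clockwise from north) to the target.
    bearing = math.degrees(math.atan2(dx, dy))
    relative = (bearing - heading_deg) % 360.0
    # Each label covers a 45-degree sector centred on its nominal direction.
    index = int((relative + 22.5) // 45) % 8
    return DIRECTIONS[index]


def build_prompt(instruction, direction):
    """Append the coarse direction hint to a language instruction (illustrative wording)."""
    return f"{instruction} The target is roughly {direction}."
```

For example, `fuzzy_direction((0, 0), 90, (0, 10))` describes a vehicle facing east with a target due north, which falls in the "left" sector. The point of the coarse quantization is that the model receives only a rough hint rather than a dense waypoint sequence.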
Taxonomy
Topics: Robotics and Sensor-Based Localization · Robotic Path Planning Algorithms · Multimodal Machine Learning Applications
