Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng, Songsheng Cheng, Ruilong Ren, Hanzhong Guo, Shuai Yuan, Xiangyue Wang, Kangli Wang, and Ji Pei

TL;DR
This survey reviews progress and challenges in vision-and-language navigation for UAVs, highlighting technological evolution, key resources, and future research directions in complex 3D environments.
Contribution
It provides a structured taxonomy of UAV-VLN methods, analyzes current challenges, and proposes a comprehensive research roadmap for future advancements.
Findings
Evolution from modular to foundation model-based approaches
Identification of key challenges like sim-to-real gap and perception robustness
Proposal of future research directions including multi-agent coordination
Abstract
Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics)…
