Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

Xingyu Xia; Lekai Zhou; Yujie Tang; Xiaozhou Zhu; Hai Zhu; Wen Yao

arXiv:2604.07705·cs.RO·April 10, 2026

TL;DR

This survey reviews recent advances in aerial vision-language navigation, with emphasis on the integration of large language models. It analyzes method architectures and evaluation methods, and identifies key open challenges for autonomous UAV navigation.

Contribution

The survey provides a comprehensive taxonomy and critical analysis of Aerial VLN methods, datasets, and evaluation, and proposes future research directions.

Findings

01. Organized Aerial VLN methods into five architectural categories.

02. Identified gaps in datasets, simulation platforms, and evaluation metrics.

03. Highlighted key open problems such as long-horizon grounding and multi-UAV navigation.

Abstract

Aerial vision-and-language navigation (Aerial VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and autonomously navigate complex three-dimensional environments by grounding language in visual perception. This survey provides a critical and analytical review of the Aerial VLN field, with particular attention to the recent integration of large language models (LLMs) and vision-language models (VLMs). We first formally introduce the Aerial VLN problem and define two interaction paradigms, single-instruction and dialog-based, as foundational axes. We then organize the body of Aerial VLN methods into a taxonomy of five architectural categories: sequence-to-sequence and attention-based methods, end-to-end LLM/VLM methods, hierarchical methods, multi-agent methods, and dialog-based navigation methods. For each category, we systematically analyze…
