Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Huilin Xu; Zhuoyang Liu; Yixiang Luomei; Feng Xu

arXiv:2512.08639 · cs.CV · April 16, 2026

TL;DR

This paper introduces a unified framework for aerial vision-language navigation that relies solely on monocular RGB observations, enabling UAVs to interpret natural language instructions and navigate complex environments efficiently.

Contribution

The authors propose a novel monocular RGB-only aerial VLN model with prompt-guided multi-task learning, keyframe selection, and action merging strategies, improving practicality and performance.
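
The summary does not say how action merging is implemented. As a rough illustration only, one plausible reading is that runs of identical consecutive actions are collapsed into a single (action, count) unit, so the model predicts fewer tokens per trajectory. The sketch below assumes that reading; the action names and the functions `merge_actions` and `expand_actions` are hypothetical and are not taken from the paper or its repository.

```python
from itertools import groupby
from typing import List, Tuple


def merge_actions(actions: List[str]) -> List[Tuple[str, int]]:
    """Collapse runs of identical consecutive actions into (action, count) pairs.

    Assumption: this run-length scheme is an illustrative stand-in,
    not the authors' confirmed merging strategy.
    """
    return [(action, sum(1 for _ in run)) for action, run in groupby(actions)]


def expand_actions(merged: List[Tuple[str, int]]) -> List[str]:
    """Invert merge_actions: turn (action, count) pairs back into unit steps."""
    return [action for action, count in merged for _ in range(count)]


# Hypothetical low-level action sequence for a UAV.
steps = ["forward", "forward", "forward", "turn_left", "forward", "forward"]
merged = merge_actions(steps)
print(merged)  # [('forward', 3), ('turn_left', 1), ('forward', 2)]
```

Under this assumption the merged sequence is lossless: `expand_actions(merged)` recovers the original step list, so a controller could still execute unit actions while the model reasons over the shorter sequence.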

Findings

01

Achieves strong results on AerialVLN and OpenFly benchmarks.

02

Outperforms existing RGB-only baselines significantly.

03

Narrows the gap with panoramic RGB-D methods.

Abstract

Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through…

Code & Models

Repositories

return-sleep/AeroAct (GitHub)
