TL;DR
This paper introduces a unified framework for aerial vision-language navigation that relies solely on monocular RGB observations, enabling UAVs to interpret natural language instructions and navigate complex environments efficiently.
Contribution
The authors propose a monocular RGB-only aerial VLN model that combines prompt-guided multi-task learning, keyframe selection, and action merging, removing the need for the panoramic, depth, or odometry inputs that raise system cost on lightweight UAVs.
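To give a feel for what an action merging step could look like, here is a minimal sketch. It assumes that merging means collapsing runs of identical consecutive actions into a single (action, count) pair, which shortens the sequence a next-token model has to predict. This summary does not specify the paper's actual merging rule, so the function names `merge_actions` and `expand_actions` and the action labels are hypothetical.

```python
from itertools import groupby


def merge_actions(actions):
    """Collapse runs of identical consecutive actions into (action, count) pairs.

    Assumption: this run-length scheme is illustrative only; the paper's
    actual action merging strategy may differ.
    """
    return [(action, sum(1 for _ in run)) for action, run in groupby(actions)]


def expand_actions(merged):
    """Invert merge_actions, restoring the original step-by-step sequence."""
    return [action for action, count in merged for _ in range(count)]


if __name__ == "__main__":
    # Hypothetical action labels for a UAV trajectory.
    trajectory = ["forward", "forward", "forward", "turn_left", "forward", "forward"]
    merged = merge_actions(trajectory)
    print(merged)
    print(expand_actions(merged) == trajectory)
```

Under this assumption the merged sequence is lossless: expanding it recovers the original per-step actions, so the shorter sequence can serve as a training target without discarding information.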
Findings
Achieves strong results on AerialVLN and OpenFly benchmarks.
Outperforms existing RGB-only baselines significantly.
Narrows the gap with panoramic RGB-D methods.
Abstract
Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through…
