TL;DR
AirNav is a large-scale UAV navigation dataset with natural instructions that enables realistic training and evaluation of UAV vision-and-language models. The accompanying AirVLN-R1 model achieves state-of-the-art performance and shows promising sim-to-real transfer.
Contribution
The paper introduces AirNav, a comprehensive UAV navigation dataset with natural instructions, and proposes AirVLN-R1, a model achieving state-of-the-art results with real-world transferability.
Findings
AirVLN-R1 achieves a 51.82% success rate on the test-unseen split.
The dataset includes 137K navigation samples with natural instructions.
Real-world UAV experiments suggest promising sim-to-real transfer.
Abstract
Existing UAV vision-and-language navigation (VLN) benchmarks rarely provide realistic aerial scenes, natural process-level instructions, and sufficient scale simultaneously, making it difficult to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose AirVLN-R1, trained via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving state-of-the-art performance with a 51.82%…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
