AerialVLN: Vision-and-Language Navigation for UAVs
Shubo Liu, Hongsheng Zhang, Yuankai Qi, Peng Wang, Yaning Zhang, Qi Wu

TL;DR
AerialVLN introduces a new UAV-based vision-and-language navigation task in outdoor environments, supported by a realistic 3D simulator, highlighting the complexity of aerial navigation and the gap between current models and human performance.
Contribution
The paper proposes AerialVLN, a novel UAV-based VLN task with a realistic 3D simulator, and extends baseline models to address aerial navigation challenges.
Findings
Baseline models lag behind human performance.
AerialVLN presents a more complex navigation environment.
The dataset and simulator facilitate future research.
Abstract
Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both the computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to operate in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tours, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and targets outdoor environments. We develop a 3D simulator rendered with near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
