Aerial Vision-and-Dialog Navigation
Yue Fan, Winson Chen, Tongzhou Jiang, Chun Zhou, Yi Zhang, Xin Eric Wang

TL;DR
This paper introduces Aerial Vision-and-Dialog Navigation (AVDN), a task in which a drone is navigated through natural language conversation, together with a new dataset and a transformer-based model that predicts both navigation waypoints and human attention.
Contribution
It presents a new dataset for aerial navigation via dialogue and a novel transformer model that incorporates human attention to improve navigation accuracy.
Findings
The AVDN dataset contains over 3,000 navigation trajectories with dialogs.
The Human Attention Aided Transformer (HAA-Transformer) learns to predict both navigation waypoints and human attention.
Training the model with human attention prediction improves its navigation performance.
Abstract
The ability to converse with humans and follow natural language commands is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It can relieve people's burden of holding a controller all the time, allow multitasking, and make drone control more accessible for people with disabilities or with their hands occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), to navigate a drone via natural language conversation. We build a drone simulator with a continuous photorealistic environment and collect a new AVDN dataset of over 3k recorded navigation trajectories with asynchronous human-human dialogs between commanders and followers. The commander provides initial navigation instruction and further guidance by request, while the follower navigates the drone in the simulator and asks questions when needed. During data collection, followers' attention on the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Human Pose and Action Recognition
