Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids

Xinyi Wang; Xuan Cui; Danxu Li; Fang Liu; Licheng Jiao

arXiv:2308.14064 · cs.CV · August 29, 2023

TL;DR

This paper introduces a multi-model fusion approach using human attention aids for aerial navigation, enabling drones to follow natural language commands more effectively by predicting navigation points and human attention.

Contribution

It proposes a novel fusion training method combining HAA-Transformer and HAA-LSTM models for aerial navigation with human attention guidance.

Findings

01 Achieves a high success rate (SR) and a high success weighted by path length (SPL).

02 Shows a 7% improvement in goal progress (GP) over the baseline.

03 Effectively predicts navigation points and human attention.
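The SR, SPL, and GP figures above are standard navigation metrics. The following is a minimal sketch of how they are commonly computed, not the paper's evaluation code. The `Episode` record and its field names are hypothetical, and the success test used by the AVDN benchmark itself is not reproduced here; each episode is assumed to arrive already labelled as a success or failure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative only."""
    success: bool         # whether the episode was judged successful
    shortest_path: float  # length of the reference (shortest) path
    agent_path: float     # length of the path the agent actually took
    start_dist: float     # distance to the goal at the start
    final_dist: float     # distance to the goal when the agent stopped


def _check(episodes: Sequence[Episode]) -> None:
    if not episodes:
        raise ValueError("at least one episode is required")


def success_rate(episodes: Sequence[Episode]) -> float:
    """SR: fraction of episodes judged successful."""
    _check(episodes)
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """SPL: success weighted by (shortest path / max(agent path, shortest path))."""
    _check(episodes)
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


def goal_progress(episodes: Sequence[Episode]) -> float:
    """GP: mean reduction in distance to the goal over an episode."""
    _check(episodes)
    return sum(e.start_dist - e.final_dist for e in episodes) / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(True, 10.0, 20.0, 10.0, 0.0),   # success, but path twice as long
        Episode(False, 10.0, 5.0, 10.0, 8.0),   # failure, small progress
    ]
    print(success_rate(demo), spl(demo), goal_progress(demo))
```

SPL penalises successful episodes whose path is longer than the reference path, so it can never exceed SR. GP still gives partial credit to failed episodes that moved closer to the goal.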

Abstract

Drones are widely used in many areas of daily life. Controlling a drone through conversation relieves people of the burden of holding a controller at all times and makes drones easier to use for people with disabilities or occupied hands. However, controlling aerial robots is more complicated than controlling other robots because of factors such as hard-to-control height. Therefore, it is crucial to develop an intelligent UAV that can talk to humans and follow natural language commands. In this report, we present an aerial navigation task for the 2023 ICCV Conversation History. Based on the AVDN dataset, which contains more than 3k recorded navigation trajectories and asynchronous human-robot conversations, we propose an effective fusion training method for the Human Attention Aided Transformer (HAA-Transformer) and Human Attention Aided LSTM (HAA-LSTM) models, which achieves the prediction of the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Sigmoid Activation · Byte Pair Encoding · Adam