Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

Yifei Su; Dong An; Yuan Xu; Kehan Chen; Yan Huang

arXiv:2308.11561·cs.CV·December 15, 2023

TL;DR

This paper introduces TG-GAT, a graph-aware transformer framework for aerial navigation from dialog history that improves cross-modal grounding and won the AVDN Challenge at ICCV CLVL 2023.

Contribution

The paper proposes a graph-aware transformer that captures spatiotemporal dependency, an auxiliary visual grounding task that raises the agent's awareness of referred landmarks, and a hybrid augmentation strategy based on large language models that mitigates data scarcity.

Findings

01

Achieved absolute improvements of 2.2% in SPL and 3.0% in SR over the baseline.

02

Developed a hybrid augmentation strategy using large language models.

03

Won the AVDN Challenge at ICCV CLVL 2023.
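
For readers unfamiliar with the two metrics named in the findings, here is a minimal sketch of how SR (success rate) and SPL (success weighted by path length) are commonly computed in embodied-navigation benchmarks. It assumes the widely used definition in which each successful episode contributes the ratio of the shortest path length to the longer of the shortest and actual path lengths. The AVDN Challenge's own success criterion and evaluation script are not described on this page, so the `Episode` fields and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical container, not from the paper)."""
    success: bool                # whether the agent met the benchmark's success criterion
    shortest_path_length: float  # length of the reference (shortest) path
    agent_path_length: float     # length of the path the agent actually flew


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        longer = max(e.agent_path_length, e.shortest_path_length)
        # Assumption: a zero-length successful episode counts as fully efficient.
        total += e.shortest_path_length / longer if longer > 0 else 1.0
    return total / len(episodes)
```

With three episodes (a success along the shortest path, a success along a path twice as long, and a failure), `success_rate` gives 2/3 and `spl` gives 0.5, which shows how SPL penalizes detours that SR ignores.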

Abstract

This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge, with 2.2% and 3.0% absolute…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection