Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation
Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang

TL;DR
This paper introduces TG-GAT, a graph-aware transformer framework for aerial navigation from dialog history. It improves cross-modal grounding and won the AVDN Challenge at ICCV CLVL 2023.
Contribution
The paper proposes a graph-aware transformer for aerial navigation from dialog history, trained with an auxiliary visual grounding task and a hybrid data augmentation strategy based on large language models, to strengthen the drone agent's cross-modal grounding.
Findings
Achieved absolute improvements over the baseline of 2.2% on SPL (Success weighted by Path Length) and 3.0% on SR (Success Rate).
Developed a hybrid augmentation strategy using large language models.
Won the AVDN Challenge at ICCV CLVL 2023.
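For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how SR and SPL are commonly computed in navigation benchmarks. It is not the challenge's evaluation code: the function names are hypothetical, the per-episode success flag is taken as an input (the AVDN success criterion itself is not restated here), and the episode values in the usage lines are made up for illustration.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """SR: fraction of episodes judged successful."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """SPL: success weighted by (shortest path length / path length taken).

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length the agent actually travelled.
    Failed episodes contribute 0. The sum is averaged over all episodes.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if not successes:
        raise ValueError("need at least one episode")

    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest path length must be positive")
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Illustrative (made-up) episodes: one efficient success, one failure,
# one success that took twice the shortest path.
flags = [True, False, True]
shortest = [10.0, 8.0, 5.0]
taken = [10.0, 12.0, 10.0]

sr_value = success_rate(flags)
spl_value = spl(flags, shortest, taken)
print(f"SR={sr_value:.3f} SPL={spl_value:.3f}")
```

Because SPL discounts each success by path efficiency, it can never exceed SR on the same set of episodes. In the example, the third episode succeeds but contributes only 0.5 to the SPL sum.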
Abstract
This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge, with 2.2% and 3.0% absolute…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection
