History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation

Xichen Ding; Jianzhe Gao; Cong Pan; Wenguan Wang; Jie Qin

arXiv:2512.14222 · cs.CV · December 18, 2025

TL;DR

This paper introduces HETT, a two-stage transformer framework that improves aerial navigation by combining global reasoning and local scene understanding through a coarse-to-fine approach and structured spatial memory.

Contribution

The work presents a novel two-stage transformer model with a historical grid map for enhanced scene awareness in UAV navigation, along with refined dataset annotations.

Findings

01. HETT outperforms existing methods on the CityNav dataset.

02. The historical grid map improves spatial memory and scene understanding.

03. Ablation studies confirm the effectiveness of each component.

Abstract

Aerial Vision-and-Language Navigation (AVLN) requires Unmanned Aerial Vehicle (UAV) agents to localize targets in large-scale urban environments based on linguistic instructions. While successful navigation demands both global environmental reasoning and local scene comprehension, existing UAV agents typically adopt mono-granularity frameworks that struggle to balance these two aspects. To address this limitation, this work proposes a History-Enhanced Two-Stage Transformer (HETT) framework, which integrates the two aspects through a coarse-to-fine navigation pipeline. Specifically, HETT first predicts coarse-grained target positions by fusing spatial landmarks and historical context, then refines actions via fine-grained visual analysis. In addition, a historical grid map is designed to dynamically aggregate visual features into a structured spatial memory, enhancing comprehensive scene…
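
To make the idea of a structured spatial memory concrete, here is a minimal sketch of a grid map that aggregates per-observation feature vectors by location. It is an illustration only, not the authors' implementation: the abstract does not say how HETT aggregates features, so the running mean used here is an assumption, and the class name `HistoricalGridMap`, its methods, and the `cell_size` and `feature_dim` parameters are all hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


class HistoricalGridMap:
    """Toy spatial memory: one running-mean feature vector per visited cell.

    Hypothetical illustration; the running-mean aggregation is an
    assumption and is not taken from the paper.
    """

    def __init__(self, cell_size: float, feature_dim: int) -> None:
        if cell_size <= 0:
            raise ValueError("cell_size must be positive")
        self.cell_size = cell_size
        self.feature_dim = feature_dim
        self._sums: Dict[Cell, List[float]] = {}
        self._counts: Dict[Cell, int] = {}

    def cell_of(self, x: float, y: float) -> Cell:
        # Floor division maps a continuous position to a discrete cell index.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def update(self, x: float, y: float, feature: List[float]) -> None:
        # Accumulate the feature observed at (x, y) into its grid cell.
        if len(feature) != self.feature_dim:
            raise ValueError("feature has the wrong dimension")
        cell = self.cell_of(x, y)
        sums = self._sums.setdefault(cell, [0.0] * self.feature_dim)
        for i, value in enumerate(feature):
            sums[i] += value
        self._counts[cell] = self._counts.get(cell, 0) + 1

    def read(self, cell: Cell) -> List[float]:
        # Unvisited cells return a zero vector; visited cells return the mean.
        if cell not in self._counts:
            return [0.0] * self.feature_dim
        count = self._counts[cell]
        return [s / count for s in self._sums[cell]]


grid = HistoricalGridMap(cell_size=10.0, feature_dim=2)
grid.update(3.0, 4.0, [1.0, 0.0])   # falls in cell (0, 0)
grid.update(7.0, 9.0, [3.0, 2.0])   # also cell (0, 0)
grid.update(12.0, 4.0, [5.0, 5.0])  # falls in cell (1, 0)
print(grid.read((0, 0)))
```

In this sketch, revisiting a location refines the stored feature instead of overwriting it, which is one simple way a memory could stay structured by position rather than by time step. A learned aggregation could replace the running mean without changing the interface.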

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications