HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation

Chengjie Fan; Cong Pan; Zijian Liu; Ningzhong Liu; Jie Qin

arXiv:2604.08883 · cs.RO · April 13, 2026

TL;DR

HTNav is a hybrid learning framework for urban aerial vision-and-language navigation that improves generalization, path planning, and spatial understanding in complex environments.

Contribution

It introduces staged hybrid IL-RL training, a tiered decision-making mechanism, and map learning modules to improve urban aerial navigation performance.

Findings

01. Achieves state-of-the-art results on the CityNav benchmark.

02. Significantly improves navigation accuracy in complex urban scenes.

03. Enhances robustness and spatial understanding in aerial navigation tasks.

Abstract

Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative…
