HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
Chengjie Fan, Cong Pan, Zijian Liu, Ningzhong Liu, Jie Qin

TL;DR
HTNav is a hybrid learning framework for urban aerial vision-and-language navigation that improves generalization, path planning, and spatial understanding in complex environments.
Contribution
It introduces staged hybrid IL-RL training, tiered decision-making, and map learning modules to enhance urban aerial navigation performance.
Findings
Achieves state-of-the-art results on the CityNav benchmark.
Significantly improves navigation accuracy in complex urban scenes.
Enhances robustness and spatial understanding in aerial navigation tasks.
Abstract
Inspired by the general Vision-and-Language Navigation (VLN) task, aerial VLN has attracted widespread attention, owing to its significant practical value in applications such as logistics delivery and urban inspection. However, existing methods face several challenges in complex urban environments, including insufficient generalization to unseen scenes, suboptimal performance in long-range path planning, and inadequate understanding of spatial continuity. To address these challenges, we propose HTNav, a new collaborative navigation framework that integrates Imitation Learning (IL) and Reinforcement Learning (RL) within a hybrid IL-RL framework. This framework adopts a staged training mechanism to ensure the stability of the basic navigation strategy while enhancing its environmental exploration capability. By integrating a tiered decision-making mechanism, it achieves collaborative…
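The staged IL-RL idea mentioned in the abstract can be illustrated with a toy schedule. This is only a sketch under loud assumptions: the abstract does not state how the stages are scheduled or weighted, so the function name `loss_weights`, the parameters `il_steps` and `il_keep`, and the choice to keep a small imitation term during the RL stage are all hypothetical and not taken from the paper.

```python
def loss_weights(step: int, il_steps: int, il_keep: float = 0.1) -> dict:
    """Return hypothetical loss weights for a two-stage IL -> RL schedule.

    Stage 1 (step < il_steps): pure imitation learning, intended to
    stabilise a basic navigation policy.
    Stage 2 (step >= il_steps): reinforcement learning, with a small
    imitation term kept as a regulariser (an assumption of this sketch).
    """
    if step < 0 or il_steps < 0:
        raise ValueError("step and il_steps must be non-negative")
    if step < il_steps:
        return {"il": 1.0, "rl": 0.0}
    return {"il": il_keep, "rl": 1.0}


def combined_loss(il_loss: float, rl_loss: float, step: int, il_steps: int) -> float:
    """Weighted sum of the two losses under the schedule above."""
    w = loss_weights(step, il_steps)
    return w["il"] * il_loss + w["rl"] * rl_loss
```

A training loop would call `combined_loss` at every step with the current imitation and reinforcement losses. A hard switch is the simplest possible schedule; a gradual ramp between the two stages would be an equally plausible reading of "staged training".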
