FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

Dian Shao; Zhengzheng Xu; Peiyang Wang; Like Liu; Yule Wang; Jieqi Shi; Jing Huo

arXiv:2604.16298·cs.CV·April 20, 2026

TL;DR

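FineCog-Nav is a cognition-inspired modular framework for zero-shot UAV vision-language navigation. It assigns each fine-grained module its own moderate-sized foundation model to improve interpretability and navigation performance in complex 3D environments, and it comes with a new benchmark, AerialVLN-Fine, for fine-grained evaluation.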

Contribution

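The paper presents a modular architecture inspired by human cognition, in which each module uses role-specific prompts and structured input-output protocols. It also introduces AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, for fine-grained evaluation.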

Findings

01 Outperforms zero-shot baselines in instruction adherence

02 Improves long-horizon planning and navigation accuracy

03 Demonstrates better generalization to unseen environments

Abstract

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with…
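
The abstract describes modules that each carry a role-specific prompt and exchange data through structured input-output protocols. The sketch below is one hypothetical way such a pipeline could be wired in Python. Only the seven module names come from the abstract. The execution order, the state keys (`subgoals`, `scene`, `focus`, and so on), the prompt wording, and the names `ModuleSpec`, `build_pipeline`, `run_step`, and `stub_model` are assumptions made for illustration and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "model" here is any callable taking (role_prompt, inputs) and returning text.
# In practice this would wrap a foundation model; that wrapper is not shown.
ModelFn = Callable[[str, Dict[str, str]], str]


@dataclass(frozen=True)
class ModuleSpec:
    """Hypothetical protocol for one module: what it reads and what it writes."""
    name: str
    role_prompt: str
    input_keys: Tuple[str, ...]
    output_key: str


def build_pipeline() -> List[ModuleSpec]:
    # Module names follow the abstract; the wiring between them is assumed.
    wiring = [
        ("language_processing", ("instruction",), "subgoals"),
        ("perception", ("observation",), "scene"),
        ("attention", ("subgoals", "scene"), "focus"),
        ("memory", ("focus",), "memory_summary"),
        ("imagination", ("subgoals", "memory_summary"), "expected_view"),
        ("reasoning", ("focus", "memory_summary", "expected_view"), "rationale"),
        ("decision_making", ("rationale",), "action"),
    ]
    return [
        ModuleSpec(
            name=name,
            role_prompt=f"You are the {name} module of a UAV navigation agent.",
            input_keys=keys,
            output_key=out,
        )
        for name, keys, out in wiring
    ]


def run_step(specs: List[ModuleSpec], model: ModelFn,
             state: Dict[str, str]) -> Dict[str, str]:
    """Run every module once, passing only the keys its protocol declares."""
    state = dict(state)  # copy so the caller's dict is left untouched
    for spec in specs:
        missing = [k for k in spec.input_keys if k not in state]
        if missing:
            raise KeyError(f"{spec.name} is missing inputs: {missing}")
        inputs = {k: state[k] for k in spec.input_keys}
        state[spec.output_key] = model(spec.role_prompt, inputs)
    return state


def stub_model(role_prompt: str, inputs: Dict[str, str]) -> str:
    # Placeholder standing in for a real model call.
    return f"stub output from keys {sorted(inputs)}"


if __name__ == "__main__":
    result = run_step(
        build_pipeline(),
        stub_model,
        {"instruction": "fly past the tower, then land on the roof",
         "observation": "<egocentric frame>"},
    )
    print(result["action"])
```

Declaring each module's inputs and output explicitly keeps the data flow inspectable: every intermediate value stays in the returned state, which is one plausible reading of the interpretability benefit the abstract attributes to structured protocols.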

Code & Models

Repositories

https://smartdianlab.github.io/projects-FineCogNav

Datasets

Lozumi/AerialVLN-Fine
dataset · 149 dl
