Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation
Zhenxing Xu, Brikit Lu, Weidong Bao, Zhengqiu Zhu, Junsong Zhang, Hui Yan, Wenhao Lu, Ji Wang

TL;DR
Fly0 introduces a novel framework that separates semantic understanding from geometric planning in aerial navigation, enhancing robustness, reducing latency, and improving success rates in complex environments.
Contribution
The paper presents Fly0, a three-stage decoupled system that integrates multimodal language reasoning with geometric planning for zero-shot aerial navigation.
Findings
Outperforms state-of-the-art baselines in success rate by over 20%
Reduces navigation error by approximately 50%
Operates efficiently without continuous inference
Abstract
Current Visual-Language Navigation (VLN) methodologies face a trade-off between semantic understanding and control precision. While Multimodal Large Language Models (MLLMs) offer superior reasoning, deploying them as low-level controllers leads to high latency, trajectory oscillations, and poor generalization due to weak geometric grounding. To address these limitations, we propose Fly0, a framework that decouples semantic reasoning from geometric planning. The proposed method operates through a three-stage pipeline: (1) an MLLM-driven module for grounding natural language instructions into 2D pixel coordinates; (2) a geometric projection module that utilizes depth data to localize targets in 3D space; and (3) a geometric planner that generates collision-free trajectories. This mechanism enables robust navigation even when visual contact is lost. By eliminating the need for continuous…
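Stage (2) of the pipeline, lifting a grounded 2D pixel into 3D using depth, can be illustrated with the standard pinhole camera model. The sketch below is not the authors' implementation: the function names, the intrinsics (fx, fy, cx, cy), and the camera pose are all hypothetical values chosen for illustration. It assumes a metric depth reading at the target pixel and a known camera-to-world rotation and translation.

```python
def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model).

    Assumed convention: x points right, y points down, z points forward.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def camera_to_world(p_cam, rotation, translation):
    """Apply a rigid transform: p_world = R * p_cam + t.

    rotation is a 3x3 nested sequence, translation a length-3 sequence.
    """
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical example: a target grounded at pixel (420, 240) with 5 m depth.
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
target_cam = backproject_pixel(420, 240, 5.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
target_world = camera_to_world(target_cam, IDENTITY, (10.0, 0.0, 2.0))
```

Once the target is stored as a world-frame point, a planner can keep steering toward it without re-querying the MLLM, which is consistent with the abstract's claim that navigation remains robust when visual contact is lost.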
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robot Manipulation and Learning
