SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning
Hongyu Song, Rishabh Dev Yadav, Cheng Guo, and Wei Pan

TL;DR
SoraNav enables UAVs to interpret natural language instructions for 3D navigation by integrating visual reasoning and adaptive decision strategies, significantly improving success rates and efficiency in complex environments.
Contribution
The paper introduces SoraNav, a novel framework that combines multi-modal visual annotation and adaptive decision-making for zero-shot vision-language navigation of UAVs in 3D spaces.
Findings
Outperforms state-of-the-art baselines in success rate and efficiency.
Achieves a 39.3% improvement in success rate in complex 3D scenarios.
Demonstrates robust real-world UAV navigation using natural language instructions.
Abstract
Autonomous navigation under natural language instructions represents a crucial step toward embodied intelligence, enabling complex task execution in environments ranging from industrial facilities to domestic spaces. However, language-driven 3D navigation for Unmanned Aerial Vehicles (UAVs) requires precise spatial reasoning, a capability inherently lacking in current zero-shot Vision-Language Models (VLMs), which often generate ambiguous outputs and cannot guarantee geometric feasibility. Furthermore, existing Vision-Language Navigation (VLN) methods are predominantly tailored for 2.5D ground robots, rendering them unable to generalize to the unconstrained 3D spatial reasoning required for aerial tasks in small-scale, cluttered environments. In this paper, we present SoraNav, a novel framework enabling zero-shot VLM reasoning for UAV task-centric navigation. To address the…
