AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
Xiaolou Sun, Wufei Si, Wenhui Ni, Yuntian Li, Dongming Wu, Fei Xie, Runwei Guan, He-Yang Xu, Henghui Ding, Yuan Wu, Yutao Yue, Yongming Huang, Hui Xiong

TL;DR
AutoFly is a vision-language-action model that lets UAVs navigate complex outdoor environments autonomously by combining depth perception, continuous planning, and obstacle avoidance; it outperforms existing baselines.
Contribution
The paper introduces AutoFly, a novel end-to-end VLA model with a pseudo-depth encoder and a new real-world autonomous navigation dataset, advancing UAV autonomous navigation in unstructured environments.
Findings
AutoFly achieves a success rate 3.9 percentage points higher than state-of-the-art baselines (47.9% vs. 44% for OpenVLA).
The model performs consistently across simulated and real-world environments.
The new dataset emphasizes autonomous planning and obstacle avoidance.
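The headline success-rate gain can be read as either an absolute or a relative improvement. The short sketch below works through both, using the 47.9% and 44% figures quoted in the peer reviews; the function name `success_rate_gain` is a hypothetical helper, not something from the paper.

```python
def success_rate_gain(model_sr, baseline_sr):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = model_sr - baseline_sr
    relative = 100.0 * absolute / baseline_sr
    return absolute, relative

# Success rates (in percent) as quoted in the reviews: AutoFly vs. OpenVLA.
absolute_gain, relative_gain = success_rate_gain(47.9, 44.0)

print(f"absolute gain: {absolute_gain:.1f} percentage points")  # 3.9
print(f"relative gain: {relative_gain:.1f}%")                   # 8.9
```

Under this reading, the reported "3.9%" is an absolute difference in success rate, which corresponds to a relative improvement of roughly 8.9% over the baseline.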
Abstract
Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial…
Peer Reviews
Decision: ICLR 2026 Poster
1. The novel integration of a pseudo-depth encoder. This enhances the model's geometric reasoning for obstacle avoidance and safe navigation without needing physical depth sensors, effectively bridging a critical gap for real-world deployment where detailed environmental data is unavailable.
2. The creation of a comprehensive autonomous navigation dataset. The dataset uniquely emphasizes real-world challenges like continuous obstacle avoidance and includes real flight data, facilitating robust s
1. The VLM approach may be more computationally intensive than the more efficient reinforcement learning (RL) approach to obstacle avoidance and navigation. The authors should clarify why the VLM approach cannot be replaced by established RL methods when only obstacle avoidance and basic navigation are required.
2. The Pseudo-Depth Encoder showed only a 4% performance improvement in the ablation study, raising doubts about the module's effectiveness. Could you provide examples of extreme scenarios such as dyn
1. The paper correctly identifies a significant limitation in existing UAV VLN research: an over-reliance on detailed, step-by-step instructions that are often unavailable in real-world, unknown environments. The proposed shift to a paradigm using only coarse directional guidance is a practical and valuable step toward more robust, autonomous agents that can operate with minimal human guidance.
2. The authors recognize that existing datasets are ill-suited for this new, autonomous navigation tas
1. The title and abstract promise autonomous navigation "in the wild". However, the real-world experiments are limited in scope and do not support this claim. The paper states that real-world data is acquired "within controlled laboratory environments". The visualization of the real-world test in Figures 5 and 15 clearly shows a structured, indoor lab setting, not a dynamic "wild" environment.
2. The model's formulation defines the policy as taking only the current RGB observation $o_t$ as input,
- The shift from detailed instruction-following VLN to autonomous navigation with coarse guidance addresses a real deployment gap.
- Using monocular depth estimation (Depth Anything V2) instead of depth sensors is elegant and practical. It avoids sim-to-real depth sensor gaps while adding spatial reasoning capabilities with only RGB cameras.
- The shared-weight design for depth and visual token projection is simple yet effective, enforcing consistent cross-modal representations.
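The shared-weight projection mentioned in this review can be illustrated with a minimal sketch. It assumes, purely for illustration, that the visual and pseudo-depth encoders both emit tokens of the same feature size and that a single linear map projects both into the language model's embedding space. The dimensions, the helper names `make_projector` and `project`, and the random stand-in tokens are all hypothetical and are not taken from the paper.

```python
import random

def make_projector(in_dim, out_dim, seed=0):
    """Create one (out_dim x in_dim) weight matrix for a bias-free linear map."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(tokens, weight):
    """Apply the same linear map to every token (each token is a list of floats)."""
    return [[sum(w * x for w, x in zip(row, token)) for row in weight]
            for token in tokens]

# Hypothetical sizes: encoder feature dimension and language-model embedding dimension.
FEATURE_DIM, LLM_DIM = 8, 16
shared_weight = make_projector(FEATURE_DIM, LLM_DIM)

# Random stand-ins for RGB-encoder tokens and pseudo-depth-encoder tokens.
rng = random.Random(1)
visual_tokens = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(4)]
depth_tokens = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(4)]

# Both modalities pass through the SAME weights, so they land in one embedding space.
visual_embedded = project(visual_tokens, shared_weight)
depth_embedded = project(depth_tokens, shared_weight)

# Concatenate along the sequence axis before handing the tokens to the language model.
fused_sequence = visual_embedded + depth_embedded
```

Because one weight matrix serves both token streams, identical features from either modality map to identical embeddings, which is one plausible reading of the "consistent cross-modal representations" the reviewer describes.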
- The 3.9% improvement in success rate over OpenVLA (47.9% vs 44%) is modest, given the additional pseudo-depth encoder, depth generator, and specialized projectors. The paper lacks analysis to show whether simpler approaches (e.g., better vision encoders or data augmentation) could achieve similar gains without architectural changes.
- The average obstacle encounter rate of 10 is significantly lower than comparable UAV datasets (AerialVLN: 83, OpenUAV: 104, CityNav: 26). This contradicts the pa
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
