OpenFly: A comprehensive platform for aerial vision-language navigation

Yunpeng Gao; Chenhui Li; Zhongrui You; Junli Liu; Zhen Li; Pengan Chen; Qizhi Chen; Zhonghan Tang; Liansheng Wang; Penghui Yang; Yiwen Tang; Yuhang Tang; Shuai Liang; Songyi Zhu; Ziqin Xiong; Yifei Su; Xinyi Ye; Jianan Li; Yan Ding; Dong Wang; Xuelong Li; Zhigang Wang; Bin Zhao

arXiv:2502.18041·cs.CV·March 3, 2026


Open Access · 1 Model · 3 Reviews

TL;DR

OpenFly is a comprehensive platform that advances aerial vision-language navigation by providing a versatile simulation environment, a large-scale dataset, and a novel agent model, facilitating research in outdoor aerial embodied AI.

Contribution

We introduce OpenFly, a new platform with diverse rendering engines, an automated data collection toolchain, and a large-scale aerial VLN dataset, addressing the lack of benchmarks in outdoor aerial navigation.

Findings

01

OpenFly-Agent outperforms existing VLN methods in aerial navigation tasks.

02

The dataset includes 100k trajectories across 18 diverse scenes.

03

The OpenFly-Agent model emphasizes key observations, improving navigation accuracy.

Abstract

Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. The potential reason is that outdoor aerial view encompasses vast areas, making data collection more challenging, which results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. Firstly, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper directly tackles the most significant bottleneck in Aerial VLN. The automated toolchain is a highly valuable and practical contribution, drastically lowering the barrier to data collection.
2. The 100k-trajectory dataset is the largest to date. More importantly, the integration of four distinct rendering engines, especially the use of 3D GS for real-world reconstruction, ensures exceptional environmental diversity and realism.
3. The OpenFly-Agent, with its keyframe-aware selection…

Weaknesses

1. Relying on the A* data-generation pipeline introduces two problems: (1) The path style is unnatural, filled with "robotic sharp turns" instead of smooth, human-like flight. (2) The data is pure "expert demonstration", meaning the model only learns to follow perfect paths, not to recover from deviations. This makes the model extremely fragile to real-world disturbances.
2. The system uses discrete motion control, which, in practice, is more like performing control classification based on visual…
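The "sharp turns" concern in point 1 can be illustrated with a minimal sketch. This is not the OpenFly toolchain: the 4-connected occupancy grid, the unit step cost, and the helper names `astar` and `count_turns` are assumptions made only for illustration. On such a grid every heading change is a 90-degree turn, which is the kind of path shape the reviewer describes.

```python
import heapq

def astar(grid, start, goal):
    """Shortest path on a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])

    def h(p):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]   # entries are (f, g, cell)
    came_from = {start: None}
    best_g = {start: 0}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > best_g[cur]:
            continue  # stale heap entry, a cheaper route was found later
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                came_from[nxt] = cur
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable

def count_turns(path):
    """Number of heading changes along a path of grid cells."""
    headings = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for u, v in zip(headings, headings[1:]) if u != v)

# A wall in the middle row forces a detour around its right end.
grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
print(path, "turns:", count_turns(path))
```

The planner returns a shortest path, but it has no notion of smoothness: the detour consists of straight segments joined by right-angle turns. Any smoothing or recovery behaviour would have to be added on top of such a planner.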

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Innovative Data Generation Platform: OpenFly integrates four rendering engines (Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting), which enhances the diversity of training environments for aerial vision-language navigation (VLN). This combination provides a wide range of realistic simulation environments for training models.
2. Automated Data Generation Toolchain: The platform features an automated toolchain for data collection, semantic segmentation, trajectory generation, …

Weaknesses

1. Why use such an old model, Llama2-7b, as the baseline?
2. The paper cites OpenUAV, but why isn't there a comparison with their method?
3. VTM appears to be a pooling layer, but there's no explanation for why the performance improvement is so significant.
4. Are keyframes selected based on rules? What would be the difference if they were selected uniformly? Would the performance decrease if important frames are missed?
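The contrast raised in question 4 can be made concrete with a small sketch. It is not the selection rule used by OpenFly-Agent, which this page does not describe: the action names and the "keep frames where the action changes" rule in `change_keyframes` are hypothetical stand-ins for a rule-based selector, and `uniform_keyframes` is plain even spacing.

```python
def uniform_keyframes(num_frames, k):
    """Pick k frame indices evenly spaced over [0, num_frames - 1]."""
    if k <= 0 or num_frames <= 0:
        return []
    if k >= num_frames:
        return list(range(num_frames))
    if k == 1:
        return [0]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]

def change_keyframes(actions):
    """Hypothetical rule: keep frame 0 and every frame whose action
    differs from the action of the previous frame."""
    return [i for i, a in enumerate(actions) if i == 0 or a != actions[i - 1]]

# Hypothetical 10-step episode with discrete actions.
actions = ["forward"] * 4 + ["turn_left"] + ["forward"] * 3 + ["ascend"] * 2

rule_based = change_keyframes(actions)
uniform = uniform_keyframes(len(actions), len(rule_based))
missed = sorted(set(rule_based) - set(uniform))

print("rule-based:", rule_based)
print("uniform:   ", uniform)
print("rule-based frames missed by uniform sampling:", missed)
```

With the same frame budget, uniform sampling in this toy episode skips the frames where the action changes. Whether that hurts navigation performance is exactly the ablation the reviewer is asking for; the sketch only shows how the two index sets can diverge.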

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. This paper is well written. It clearly illustrates the limitations of previous methods and how the newly proposed dataset addresses them.
2. The proposed dataset contains a high diversity of scenes (18 in total), some of which correspond to real-world scenes.
3. This paper proposes a strong baseline based on the performant OpenVLA model, outperforming other baselines by a large margin.
4. The trained OpenVLA agent shows strong generalization to unseen scenes during testing.
5. The trajector…

Weaknesses

1. Although this paper provides a detailed evaluation of OpenVLA agents, it lacks analysis of the effectiveness of the proposed dataset. In other words, this paper does not address the question "Do the diversity and quality of collected navigation trajectories facilitate more performant navigation agents, compared to other datasets?". I'd recommend the authors train OpenVLA on both the OpenFly and OpenUAV datasets, and test on unseen scenes to highlight the effectiveness of the proposed dataset.
2. Thi…

Code & Models

Models

🤗
IPEC-COMMUNITY/openfly-agent-7b
model · 227 downloads

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Infrared Target Detection Methodologies