TL;DR
This paper introduces OS-W2S, an automatic labeling engine for language-guided open-set aerial object detection, and uses it to construct MI-OAD, a large-scale dataset intended to improve fine-grained detection and grounding in aerial imagery.
Contribution
The paper presents a novel automatic annotation pipeline and a large-scale dataset for language-guided open-set aerial detection, enabling significant performance improvements and zero-shot transfer capabilities.
Findings
MI-OAD contains 163,023 images and 2 million captions, 40 times larger than comparable datasets.
Training on MI-OAD improves AP50 by +31.1 and Recall@10 by +34.7 for zero-shot detection.
Pre-training on MI-OAD achieves state-of-the-art results on multiple benchmarks.
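The findings above are reported in AP50 and Recall@10. As a rough illustration of what a Recall@k number measures in a grounding setting, here is a minimal sketch. It assumes one ground-truth box per text query, boxes given as (x1, y1, x2, y2), predictions already sorted by descending confidence, and an IoU threshold of 0.5. The function names and these conventions are illustrative assumptions, not the paper's evaluation code, and the paper's exact protocol may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_preds: List[List[Box]],
    gt_boxes: List[Box],
    k: int = 10,
    iou_thr: float = 0.5,
) -> float:
    """Fraction of queries whose ground-truth box is matched by any of the
    top-k predictions (hypothetical helper; one ground-truth box per query)."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        any(iou(pred, gt) >= iou_thr for pred in preds[:k])
        for preds, gt in zip(ranked_preds, gt_boxes)
    )
    return hits / len(gt_boxes)


if __name__ == "__main__":
    # Two toy queries: the second is only matched by its second-ranked box.
    preds = [
        [(0, 0, 10, 10)],
        [(50, 50, 60, 60), (0, 0, 10, 10)],
    ]
    gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
    print(recall_at_k(preds, gts, k=1))
    print(recall_at_k(preds, gts, k=2))
```

Under this reading, a +34.7 gain in Recall@10 would mean that, for roughly 35 more queries out of every 100, at least one of the ten highest-scoring boxes overlaps the target sufficiently.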
Abstract
In recent years, language-guided open-set aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary-level descriptions, which fail to meet the demands of fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial…
Peer Reviews
Decision: Submitted to ICLR 2026
- Visual grounding has significant value and wide applications, yet existing datasets are not large enough to support the task. This paper proposes an automated way to generate grounding datasets using VLMs. The dataset will advance research in this direction.
- The labeling pipeline is novel for aerial domains, combining structured preprocessing, VLM interaction, and BERT-based postprocessing. MI-OAD’s scale and multi-granularity annotation approach make it a comprehensive dataset for open-s
- Although sourced from eight aerial datasets, details about geographic, environmental, or temporal diversity are sparse. It is unclear whether MI-OAD adequately represents different regions, seasons, or sensor modalities.
- The label engine relies heavily on a single chosen VLM (InternVL-2.5-38B-AWQ), and the paper does not assess how dataset quality varies across models, e.g., by evaluating with other VLMs.
- Only a very small portion of the dataset is manually reviewed (0.5% of data). The genera
- The paper tackles the lack of large-scale language-grounded datasets in the aerial domain, which is a real bottleneck for open-set detection research. MI-OAD is significantly larger and more diverse than existing aerial grounding datasets.
- Experiments are extensive and show clear improvements across several downstream benchmarks. The dataset and code are publicly released, making the work reproducible and potentially useful to the community.
- The discussion of related work focuses almost entirely on model architectures rather than dataset construction. Since this paper’s main contribution is a dataset and annotation pipeline, it should instead position the work within the context of existing dataset-building methodologies. A detailed quantitative comparison with prior aerial or language-grounded datasets is missing. The paper should explicitly articulate what is new about the proposed pipeline beyond scale, and how its annotation s
The paper is well organized. The proposed MI-OAD dataset is a valuable, large-scale dataset for the community. Experiments on YOLO-World and Grounding DINO demonstrated that MI-OAD can improve the model's performance in aerial object detection.
The core of the paper is to use a VLM to generate annotations, but the VLM itself may have biases (such as preferences for specific colors and shapes) and carries the risk of creating "illusions". Although the paper validates this with a stronger model in Section 4, "Quality Control Analysis", it does not fundamentally avoid the problem. In Section 5.4, key terms like "OPT-RSVG" and "DIOR-RSVG" are not defined. Why is the “Grounding DINO (+MI-PAD)” configuration in Table 2 much lower than the “LPVA” baselin
Taxonomy
Methods: Attention Is All You Need · Layer Normalization · Softmax · Residual Connection · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels · Focus
