RT-OVAD: Real-Time Open-Vocabulary Aerial Object Detection via Image-Text Collaboration

Guoting Wei; Xia Yuan; Yu Liu; Zhenhao Shang; Xizhe Xue; Peng Wang; Kelu Yao; Chunxia Zhao; Haokui Zhang; Rong Xiao

arXiv:2408.12246 · cs.CV · July 11, 2025 · 2 cites

TL;DR

RT-OVAD is a real-time open-vocabulary detector for aerial scenes. It uses image-text collaboration to detect objects beyond a predefined category set, reporting 87.7 AP50 at 34 FPS.

Contribution

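The authors present RT-OVAD as the first real-time open-vocabulary detector for aerial scenes. It replaces the conventional category regression loss with an image-to-text alignment loss and adds a lightweight image-text collaboration strategy consisting of an image-text collaboration encoder and a text-guided decoder.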

Findings

01 Outperforms state-of-the-art methods on multiple benchmarks.

02 Achieves 87.7 AP50 at 34 FPS, combining high accuracy with real-time speed.

03 Effective in open-vocabulary, zero-shot, and closed-set detection tasks.

Abstract

Aerial object detection plays a crucial role in numerous applications. However, most existing methods focus on detecting predefined object categories, limiting their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose RT-OVAD, the first real-time open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category constraints. Next, we propose a lightweight image-text collaboration strategy comprising an image-text collaboration encoder and a text-guided decoder. The encoder simultaneously enhances visual features and refines textual embeddings, while the decoder guides object queries to focus on class-relevant image features. This design further improves…
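The idea of replacing a fixed-category classification head with image-to-text alignment can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the loss formula, so the cosine-similarity scoring, the sigmoid binary cross-entropy, the `scale` value, and the function names below are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_logits(query_feats, text_embeds, scale=10.0):
    """Score every object query against every category text embedding.

    Each logit is a scaled cosine similarity. The number of categories is
    simply len(text_embeds), so it is not baked into the model's weights.
    `scale` is a hypothetical temperature, not a value from the paper.
    """
    queries = [l2_normalize(q) for q in query_feats]
    texts = [l2_normalize(t) for t in text_embeds]
    return [
        [scale * sum(a * b for a, b in zip(q, t)) for t in texts]
        for q in queries
    ]


def alignment_loss(logits, targets):
    """Mean binary cross-entropy over all (query, category) pairs.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    This is an assumed stand-in for the paper's alignment loss.
    """
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, targets):
        for z, y in zip(logit_row, target_row):
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Toy example: two object queries, two category prompts (hypothetical values).
query_feats = [[1.0, 0.0], [0.0, 2.0]]
text_embeds = [[2.0, 0.0], [0.0, 1.0]]  # e.g. embeddings for "plane", "ship"
targets = [[1.0, 0.0], [0.0, 1.0]]      # query 0 matches "plane", query 1 "ship"

logits = alignment_logits(query_feats, text_embeds)
loss = alignment_loss(logits, targets)
```

Under this formulation, adding a category at inference time only requires appending another text embedding to `text_embeds`; no output layer has to be resized or retrained, which is the sense in which an alignment loss removes category constraints.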

Code & Models

Repositories

GT-Wei/RT-OVAD
Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications