RT-OVAD: Real-Time Open-Vocabulary Aerial Object Detection via Image-Text Collaboration
Guoting Wei, Xia Yuan, Yu Liu, Zhenhao Shang, Xizhe Xue, Peng Wang, Kelu Yao, Chunxia Zhao, Haokui Zhang, Rong Xiao

TL;DR
RT-OVAD is a real-time open-vocabulary aerial object detector that uses image-text collaboration to detect objects beyond a fixed set of predefined categories, reporting 87.7 AP50 at 34 FPS.
Contribution
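The authors present RT-OVAD as the first real-time open-vocabulary detector for aerial scenes. It replaces the conventional category regression loss with an image-to-text alignment loss and adds a lightweight image-text collaboration strategy made up of an image-text collaboration encoder and a text-guided decoder.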
Findings
Outperforms state-of-the-art methods on multiple benchmarks.
Achieves 87.7 AP50 and 34 FPS, demonstrating high accuracy and real-time speed.
Effective in open-vocabulary, zero-shot, and closed-set detection tasks.
Abstract
Aerial object detection plays a crucial role in numerous applications. However, most existing methods focus on detecting predefined object categories, limiting their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose RT-OVAD, the first real-time open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category constraints. Next, we propose a lightweight image-text collaboration strategy comprising an image-text collaboration encoder and a text-guided decoder. The encoder simultaneously enhances visual features and refines textual embeddings, while the decoder guides object queries to focus on class-relevant image features. This design further improves…
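The idea of replacing a fixed-category classifier with image-to-text alignment can be illustrated with a small sketch. The abstract does not give the exact form of the loss, so the version below is an assumption: it scores each object query against each class-name embedding by cosine similarity and applies a sigmoid binary cross-entropy. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def alignment_logits(query_feats, text_embeds, temperature=0.07):
    """Score every object query against every class-name embedding.

    query_feats: (num_queries, dim) visual features, one per object query.
    text_embeds: (num_classes, dim) embeddings of the class names.
    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    q = l2_normalize(query_feats)
    t = l2_normalize(text_embeds)
    return (q @ t.T) / temperature


def alignment_loss(query_feats, text_embeds, targets, temperature=0.07):
    """Sigmoid binary cross-entropy over the similarity logits (assumed form).

    targets: (num_queries, num_classes) array of 0/1 match indicators.
    """
    z = alignment_logits(query_feats, text_embeds, temperature)
    # Numerically stable BCE-with-logits: max(z, 0) - z*y + log(1 + exp(-|z|)).
    per_pair = np.maximum(z, 0.0) - z * targets + np.log1p(np.exp(-np.abs(z)))
    return float(per_pair.mean())


# Toy data: 3 class-name embeddings and 2 object queries in a 4-d space.
text_embeds = np.eye(3, 4)
query_feats = np.array([[2.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 5.0, 0.0]])
logits = alignment_logits(query_feats, text_embeds)
predicted = logits.argmax(axis=1)

matched = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
mismatched = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
loss_matched = alignment_loss(query_feats, text_embeds, matched)
loss_mismatched = alignment_loss(query_feats, text_embeds, mismatched)
```

Because the class dimension comes from `text_embeds` rather than from a fixed output layer, adding a category at inference time in this sketch only means appending another row of text embeddings.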
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications
