Retrieval-Augmented Open-Vocabulary Object Detection
Jooyeon Kim, Eulrang Cho, Sehyung Kim, Hyunwoo J. Kim

TL;DR
This paper introduces RALF, a retrieval-augmented approach for open-vocabulary object detection that enhances generalization by incorporating related negative classes and verbalized concepts, leading to improved detection of novel objects.
Contribution
The paper proposes Retrieval-Augmented Losses and visual Features (RALF), a novel method that retrieves negative classes and uses verbalized concepts to improve open-vocabulary object detection.
Findings
Improves AP on COCO novel categories by up to 3.4.
Improves mask AP on the LVIS dataset by 3.6.
Demonstrates the effectiveness of retrieval-augmented features in open-vocabulary detection.
Abstract
Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve the generalization ability to expand the knowledge of the detector, using 'positive' pseudo-labels with additional 'class' names, e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments loss functions. Also, visual features are augmented with 'verbalized concepts' of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL constitutes two losses reflecting the semantic similarity with negative vocabularies. In addition,…
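The abstract says RAL retrieves related 'negative' classes according to their semantic similarity with the class at hand. As a rough illustration of that retrieval step only, here is a minimal sketch that ranks a small vocabulary by cosine similarity to a class embedding and returns the most similar entries (hard negatives) and the least similar ones (easy negatives). The function name `retrieve_negatives`, the parameter `k`, the hard/easy split, and the hand-made 3-dimensional embeddings are all assumptions made for this example; they are not taken from the paper, and a real system would presumably use text embeddings from a VLM rather than toy vectors.

```python
import numpy as np

def retrieve_negatives(class_emb, vocab_embs, vocab_names, k=2):
    """Hypothetical helper: rank vocabulary entries by cosine similarity
    to a class embedding and return (hard, easy) negatives.

    hard = the k most similar entries, easy = the k least similar entries.
    The vocabulary is assumed NOT to contain the class itself.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    q = class_emb / np.linalg.norm(class_emb)
    v = vocab_embs / np.linalg.norm(vocab_embs, axis=1, keepdims=True)
    sims = v @ q
    order = np.argsort(-sims)  # indices sorted by descending similarity
    hard = [vocab_names[i] for i in order[:k]]
    easy = [vocab_names[i] for i in order[-k:]]
    return hard, easy

# Toy, hand-made embeddings (assumption: stand-ins for real text features).
vocab_names = ["crocodile", "lizard", "teacup", "sofa"]
vocab_embs = np.array([
    [0.9, 0.2, 0.0],
    [0.7, 0.5, 0.0],
    [0.0, 0.1, 1.0],
    [0.1, 1.0, 0.2],
])
alligator_emb = np.array([1.0, 0.1, 0.0])

hard, easy = retrieve_negatives(alligator_emb, vocab_embs, vocab_names, k=2)
print("hard negatives:", hard)
print("easy negatives:", easy)
```

How the retrieved negatives enter the loss functions is not covered here, since the abstract above is truncated before it describes the two losses in detail.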
Taxonomy
Topics: Natural Language Processing Techniques
