Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani

TL;DR
This paper evaluates the effectiveness of open-vocabulary versus closed-set object detection in few-shot scenarios, introducing a measure of text-describability to guide dataset categorization and method selection.
Contribution
The paper proposes a way to quantify a dataset's text-describability and empirically compares open-vocabulary (OVD) and closed-set (COD) object detection methods across dataset categories of differing describability.
Findings
OVD and COD perform similarly on low text-describability classes.
Increasing training data volume with OVD can be counterproductive for low-describability classes.
The proposed measure helps guide practitioners in choosing appropriate detection methods.
Abstract
Open-vocabulary object detection (OVD), detecting specific classes of objects using only their linguistic descriptions (e.g., class names) without any image samples, has garnered significant attention. However, in real-world applications, the target class concepts are often hard to describe in text, and the only way to specify target objects is to provide image examples of them, yet it is often challenging to obtain a good number of samples. Thus, there is a high demand from practitioners for few-shot object detection (FSOD). A natural question arises: Can the benefits of OVD extend to FSOD for object classes that are difficult to describe in text? Compared to traditional methods that learn only predefined classes (referred to in this paper as closed-set object detection, COD), can the extra cost of OVD be justified? To answer these questions, we propose a method to quantify the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training
