DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection
Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yunhai Tong, Chen Change Loy

TL;DR
DST-Det introduces a simple, efficient self-training approach leveraging pre-trained vision-language models to improve open-vocabulary object detection, enhancing recall and accuracy for novel classes without extra annotations or re-training.
Contribution
The paper proposes a self-training strategy that takes a subset of the proposals that would otherwise be treated as background and relabels them as novel classes, improving open-vocabulary detection without additional data or re-training.
Findings
Significant performance improvements on LVIS, V3Det, and COCO datasets.
Achieves 1.7% higher AP on LVIS than F-VLM.
Reaches 46.7 novel class AP on COCO without extra data.
Abstract
Open-vocabulary object detection (OVOD) aims to detect objects beyond the set of classes observed during training. This work introduces a straightforward and efficient strategy that utilizes pre-trained vision-language models (VLMs), like CLIP, to identify potential novel classes through zero-shot classification. Previous methods use a class-agnostic region proposal network to detect object proposals and consider the proposals that do not match the ground truth as background. Unlike these methods, our method selects a subset of the proposals that would be considered background and treats them as novel classes during training. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training. Compared to previous pseudo methods, our approach does not…
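The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_pseudo_novel`, the IoU and score thresholds, the softmax temperature, and the use of plain NumPy arrays in place of real CLIP region and text embeddings are all assumptions made for illustration. The idea shown is only the general one: proposals that do not match any ground-truth box are zero-shot classified against novel class names, and the confident ones are relabeled instead of being treated as background.

```python
import numpy as np

def select_pseudo_novel(region_embeds, text_embeds, max_iou_with_gt,
                        iou_thresh=0.5, score_thresh=0.8, temperature=0.01):
    """Return one label per proposal: a novel-class index, or -1 for background.

    All thresholds and the temperature are hypothetical values, not taken
    from the paper.
    """
    region = np.asarray(region_embeds, dtype=float)
    text = np.asarray(text_embeds, dtype=float)
    ious = np.asarray(max_iou_with_gt, dtype=float)

    # Cosine similarity between each proposal and each novel class name.
    region = region / np.linalg.norm(region, axis=1, keepdims=True)
    text = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = region @ text.T / temperature

    # Numerically stable softmax over the novel classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    best_class = probs.argmax(axis=1)
    best_score = probs.max(axis=1)

    # Only proposals that would otherwise be background are candidates.
    unmatched = ious < iou_thresh
    confident = best_score >= score_thresh
    return np.where(unmatched & confident, best_class, -1)


# Toy example: two novel classes, four proposals.
text = np.array([[1.0, 0.0], [0.0, 1.0]])
regions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
max_iou = np.array([0.1, 0.2, 0.0, 0.9])
labels = select_pseudo_novel(regions, text, max_iou)
```

In this toy example the first two proposals receive pseudo-labels, the third is ambiguous between the two classes and stays background, and the fourth already overlaps a ground-truth box, so it is left to the regular supervised assignment.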
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Language-Image Pre-training
