Boosting Open-Vocabulary Object Detection by Handling Background Samples
Ruizhe Zeng, Lu Zhang, Xu Yang, Zhiyong Liu

TL;DR
This paper introduces BIRDet, a method that improves open-vocabulary object detection by handling background samples better, using dynamic scene background modeling and partial object suppression.
Contribution
The paper proposes Background Information Representation for open-vocabulary Detector (BIRDet), which incorporates dynamic scene background modeling and partial object suppression to address CLIP's limitations in handling background samples.
Findings
Improved detection accuracy on OV-COCO and OV-LVIS benchmarks.
Enhanced background classification capabilities in open-vocabulary detectors.
Demonstrated performance gains across various models using the proposed methods.
Abstract
Open-vocabulary object detection is the task of accurately detecting objects from a candidate vocabulary list that includes both base and novel categories. Currently, numerous open-vocabulary detectors have achieved success by leveraging the impressive zero-shot capabilities of CLIP. However, we observe that CLIP models struggle to effectively handle background images (i.e. images without corresponding labels) due to their language-image learning methodology. This limitation results in suboptimal performance for open-vocabulary detectors that rely on CLIP when processing background samples. In this paper, we propose Background Information Representation for open-vocabulary Detector (BIRDet), a novel approach to address the limitations of CLIP in handling background samples. Specifically, we design Background Information Modeling (BIM) to replace the single, fixed background embedding in…
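For context, the "single, fixed background embedding" mentioned in the abstract can be pictured with a minimal sketch of how a CLIP-style open-vocabulary detector commonly scores a region proposal: cosine similarity against each category's text embedding plus one extra background vector, followed by a softmax. This illustrates that common baseline only, not BIRDet or the authors' code. The helper name `classify_region`, the toy 3-dimensional vectors, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(region_emb, class_embs, background_emb, temperature=0.01):
    """Hypothetical helper: softmax over similarities to every class
    embedding plus ONE fixed background embedding (the baseline setup).

    Returns a list of probabilities; the last entry is the background.
    """
    candidates = list(class_embs) + [background_emb]
    logits = [cosine(region_emb, c) / temperature for c in candidates]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (assumed values, not taken from CLIP).
class_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
background_emb = [0.0, 0.0, 1.0]

object_probs = classify_region([0.9, 0.1, 0.0], class_embs, background_emb)
background_probs = classify_region([0.1, 0.1, 0.9], class_embs, background_emb)

object_pred = max(range(len(object_probs)), key=object_probs.__getitem__)
background_pred = max(range(len(background_probs)), key=background_probs.__getitem__)
```

In this setup every background region, whatever the scene, has to match the same vector; the abstract identifies that fixed embedding as the component its Background Information Modeling is designed to replace.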
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection · Contrastive Language-Image Pre-training
