TL;DR
This paper introduces a hierarchical confidence calibration method and LoCLIP, a parameter-efficient CLIP adaptation, to improve open-vocabulary object detection by enhancing class label reliability and objectness scoring.
Contribution
The paper proposes a pseudo labeling framework that combines hierarchical confidence calibration (HCC) with LoCLIP to address two problems in open-vocabulary detection: inaccurate class label assignments and unreliable objectness scores.
Findings
Achieves state-of-the-art results on COCO and LVIS benchmarks.
Improves class label reliability through hierarchical semantic consistency.
Enhances objectness estimation for novel classes with LoCLIP.
Abstract
Conventional object detectors typically operate under a closed-set assumption, limiting recognition to a predefined set of base classes seen during training. Open-vocabulary object detection (OVD) addresses this limitation by leveraging vision-language models (VLMs) to generate pseudo labels for novel object classes. However, existing OVD methods suffer from two critical drawbacks: (1) inaccurate class label assignments, as VLMs are optimized for image-level predictions rather than the region-level predictions required for pseudo labeling, and (2) unreliable objectness scores from region proposal networks (RPNs) trained exclusively on base object classes. To address these issues, we propose a novel pseudo labeling framework for OVD. Our approach introduces a hierarchical confidence calibration (HCC) technique, which ensures reliable class label estimation by assessing consistency across…
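To make the pseudo labeling step concrete, here is a minimal sketch of the generic baseline the abstract describes: a region embedding is compared against class-name text embeddings, and a label is kept only when the resulting confidence is high. This is not the paper's HCC or LoCLIP. The function names, the toy embeddings, the temperature, and the threshold are all assumptions made for illustration; in practice the embeddings would come from a VLM such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(region_embedding, text_embeddings, class_names,
                 threshold=0.8, temperature=0.01):
    """Assign a pseudo label to one region, or None if confidence is too low.

    Hypothetical helper: threshold and temperature are illustrative values,
    not settings taken from the paper.
    """
    sims = [cosine(region_embedding, t) for t in text_embeddings]
    probs = softmax(sims, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None  # ambiguous region: discard instead of guessing a label
    return class_names[best], probs[best]


# Toy 2-D embeddings standing in for VLM outputs (assumed values).
names = ["cat", "dog"]
text = [[1.0, 0.0], [0.0, 1.0]]
confident = pseudo_label([1.0, 0.0], text, names)
ambiguous = pseudo_label([1.0, 1.0], text, names)
```

A region that matches one class prompt clearly receives that label, while a region equally similar to both prompts is discarded. The abstract's first criticism applies to exactly this kind of single-score decision: the similarity comes from a model optimized for image-level rather than region-level predictions, so one confidence value can be misleading.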
