The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection
Yazhe Wan, Changjae Oh (Queen Mary University of London)

TL;DR
This paper introduces Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning method that enhances vision-language models for open-vocabulary object detection by improving local feature alignment without extra inference costs.
Contribution
The paper proposes a novel decoupled fine-tuning approach (DAT) that improves VLMs for open-vocabulary detection by focusing on local features while maintaining global knowledge.
Findings
DAT improves detection of novel objects on COCO and LVIS datasets.
DAT achieves state-of-the-art performance in cooperative open-vocabulary detection.
Fine-tuning fewer than 0.8M parameters yields significant gains without inference overhead.
Abstract
Open-vocabulary object detection aims to recognize objects from an open set of categories by leveraging vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the…
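As a rough illustration of the cooperative paradigm and the pseudo-labeling step the abstract mentions, the sketch below uses toy NumPy vectors in place of real detector and VLM outputs. It is not the paper's implementation: the function names `classify_regions` and `build_pseudo_labels`, the `temperature` value, and the `score_threshold` filter are all assumptions introduced here for illustration, and the abstract does not say how detections are filtered or how region features are extracted.

```python
import numpy as np

def classify_regions(region_emb, text_emb, temperature=0.01):
    """Zero-shot labelling: match each region embedding to the closest text embedding.

    region_emb: (num_regions, dim) array, assumed to come from a VLM image encoder.
    text_emb:   (num_categories, dim) array, assumed to come from a VLM text encoder.
    """
    # L2-normalise so the dot product equals cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature
    # Numerically stable softmax over categories.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

def build_pseudo_labels(detections, score_threshold=0.5):
    """Keep confident closed-set detections as region-level pseudo-labels.

    The threshold is a hypothetical filter, not a value taken from the paper.
    """
    return [
        {"box": d["box"], "label": d["label"]}
        for d in detections
        if d["score"] >= score_threshold
    ]

# Toy inputs: three categories with orthogonal text embeddings, two regions.
text_emb = np.eye(3)
region_emb = np.array([[0.9, 0.1, 0.0],
                       [0.0, 0.2, 0.8]])
labels, probs = classify_regions(region_emb, text_emb)

# Toy closed-set detector output (boxes as x1, y1, x2, y2).
detections = [
    {"box": (10, 20, 110, 220), "label": "person", "score": 0.9},
    {"box": (50, 60, 90, 100), "label": "dog", "score": 0.3},
]
pseudo = build_pseudo_labels(detections)
```

In this toy setup each region is assigned the category whose text embedding it points toward, and only the high-scoring detection survives as a pseudo-label. In a real cooperative model the embeddings would come from the VLM applied to detector-proposed regions.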
