The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection

Yazhe Wan; Changjae Oh (Queen Mary University of London)

arXiv:2605.03642 · cs.CV · May 6, 2026

TL;DR

This paper introduces Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning method that enhances vision-language models for open-vocabulary object detection by improving local feature alignment without extra inference costs.

Contribution

The paper proposes a novel decoupled fine-tuning approach (DAT) that improves VLMs for open-vocabulary detection by focusing on local features while maintaining global knowledge.

Findings

01

DAT improves detection of novel objects on COCO and LVIS datasets.

02

DAT achieves state-of-the-art performance in cooperative open-vocabulary detection.

03

Fine-tuning fewer than 0.8M parameters yields significant gains without inference overhead.

Abstract

Open-vocabulary object detection aims to recognize objects from an open set of categories by leveraging vision-language models (VLMs) pre-trained on large-scale image-text data. The cooperative paradigm combines an object detector with a VLM to achieve zero-shot recognition of novel objects. However, VLMs pre-trained on full images often struggle to capture local object details, limiting their effectiveness when applied to region-level detection. We present Decoupled Adaptivity Training (DAT), a self-supervised fine-tuning approach to improve VLMs for cooperative model-based object detection. Given a cooperative model consisting of a closed-set detector and a VLM, we first construct a region-aware pseudo-labeled dataset using a pre-trained closed-set object detector, in which regions corresponding to novel objects may be present but remain unlabeled or mislabeled. We then fine-tune the…
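
The pseudo-labeling step named in the abstract could look roughly like the sketch below. It is an illustration only, not the authors' implementation: the `Detection` type, the `build_pseudo_labels` helper, the sample detections, and the 0.5 confidence threshold are all hypothetical, since the abstract does not say how detector outputs are filtered.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; a hypothetical box convention.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """One output of a pre-trained closed-set detector (hypothetical type)."""
    box: Box
    label: str    # always one of the detector's known (base) classes
    score: float  # detector confidence in [0, 1]


def build_pseudo_labels(
    detections: List[Detection], min_score: float = 0.5
) -> List[Tuple[Box, str]]:
    """Keep confident detections as (region, pseudo-label) pairs.

    The threshold is an assumption made for this sketch.
    """
    return [(d.box, d.label) for d in detections if d.score >= min_score]


# Hypothetical detector output for a single image.
detections = [
    Detection((10.0, 20.0, 110.0, 220.0), "person", 0.92),
    Detection((150.0, 40.0, 300.0, 180.0), "dog", 0.31),
    Detection((5.0, 5.0, 60.0, 60.0), "cup", 0.77),
]

pseudo_labels = build_pseudo_labels(detections)
```

Because the detector only knows its closed label set, a novel object in data built this way is either dropped or tagged with a base-class name, which is consistent with the abstract's remark that such regions "remain unlabeled or mislabeled".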
