TL;DR
DetRefiner is a plug-and-play framework that fuses global and local image features with a lightweight Transformer to recalibrate the confidence scores of open-vocabulary object detectors, without retraining the base model.
Contribution
It introduces a model-agnostic refinement module that is trained separately from the base detector, so the detector itself is never retrained, and that recalibrates detection confidence by fusing global and local contextual cues with a Transformer encoder.
Findings
Improves AP on novel categories by up to 10.1 points.
Enhances multiple OVOD models across datasets like COCO and LVIS.
Operates solely on base detector predictions without retraining.
Abstract
Open-vocabulary object detection (OVOD) aims to detect both seen and unseen categories, yet existing methods often struggle to generalize to novel objects due to limited integration of global and local contextual cues. We propose DetRefiner, a simple yet effective plug-and-play framework that learns to fuse global and local features to refine open-vocabulary detection. DetRefiner processes global image features and patch-level image features from foundational models (e.g., DINOv3) through a lightweight Transformer encoder. The encoder produces a class vector capturing image-level attributes and patch vectors representing local region attributes, from which attribute reliability is inferred to recalibrate the base model's confidence. Notably, DetRefiner is trained independently of the base OVOD model, requiring neither access to its internal features nor retraining. At inference, it…
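The recalibration idea in the abstract can be illustrated with a small sketch. It is not the paper's formulation: the abstract is truncated before the inference step, so the fusion rule here (a geometric mean of global and local reliability, blended with the base score through a hypothetical weight `alpha`) is an assumption, as are the names `Detection`, `patches_in_box`, and `recalibrate` and the convention of expressing boxes in patch-grid units. The per-class probabilities `image_probs` and `patch_probs` stand in for whatever the class vector and patch vectors would yield after a classification head.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (x0, y0, x1, y1) in patch-grid units (assumed convention)
    label: int    # class index predicted by the base detector
    score: float  # base detector confidence in [0, 1]


def patches_in_box(box, grid_w, grid_h):
    """Return indices of grid cells whose centers fall inside the box."""
    x0, y0, x1, y1 = box
    return [
        row * grid_w + col
        for row in range(grid_h)
        for col in range(grid_w)
        if x0 <= col + 0.5 <= x1 and y0 <= row + 0.5 <= y1
    ]


def recalibrate(det, image_probs, patch_probs, grid_w, grid_h, alpha=0.5):
    """Blend the base score with global and local reliability.

    The blending rule is an illustrative assumption, not the paper's.
    """
    idx = patches_in_box(det.box, grid_w, grid_h)
    global_rel = image_probs[det.label]
    if idx:
        local_rel = sum(patch_probs[i][det.label] for i in idx) / len(idx)
    else:
        local_rel = global_rel  # no patch center inside the box: use global only
    reliability = (global_rel * local_rel) ** 0.5
    refined = det.score ** (1 - alpha) * reliability ** alpha
    return Detection(det.box, det.label, refined)


# Toy example: a 2x2 patch grid and two classes.
image_probs = [0.9, 0.1]                      # image-level class evidence
patch_probs = [[0.8, 0.2], [0.9, 0.1],        # top row of patches
               [0.1, 0.9], [0.2, 0.8]]        # bottom row of patches
dets = [Detection((0, 0, 2, 1), 0, 0.4),      # same box, competing labels
        Detection((0, 0, 2, 1), 1, 0.4)]
refined = [recalibrate(d, image_probs, patch_probs, 2, 2) for d in dets]
```

In this toy case both detections start with the same base score, and the one whose label agrees with the global and local evidence is pushed up while the other is pushed down. Only the detector's output boxes, labels, and scores are consumed, which matches the abstract's statement that no internal features of the base model are needed.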
