TL;DR
This paper introduces SIA-OVD, a shape-invariant adapter that improves open-vocabulary object detection by bridging the gap between image and region representations caused by shape deformation, leading to better classification accuracy.
Contribution
The paper proposes a novel shape-invariant adapter with an adapter allocation mechanism to align region features with text representations in OVD tasks.
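To make "adapter allocation" concrete, here is a minimal sketch of one way a region could be routed to one of several shape-specific adapters. It assumes allocation is driven by the region's aspect ratio. The bucket edges, the function name `allocate_adapter`, and the three-way split are hypothetical and are not taken from the paper.

```python
import bisect
import math

# Hypothetical bucket edges on log2(width / height); not from the paper.
# Index 0 = tall regions, 1 = roughly square regions, 2 = wide regions.
LOG_RATIO_EDGES = [-1.0, 1.0]


def allocate_adapter(width, height, edges=LOG_RATIO_EDGES):
    """Return the index of the adapter assigned to a region of this shape."""
    if width <= 0 or height <= 0:
        raise ValueError("region width and height must be positive")
    # Using the log ratio makes tall and wide regions symmetric around 0.
    log_ratio = math.log2(width / height)
    return bisect.bisect_right(edges, log_ratio)


if __name__ == "__main__":
    for w, h in [(10, 40), (30, 30), (80, 20)]:
        print((w, h), "-> adapter", allocate_adapter(w, h))
```

Under this assumption, each bucket would own its own adapter weights, so regions that are distorted in a similar way during RoI extraction share the same correction.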
Findings
SIA-OVD significantly improves classification accuracy on the COCO-OVD benchmark.
The adapter effectively mitigates the image-region gap caused by shape deformation.
Extensive experiments validate the effectiveness of the proposed method.
Abstract
Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations to achieve open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLMs) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in a gap between image and region representations. Directly using CLIP for OVD causes inaccurate region classification. We find the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions…
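The deformation described in the abstract can be illustrated with simple arithmetic. The sketch below assumes RoI extraction resamples every region to a fixed square grid (7 x 7 is a common choice in detectors, used here as an assumption rather than the paper's setting) and reports how unevenly each axis is scaled. The function name `roi_stretch` is hypothetical.

```python
def roi_stretch(roi_h, roi_w, out_size=7):
    """Per-axis scale factors when a roi_h x roi_w region is resampled
    to an out_size x out_size grid, plus the resulting distortion.

    distortion == 1.0 means the aspect ratio is preserved; larger values
    mean one axis is stretched that many times more than the other.
    """
    if roi_h <= 0 or roi_w <= 0:
        raise ValueError("region height and width must be positive")
    scale_y = out_size / roi_h
    scale_x = out_size / roi_w
    distortion = max(scale_x, scale_y) / min(scale_x, scale_y)
    return scale_y, scale_x, distortion


if __name__ == "__main__":
    # A square region keeps its shape; a 4 x 16 region does not.
    print(roi_stretch(7, 7))
    print(roi_stretch(4, 16))
```

A square region yields a distortion of 1.0, while a region four times wider than it is tall is squeezed horizontally four times more than vertically. CLIP's image encoder was trained on images without this kind of uneven rescaling, which is one way to read the image-region gap the abstract refers to.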
Taxonomy
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training · Adapter · ALIGN
