SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection

Zishuo Wang; Wenhao Zhou; Jinglin Xu; Yuxin Peng

arXiv:2410.05650·cs.CV·October 10, 2024

TL;DR

This paper introduces SIA-OVD, a shape-invariant adapter that improves open-vocabulary object detection by bridging the gap between image and region representations caused by shape deformation, leading to better classification accuracy.

Contribution

The paper proposes a novel shape-invariant adapter with an adapter allocation mechanism to align region features with text representations in OVD tasks.
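A minimal sketch of what an adapter allocation step could look like, in plain Python. This is not the paper's method: the allocation rule (bucketing regions by log aspect ratio), the bucket edges, and the toy scale-and-shift adapters are all assumptions chosen for illustration, and the names `BUCKET_EDGES`, `allocate_adapter`, and `adapt` are hypothetical.

```python
import math

# ASSUMPTION: regions are grouped by log2(width / height).
# These edges are illustrative, not taken from the paper.
BUCKET_EDGES = [-1.0, 1.0]          # tall | roughly square | wide
BUCKET_NAMES = ["tall", "square", "wide"]


def allocate_adapter(box):
    """Return the index of the shape bucket for an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")
    log_ratio = math.log2(w / h)
    # Count how many edges the ratio exceeds; that count is the bucket index.
    return sum(log_ratio > edge for edge in BUCKET_EDGES)


def adapt(feature, box, adapters):
    """Apply the adapter allocated to this box as a residual update.

    ASSUMPTION: each adapter is a toy (scale, shift) pair standing in
    for a learned module; a real adapter would be a trained network.
    """
    scale, shift = adapters[allocate_adapter(box)]
    return [f + scale * f + shift for f in feature]


# One placeholder adapter per bucket.
adapters = [(0.1, 0.0), (0.0, 0.0), (-0.1, 0.0)]

print(BUCKET_NAMES[allocate_adapter((0, 0, 10, 40))])  # tall region
print(BUCKET_NAMES[allocate_adapter((0, 0, 80, 20))])  # wide region
print(adapt([1.0, 2.0], (0, 0, 30, 30), adapters))     # square region
```

The residual form keeps the original region feature and adds a shape-dependent correction on top, which is a common pattern for adapters on frozen backbones; whether SIA-OVD uses exactly this form is not stated in the summary above.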

Findings

01 SIA-OVD significantly improves classification accuracy on the COCO-OVD benchmark.

02 The adapter effectively mitigates the image-region gap caused by shape deformation.

03 Extensive experiments validate the effectiveness of the proposed method.

Abstract

Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations, achieving open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLMs) such as CLIP. However, CLIP is trained on image-text pairs and lacks the ability to perceive local regions within an image, which results in a gap between image and region representations. Directly using CLIP for OVD therefore causes inaccurate region classification. We find that the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions…
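The deformation mentioned in the abstract can be made concrete with a little arithmetic. The sketch below assumes a typical RoI extraction step that resamples every box onto a fixed square grid (the grid size of 7 is an assumption, as are the function name `roi_stretch` and the distortion measure); it only computes how unevenly the two axes are sampled and does not reproduce the paper's pipeline.

```python
def roi_stretch(box, out_size=7):
    """Sampling steps and distortion when a box is pooled to a square grid.

    box: (x1, y1, x2, y2) in feature-map or pixel units.
    out_size: side length of the fixed output grid (ASSUMED value).
    Returns (step_x, step_y, distortion), where distortion is the ratio
    of the larger sampling step to the smaller one (1.0 = no deformation).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    if w <= 0 or h <= 0:
        raise ValueError("box must have positive width and height")
    step_x = w / out_size   # source units covered by one output column
    step_y = h / out_size   # source units covered by one output row
    distortion = max(step_x, step_y) / min(step_x, step_y)
    return step_x, step_y, distortion


# A square box is sampled evenly on both axes.
print(roi_stretch((0, 0, 70, 70)))    # (10.0, 10.0, 1.0)

# A 4:1 wide box is squeezed horizontally by a factor of 4
# relative to its vertical axis once it is forced onto a square grid.
print(roi_stretch((0, 0, 140, 35)))   # (20.0, 5.0, 4.0)
```

Under this reading, the further a region's aspect ratio is from 1, the more its pooled feature map differs geometrically from the whole-image features CLIP was trained on, which is the kind of gap a shape-aware adapter is meant to offset.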

Code & Models

Repositories

pku-icst-mipl/sia-ovd_acmmm2024
PyTorch · Official

Taxonomy

Methods

Sparse Evolutionary Training · Contrastive Language-Image Pre-training · Adapter · ALIGN