What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Jincheng Li; Chunyu Xie; Xiaoyu Wu; Bin Wang; Dawei Leng

arXiv:2309.00227·cs.CV·September 4, 2023

TL;DR

This paper dissects the components of open-vocabulary detection, emphasizing the importance of both localization and classification, and proposes methods that improve performance by decoupling and combining these aspects.

Contribution

It introduces a comprehensive analysis of open-vocabulary detection methods, proposing decoupled and coupled approaches that enhance localization and classification performance.

Findings

01

DRR method achieves 35.8 Novel AP$_{50}$ on OVD-COCO

02

DRR surpasses previous SOTA by 1.9 AP$_{50}$ on OVD-LVIS

03

Extensive experiments validate the effectiveness of the proposed approaches

Abstract

Open-vocabulary detection (OVD) is a new object detection paradigm that aims to localize and recognize unseen objects defined by an unbounded vocabulary. This is challenging because traditional detectors learn only from pre-defined categories and thus fail to detect and localize objects outside the pre-defined vocabulary. To address this challenge, OVD leverages pre-trained cross-modal vision-language models (VLMs) such as CLIP and ALIGN. Previous works mainly focus on open-vocabulary classification and pay less attention to localization. We argue that, for a good OVD detector, classification and localization should be studied in parallel for novel object categories. We show in this work that improved localization and cross-modal classification complement each other and jointly compose a good OVD detector. We analyze three families of OVD methods with different design emphases. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training · Region Proposal Network · RoIAlign · ALIGN