DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection
Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, Hang Xu

TL;DR
DetCLIP introduces a knowledge-enriched, dictionary-based pre-training approach for open-world object detection, significantly improving zero-shot detection performance by leveraging concept descriptions and relationships.
Contribution
It proposes a novel parallel concept formulation and a comprehensive concept dictionary to enhance open-world detection and zero-shot learning capabilities.
Findings
DetCLIP-T outperforms GLIP-T by 9.9% mAP on LVIS.
DetCLIP-T achieves a 13.5% improvement on rare LVIS categories over a fully supervised model with the same backbone.
Demonstrates strong zero-shot detection performance.
Abstract
Open-world object detection, as a more general and challenging goal, aims to recognize and localize objects described by arbitrary category names. The recent work GLIP formulates this problem as a grounding problem by concatenating all category names of detection datasets into sentences, which leads to inefficient interaction between category names. This paper presents DetCLIP, a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary. To achieve better learning efficiency, we propose a novel paralleled concept formulation that extracts concepts separately to better utilize heterogeneous datasets (i.e., detection, grounding, and image-text pairs) for training. We further design a concept dictionary (with descriptions) from various online sources and detection datasets to provide prior knowledge for…
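The contrast the abstract draws between sentence-style and paralleled concept inputs can be illustrated with a small sketch. This is not the authors' code: the function names, the toy dictionary, and the "name, description" text format are all assumptions made for illustration, and the sketch only shows how the text inputs might be assembled, not how they are encoded or matched to image regions.

```python
from typing import Dict, List


def sentence_prompt(categories: List[str]) -> str:
    """Grounding-style input: every category name joined into one sentence,
    so all names share a single text sequence."""
    return ". ".join(categories) + "."


def parallel_prompts(categories: List[str], dictionary: Dict[str, str]) -> List[str]:
    """Parallel-style input: one independent text per concept.

    Each name is enriched with a description when the (hypothetical)
    dictionary has one; the "name, description" format is an assumption.
    """
    prompts = []
    for name in categories:
        description = dictionary.get(name)
        prompts.append(f"{name}, {description}" if description else name)
    return prompts


# Toy dictionary invented for this example.
toy_dictionary = {
    "person": "a human being",
    "bicycle": "a two-wheeled vehicle moved by pedals",
}
categories = ["person", "bicycle", "hair drier"]

print(sentence_prompt(categories))
for prompt in parallel_prompts(categories, toy_dictionary):
    print(prompt)
```

In the sentence form, the text for one category depends on which other categories appear alongside it; in the parallel form, each concept is an independent text, so it can be processed separately and, in this sketch, carries its own description.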
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
