Learning Object-Language Alignments for Open-Vocabulary Object Detection
Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai

TL;DR
This paper introduces a new framework for open-vocabulary object detection that learns directly from image-text pairs by formulating object-language alignment as a set matching problem, enabling effective detection of novel categories without extensive annotations.
Contribution
It proposes a novel set matching approach for object-language alignment, allowing training on image-text data for open-vocabulary detection without costly grounding annotations.
Findings
Achieves 32.0% mAP on COCO for novel categories.
Attains 21.7% mask mAP on LVIS for novel categories.
Outperforms previous methods on benchmark datasets.
Abstract
Existing object detection methods are bounded in a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. It enables us to train an open-vocabulary…
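To make the set matching idea concrete, here is a minimal sketch that assigns each word embedding to a distinct image region so that total cosine similarity is maximized. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `match_words_to_regions`), the choice of cosine similarity as the matching score, and the toy vectors are all hypothetical. The search is brute force over every one-to-one assignment, which is only practical for very small sets; a polynomial-time solver such as SciPy's `linear_sum_assignment` would be a natural substitute at realistic sizes.

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_words_to_regions(region_feats, word_embs):
    """Assign each word to a distinct region, maximizing total similarity.

    Brute force over all one-to-one assignments (illustrative only).
    Returns (assignment, score); assignment[i] is the region index
    chosen for word i.
    """
    if len(word_embs) > len(region_feats):
        raise ValueError("need at least as many regions as words")
    # sim[i][j] = similarity between word i and region j
    sim = [[cosine(w, r) for r in region_feats] for w in word_embs]
    best, best_score = None, float("-inf")
    for perm in permutations(range(len(region_feats)), len(word_embs)):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best, best_score


# Toy data (made up): three region features, two word embeddings.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
words = [[0.1, 0.9], [0.9, 0.1]]
assignment, score = match_words_to_regions(regions, words)
print(assignment)  # [1, 0]: word 0 -> region 1, word 1 -> region 0
```

In this toy case the third region is left unmatched, which reflects the usual situation where an image yields more candidate regions than the caption has words.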
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
