Scaling Open-Vocabulary Object Detection

Matthias Minderer; Alexey Gritsenko; Neil Houlsby

arXiv:2306.09683 · cs.CV · May 24, 2024 · 28 cites

TL;DR

This paper introduces the OWLv2 model and the OWL-ST self-training recipe for open-vocabulary object detection. OWL-ST uses an existing detector to generate pseudo-box annotations on Web image-text pairs, which scales detection training to over 1B examples and raises AP on LVIS rare classes from 31.2% to 44.6%.

Contribution

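The paper presents the OWLv2 model and the OWL-ST self-training recipe, which address the main challenges in scaling self-training: the choice of label space, pseudo-annotation filtering, and training efficiency. OWLv2 already surpasses previous state-of-the-art open-vocabulary detectors at comparable training scales (~10M examples), and OWL-ST extends training to Web-scale image-text data.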

Findings

01. OWLv2 outperforms previous state-of-the-art open-vocabulary detectors at comparable training scales (~10M examples).

02. OWL-ST scales self-training to over 1 billion examples, which improves detection of rare classes.

03. AP on LVIS rare classes increased from 31.2% to 44.6%.
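
The size of the gain on LVIS rare classes can be checked with a little arithmetic. The short sketch below computes the absolute and relative change from the two AP values reported above; the function name `ap_gain` is only illustrative and does not come from the paper.

```python
def ap_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in AP points, relative gain as a fraction)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# AP on LVIS rare classes, as reported in the findings above.
absolute, relative = ap_gain(31.2, 44.6)
print(f"absolute gain: {absolute:.1f} AP points")
print(f"relative gain: {relative:.0%}")
```

The absolute gain is 13.4 AP points, which is roughly a 43% relative improvement over the 31.2% starting point.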

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,…
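
The self-training idea described in the abstract (an existing detector produces pseudo-box annotations on image-text pairs, which are then filtered) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `detector` callable, the `PseudoBox` type, the use of caption n-grams as the label space, and the `min_score` threshold of 0.3 are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class PseudoBox:
    box: Box
    label: str    # text query that the box was matched to
    score: float  # detector confidence in [0, 1]


def caption_ngrams(caption: str, max_n: int = 3) -> List[str]:
    """Assumed label space: all word n-grams of the caption up to max_n."""
    words = caption.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


# Hypothetical detector interface: (image, text queries) -> scored boxes.
Detector = Callable[[object, List[str]], List[PseudoBox]]


def pseudo_annotate(
    pairs: Iterable[Tuple[object, str]],
    detector: Detector,
    min_score: float = 0.3,  # assumed filtering threshold
) -> List[Tuple[object, List[PseudoBox]]]:
    """Run the detector on each image-text pair and keep confident boxes.

    Images with no box at or above min_score are dropped entirely.
    """
    annotated = []
    for image, caption in pairs:
        queries = caption_ngrams(caption)
        kept = [b for b in detector(image, queries) if b.score >= min_score]
        if kept:
            annotated.append((image, kept))
    return annotated


def toy_detector(image: object, queries: List[str]) -> List[PseudoBox]:
    """Stand-in detector for the example: only confident about 'car'."""
    return [
        PseudoBox((0.0, 0.0, 1.0, 1.0), q, 0.9 if q == "car" else 0.1)
        for q in queries
    ]


pairs = [("img1", "a red car"), ("img2", "blue sky")]
for image, boxes in pseudo_annotate(pairs, toy_detector):
    print(image, [(b.label, b.score) for b in boxes])
```

In this toy run only `img1` survives filtering, with a single pseudo-box labeled `car`. In a real pipeline the filtered pairs would become training data for the next detector; the threshold trades annotation precision against the amount of data kept.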

Code & Models

Videos

Scaling Open-Vocabulary Object Detection · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques