Scaling Open-Vocabulary Object Detection
Matthias Minderer, Alexey Gritsenko, Neil Houlsby

TL;DR
This paper introduces the OWLv2 model and the OWL-ST self-training recipe for open-vocabulary object detection. By pseudo-labeling Web image-text pairs with an existing detector, OWL-ST scales detection training to over 1 billion examples and substantially improves performance on rare classes.
Contribution
The paper presents the OWLv2 model and the OWL-ST self-training recipe, which together make open-vocabulary detection training scalable on Web image-text data and surpass the previous state of the art.
Findings
OWLv2 outperforms previous state-of-the-art open-vocabulary detectors at comparable training scales (~10M examples).
OWL-ST scales training to over 1 billion examples, greatly improving rare class detection.
With an L/14 architecture, OWL-ST raises AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6%.
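To put the rare-class numbers in perspective, the gain can be expressed in relative terms. The short sketch below only redoes the arithmetic on the two AP values quoted above; the function name is an illustrative choice and does not come from the paper.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# AP values on LVIS rare classes, as quoted in the findings above.
ap_before = 31.2
ap_after = 44.6

absolute_gain = ap_after - ap_before
relative_gain = relative_improvement(ap_before, ap_after)

print(f"absolute gain: {absolute_gain:.1f} AP points")  # 13.4 AP points
print(f"relative gain: {relative_gain:.1f}%")           # 42.9%
```

In other words, the 13.4-point absolute gain corresponds to roughly a 43% relative improvement.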
Abstract
Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,…
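The self-training loop described in the abstract (an existing detector produces pseudo-box annotations for image-text pairs, which are then filtered) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the label space is built from caption n-grams and that filtering is a fixed score threshold. The names `PseudoBox`, `caption_ngrams`, `pseudo_annotate`, and `toy_detector`, as well as the threshold value, are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class PseudoBox:
    query: str                               # text query the box was detected for
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), normalized
    score: float                             # detector confidence in [0, 1]


def caption_ngrams(caption: str, max_n: int = 3) -> List[str]:
    """Build a candidate label space from all word n-grams of the caption."""
    words = caption.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    # Drop duplicates while preserving first-seen order.
    return list(dict.fromkeys(ngrams))


def pseudo_annotate(
    image: Any,
    caption: str,
    detector: Callable[[Any, List[str]], List[PseudoBox]],
    min_score: float = 0.3,  # assumed threshold, chosen only for illustration
) -> List[PseudoBox]:
    """Query the detector with caption n-grams and keep confident boxes only."""
    queries = caption_ngrams(caption)
    return [b for b in detector(image, queries) if b.score >= min_score]


def toy_detector(image: Any, queries: List[str]) -> List[PseudoBox]:
    """Stand-in for a real open-vocabulary detector; returns canned scores."""
    canned = {"dog": 0.9, "red ball": 0.5, "a": 0.05}
    return [
        PseudoBox(q, (0.1, 0.1, 0.5, 0.5), canned[q])
        for q in queries
        if q in canned
    ]


kept = pseudo_annotate(None, "a dog chasing a red ball", toy_detector)
for b in kept:
    print(b.query, b.score)
```

In a real pipeline, `toy_detector` would be replaced by an actual detector, and the surviving boxes would serve as training targets for the next model. The low-scoring query "a" is discarded here, which illustrates why filtering matters when the label space is generated automatically.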
Code & Models
- 🤗 google/owlv2-base-patch16 (model): 81k downloads, 30 likes
- 🤗 google/owlv2-base-patch16-ensemble (model): 978k downloads, 117 likes
- 🤗 google/owlv2-base-patch16-finetuned (model): 1.1k downloads, 3 likes
- 🤗 google/owlv2-large-patch14 (model): 936 downloads, 9 likes
- 🤗 google/owlv2-large-patch14-ensemble (model): 228k downloads, 37 likes
- 🤗 google/owlv2-large-patch14-finetuned (model): 752 downloads, 6 likes
- 🤗 facebook/dpt-dinov2-small-nyu (model): 135 downloads, 3 likes
- 🤗 upfeatmediainc/owlv2-base-patch16-ensemble (model): 9 downloads
- 🤗 Thomasboosinger/owlv2-base-patch16-ensemble (model): 12 downloads
- 🤗 Thomasboosinger/owlv2-large-patch14-ensemble (model): 11 downloads
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
