Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui

TL;DR
This paper introduces ViLD, a method for open-vocabulary object detection that distills vision and language knowledge from a pretrained open-vocabulary image classification model into a two-stage detector, enabling it to detect categories not seen during training.
Contribution
The paper proposes ViLD, a training approach that distills knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student), so that the detector can recognize objects described by arbitrary text inputs.
Findings
With rare LVIS categories held out as novel, ViLD achieves 16.1 mask AP_r with a ResNet-50 backbone, outperforming its supervised counterpart by 3.8.
With ALIGN as a stronger teacher model, ViLD reaches 26.3 AP_r.
The model transfers directly to other datasets without fine-tuning, reaching 72.2 AP50 on PASCAL VOC, 36.6 AP on COCO, and 11.8 AP on Objects365.
Abstract
We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD…
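The alignment described in the abstract can be illustrated with a minimal sketch. The abstract states only that the student's region embeddings are aligned with the teacher's text and image embeddings; the specific loss forms below (softmax cross-entropy over cosine similarities for the text side, L1 distance for the image side), the background embedding, the temperature value, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def text_alignment_loss(region_emb, text_embs, background_emb, label,
                        temperature=0.01):
    """Assumed text-side loss: cross-entropy over scaled cosine similarities.

    Index 0 is a background embedding (hypothetical here); indices 1..N are
    the teacher's text embeddings of the training categories.
    """
    candidates = [background_emb] + list(text_embs)
    logits = [cosine(region_emb, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def image_distillation_loss(region_emb, teacher_image_emb):
    """Assumed image-side loss: L1 distance between unit-normalized embeddings."""
    s = l2_normalize(region_emb)
    t = l2_normalize(teacher_image_emb)
    return sum(abs(x - y) for x, y in zip(s, t))


def classify(region_emb, text_embs, background_emb):
    """Pick the best-matching candidate index (0 means background).

    At inference, text_embs may include categories never seen in training,
    which is what makes the detector open-vocabulary.
    """
    candidates = [background_emb] + list(text_embs)
    sims = [cosine(region_emb, c) for c in candidates]
    return max(range(len(sims)), key=sims.__getitem__)
```

Because classification reduces to comparing a region embedding against text embeddings, a novel category can be added at inference by appending its text embedding to `text_embs`, with no retraining of the detector in this sketch.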
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Region Proposal Network · Convolution · RoIAlign · Softmax · Mask R-CNN
