Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu; Tsung-Yi Lin; Weicheng Kuo; Yin Cui

arXiv:2104.13921 · cs.CV · May 13, 2022 · 280 citations

TL;DR

This paper introduces ViLD, a novel method for open-vocabulary object detection that leverages vision and language knowledge distillation from pretrained models, enabling detection of unseen categories with high accuracy.

Contribution

The paper proposes ViLD, a new training approach that distills knowledge from pretrained open-vocabulary classifiers into a detector, significantly improving open-vocabulary detection performance.

Findings

01 ViLD achieves 16.1 mask AP_r on LVIS with a ResNet-50 backbone, outperforming its supervised counterpart.

02 With ALIGN as a stronger teacher model, ViLD reaches 26.3 AP_r.

03 The model transfers well to other datasets, achieving high AP scores without fine-tuning.
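
One way to picture transfer without fine-tuning: if region embeddings are scored against text embeddings of category names, changing datasets amounts to swapping the set of text embeddings. The sketch below is illustrative only. The function names, the toy 3-dimensional vectors, and the two vocabularies are hypothetical; in a real system the text embeddings would come from the teacher's text encoder and the region embedding from the trained detector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_emb, vocab):
    """Return the category name whose text embedding is most similar.

    `vocab` maps a category name to its (hypothetical) text embedding.
    """
    return max(vocab, key=lambda name: cosine(region_emb, vocab[name]))


# Toy stand-ins for text embeddings of two different label sets (assumed values).
training_vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
new_dataset_vocab = {"person": [0.0, 0.0, 1.0], "bicycle": [0.7, 0.7, 0.0]}

# Toy stand-in for one region embedding produced by the detector.
region_emb = [0.6, 0.5, 0.1]

# The detector is unchanged; only the vocabulary passed in differs.
print(classify_region(region_emb, training_vocab))
print(classify_region(region_emb, new_dataset_vocab))
```

The design point is that no weights are updated between the two calls: the category set lives entirely in the dictionary of text embeddings.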

Abstract

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD…
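
The alignment described in the abstract can be sketched as two loss terms: one pulling a student region embedding toward the teacher's text embedding of its category, and one pulling it toward the teacher's image embedding of the same proposal. The abstract does not state the loss forms, so the choices here are assumptions: softmax cross-entropy over temperature-scaled cosine similarities for the text term, mean L1 distance for the image term, and the `temperature` and `image_weight` values. All names and vectors are hypothetical toy values.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def text_alignment_loss(region_emb, text_embs, label, temperature=0.01):
    """Cross-entropy over cosine similarities to each category text embedding.

    Assumed form: the abstract only says the embeddings are 'aligned'.
    """
    r = l2_normalize(region_emb)
    logits = []
    for t in text_embs:
        t_unit = l2_normalize(t)
        logits.append(sum(x * y for x, y in zip(r, t_unit)) / temperature)
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def image_alignment_loss(region_emb, teacher_image_emb):
    """Mean L1 distance between unit-normalized student and teacher embeddings."""
    s = l2_normalize(region_emb)
    t = l2_normalize(teacher_image_emb)
    return sum(abs(x - y) for x, y in zip(s, t)) / len(s)


# Toy teacher outputs (assumed values): text embeddings for two base categories
# and the teacher's image embedding of one cropped proposal.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_image_emb = [0.9, 0.1, 0.0]

# Toy student region embedding for the same proposal, labeled as category 0.
student_region_emb = [0.8, 0.2, 0.1]

image_weight = 0.5  # hypothetical weighting between the two terms
total_loss = (
    text_alignment_loss(student_region_emb, text_embs, label=0)
    + image_weight * image_alignment_loss(student_region_emb, teacher_image_emb)
)
print(total_loss)
```

In a real detector both terms would be computed over batches of proposals and backpropagated through the student's region head; the teacher stays frozen.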

Videos

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Region Proposal Network · Convolution · RoIAlign · Softmax · Mask R-CNN