Zero-shot Object Detection Through Vision-Language Embedding Alignment
Johnathan Xie, Shuai Zheng

TL;DR
This paper introduces a vision-language embedding alignment method that enables zero-shot object detection by transferring the generalization capabilities of pretrained models like CLIP to detectors such as YOLOv5, achieving state-of-the-art results on the COCO, ILSVRC, and Visual Genome zero-shot detection benchmarks.
Contribution
The authors propose a novel loss function for aligning image and text embeddings with object detectors, allowing zero-shot detection without additional training on new classes.
Findings
Achieves state-of-the-art zero-shot detection on COCO, ILSVRC, and Visual Genome datasets.
Standard object-detection model scaling transfers well to the method, improving performance across different YOLO models.
Self-labeling enhances detection scores without extra data or labels.
Abstract
Recent approaches have shown that training deep neural networks directly on large-scale image-text pair collections enables zero-shot transfer on various recognition tasks. One central issue is how this can be generalized to object detection, which involves the non-semantic task of localization as well as the semantic task of classification. To solve this problem, we introduce a vision-language embedding alignment method that transfers the generalization capabilities of a pretrained model such as CLIP to an object detector like YOLOv5. We formulate a loss function that allows us to align the image and text embeddings from the pretrained model CLIP with the modified semantic prediction head from the detector. With this method, we are able to train an object detector that achieves state-of-the-art performance on the COCO, ILSVRC, and Visual Genome zero-shot detection benchmarks. During…
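The alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact loss, so the two terms below (an L1 term against an image embedding and a softmax cross-entropy over cosine similarities to text embeddings) are assumptions chosen for illustration, as are all function names, the temperature value, and the toy vectors standing in for CLIP embeddings. The sketch also shows why such alignment permits zero-shot classification: a new class only needs a text embedding to compare against.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def image_alignment_loss(pred_embedding, image_embedding):
    """Assumed form: mean L1 distance between the normalized detector-head
    embedding and the normalized image embedding of the same object crop."""
    p, q = l2_normalize(pred_embedding), l2_normalize(image_embedding)
    return sum(abs(x - y) for x, y in zip(p, q)) / len(p)


def text_alignment_loss(pred_embedding, text_embeddings, target_index,
                        temperature=0.1):
    """Assumed form: cross-entropy over temperature-scaled cosine
    similarities between the predicted embedding and each class text
    embedding. The temperature value is hypothetical."""
    logits = [cosine_similarity(pred_embedding, t) / temperature
              for t in text_embeddings]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def zero_shot_classify(pred_embedding, class_text_embeddings):
    """Pick the class whose text embedding is most similar to the
    predicted embedding. Classes unseen in training can be added by
    supplying their text embeddings."""
    scores = {name: cosine_similarity(pred_embedding, emb)
              for name, emb in class_text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional stand-ins for real text embeddings (hypothetical values).
class_text = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
label, scores = zero_shot_classify([0.9, 0.1, 0.0], class_text)
print(label)
```

In a real pipeline the embeddings would come from the pretrained model's image and text encoders, and the detector's semantic head would be trained so that its output for each box falls in that same embedding space.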
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Softmax · Batch Normalization · Average Pooling · Global Average Pooling · Residual Connection · Convolution · k-Means Clustering · 1x1 Convolution · Logistic Regression
