Localized Vision-Language Matching for Open-vocabulary Object Detection

Maria A. Bravo; Sudhanshu Mittal; Thomas Brox

arXiv:2205.06160 · cs.CV · July 29, 2022 · 1 citation

TL;DR

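This paper introduces an open-vocabulary object detection method trained in two stages: it first learns class labels for novel and known classes from image-caption pairs through location-guided matching, then specializes the detector using known-class annotations. It relies on a simple language model, adds a consistency-regularization technique, and is data-efficient.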

Contribution

It proposes a location-guided image-caption matching approach and a consistency-regularization technique for open-vocabulary object detection, and it compares favorably to existing methods.

Findings

01 A simple language model fits better than a large contextualized language model for detecting novel objects.

02 The method is more data-efficient than previous approaches.

03 It achieves favorable results compared to existing open-vocabulary detection methods.

Abstract

In this work, we propose an open-vocabulary object detection method that, based on image-caption pairs, learns to detect novel object classes along with a given set of known classes. It is a two-stage training approach that first uses a location-guided image-caption matching technique to learn class labels for both novel and known classes in a weakly-supervised manner and second specializes the model for the object detection task using known class annotations. We show that a simple language model fits better than a large contextualized language model for detecting novel objects. Moreover, we introduce a consistency-regularization technique to better exploit image-caption pair information. Our method compares favorably to existing open-vocabulary detection approaches while being data-efficient. Source code is available at https://github.com/lmb-freiburg/locov.
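
To make the idea of image-caption matching concrete, here is a minimal sketch of a generic region-word contrastive matching loss. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pair_score`, `matching_loss`), the attention-style aggregation, and the toy embeddings are all hypothetical, and the sketch covers neither the location-guided component nor the consistency regularization described in the abstract. The official code is in the lmb-freiburg/locov repository.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pair_score(regions, words):
    """Score one image against one caption (hypothetical aggregation).

    Each word attends over all region embeddings; the score is the
    attention-weighted similarity, averaged over the words.
    """
    total = 0.0
    for w in words:
        sims = [dot(r, w) for r in regions]
        attn = softmax(sims)
        total += sum(a * s for a, s in zip(attn, sims))
    return total / len(words)


def matching_loss(batch_regions, batch_words):
    """Symmetric contrastive loss over a batch of image-caption pairs.

    Pair i (image i, caption i) is the positive; every other
    combination in the batch acts as a negative.
    """
    n = len(batch_regions)
    scores = [
        [pair_score(batch_regions[i], batch_words[j]) for j in range(n)]
        for i in range(n)
    ]
    loss = 0.0
    for i in range(n):
        image_to_caption = softmax(scores[i])
        caption_to_image = softmax([scores[j][i] for j in range(n)])
        loss -= math.log(image_to_caption[i]) + math.log(caption_to_image[i])
    return loss / (2 * n)


# Toy data: two images with two region embeddings each, one-word captions.
regions_a = [[1.0, 0.0], [0.9, 0.1]]
regions_b = [[0.0, 1.0], [0.1, 0.9]]
words_a = [[1.0, 0.0]]
words_b = [[0.0, 1.0]]

aligned = matching_loss([regions_a, regions_b], [words_a, words_b])
swapped = matching_loss([regions_a, regions_b], [words_b, words_a])
print(aligned < swapped)
```

In this toy setup the correctly paired batch yields a lower loss than the batch with swapped captions, which is the signal that lets a model learn region-word alignment from caption-level supervision alone.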

Code & Models

Repositories

lmb-freiburg/locov (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques