Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Hanoona Rasheed; Muhammad Maaz; Muhammad Uzair Khattak; Salman Khan; Fahad Shahbaz Khan

arXiv:2207.03482·cs.CV·November 30, 2022·70 cites

TL;DR

This paper introduces a novel approach to open-vocabulary detection that aligns object and image-level representations using object-centric alignment of CLIP embeddings and pseudo-labeling, significantly improving detection performance on COCO and LVIS benchmarks.

Contribution

It proposes a new method combining object-centric alignment of CLIP embeddings with pseudo-labeling for image-level supervision, bridging the gap between object and image representations in open-vocabulary detection.

Findings

01. Achieves 36.6 AP50 on COCO novel classes, 8.2 points above the previous best.

02. Surpasses the state-of-the-art ViLD model by 5.0 mask AP on LVIS rare categories.

03. Effectively reduces the gap between object-centric and image-centric representations.

Abstract

Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps them generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) are the pretrained CLIP model and image-level supervision. We note that neither mode of supervision is optimally aligned with the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object…
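To make the idea of classifying regions against language embeddings concrete, the following is a minimal sketch of the scoring step commonly used in open-vocabulary detectors: each region embedding is compared with one text embedding per class name by cosine similarity, and a temperature-scaled softmax picks the class. It is not the paper's implementation. The toy 3-dimensional vectors stand in for real CLIP embeddings, and the function names and the `temperature` value are assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_regions(region_embs, text_embs, class_names, temperature=0.01):
    """Assign each region the class whose text embedding is most similar.

    region_embs: one embedding per candidate box (hypothetical detector output).
    text_embs:   one embedding per class name (stand-ins for CLIP text features).
    temperature: assumed scaling factor; smaller values sharpen the softmax.
    """
    text_unit = [l2_normalize(t) for t in text_embs]
    results = []
    for region in region_embs:
        region_unit = l2_normalize(region)
        # Cosine similarity is the dot product of unit vectors.
        sims = [sum(a * b for a, b in zip(region_unit, t)) for t in text_unit]
        probs = softmax([s / temperature for s in sims])
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((class_names[best], probs[best]))
    return results


# Toy data: the vocabulary can be extended by appending a class name and its
# text embedding, without retraining the region encoder.
class_names = ["cat", "dog", "umbrella"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_embs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

predictions = classify_regions(region_embs, text_embs, class_names)
for name, prob in predictions:
    print(f"{name}: {prob:.3f}")
```

Because the classifier weights are just text embeddings, this scoring only works well when region embeddings live in the same space as the text embeddings, which is the alignment problem the abstract describes.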

Code & Models

Repositories

mmaaz60/mvits_for_class_agnostic_od
pytorch

Videos

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Multiscale Attention ViT with Late fusion · Contrastive Language-Image Pre-training