Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan

TL;DR
This paper introduces a novel approach to open-vocabulary detection that aligns object and image-level representations using object-centric alignment of CLIP embeddings and pseudo-labeling, significantly improving detection performance on COCO and LVIS benchmarks.
Contribution
It proposes a new method combining object-centric alignment of CLIP embeddings with pseudo-labeling for image-level supervision, bridging the gap between object and image representations in open-vocabulary detection.
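As a rough illustration of why aligning region features with CLIP's language embeddings matters, the sketch below shows the general classification step used by CLIP-based open-vocabulary detectors: each region embedding is scored against the text embedding of every class name, so adding a new class only requires a new text embedding. This is not the paper's implementation. The function names, the temperature value, and the toy 3-dimensional vectors are all assumptions made for illustration; real CLIP embeddings have hundreds of dimensions and come from the pretrained encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_regions(region_embeddings, class_text_embeddings, temperature=0.01):
    """Score every region embedding against every class text embedding.

    Returns one {class_name: probability} dict per region, computed as a
    softmax over cosine similarity divided by `temperature` (assumed value).
    """
    class_names = list(class_text_embeddings)
    text = [l2_normalize(class_text_embeddings[name]) for name in class_names]
    results = []
    for region in region_embeddings:
        r = l2_normalize(region)
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in text]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - top) for logit in logits]
        total = sum(exps)
        results.append({name: e / total for name, e in zip(class_names, exps)})
    return results


# Toy 3-d vectors standing in for CLIP text and region features (made up).
toy_text = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "umbrella": [0.0, 0.0, 1.0],
}
toy_regions = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

probabilities = classify_regions(toy_regions, toy_text)
predictions = [max(p, key=p.get) for p in probabilities]
print(predictions)  # ['cat', 'umbrella']
```

The scoring only works if region embeddings and text embeddings live in the same space. Because CLIP is trained on whole images rather than object regions, that alignment is exactly the gap the paper sets out to close.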
Findings
Achieves 36.6 AP50 on COCO novel classes, 8.2 points higher than the previous best.
Surpasses the state-of-the-art ViLD model by 5.0 mask AP on LVIS rare categories.
Effectively reduces the gap between object-centric and image-centric representations.
Abstract
Existing open-vocabulary object detectors typically enlarge their vocabulary sizes by leveraging different forms of weak supervision. This helps generalize to novel objects at inference. Two popular forms of weak supervision used in open-vocabulary detection (OVD) include a pretrained CLIP model and image-level supervision. We note that both these modes of supervision are not optimally aligned for the detection task: CLIP is trained with image-text pairs and lacks precise localization of objects, while the image-level supervision has been used with heuristics that do not accurately specify local object regions. In this work, we propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model. Furthermore, we visually ground the objects with only image-level supervision using a pseudo-labeling process that provides high-quality object…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Multiscale Attention ViT with Late fusion · Contrastive Language-Image Pre-training
