Sampling Bag of Views for Open-Vocabulary Object Detection

Hojun Choi; Junsuk Choe; Hyunjung Shim

arXiv:2412.18273·cs.CV·December 25, 2024

Open Access

TL;DR

This paper introduces a concept-based alignment method for open-vocabulary object detection. It groups contextually related concepts into a bag, which captures each region's context more reliably than earlier bag-of-regions alignment and cuts CLIP computation.

Contribution

The paper proposes a concept grouping and scaling approach within a bag-of-views framework, so that the sampled compositional structure is less noisy for open-vocabulary detection.

Findings

01

Improves on prior work by 2.6 AP50 and 0.5 mask AP on the open-vocabulary COCO and LVIS benchmarks, respectively.

02

Reduces CLIP computation, measured in FLOPs, by 80.3%, so roughly one fifth of the original cost remains.

03

Outperforms previous state-of-the-art models on open-vocabulary detection datasets.

Abstract

Existing open-vocabulary object detection (OVD) methods handle unseen categories at test time by aligning object region embeddings with the corresponding VLM features. A recent study leverages the idea that VLMs implicitly learn compositional structures of semantic concepts within the image. Instead of using an individual region embedding, it uses a bag of region embeddings as a new representation to incorporate compositional structures into the OVD task. However, this approach often fails to capture the contextual concepts of each region, leading to noisy compositional structures. This results in only marginal performance improvements and reduced efficiency. To address this, we propose a novel concept-based alignment method that samples a more powerful and efficient compositional structure. Our approach groups contextually related "concepts" into a bag and adjusts the scale of…
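To make the idea of aligning a bag of region embeddings with VLM features concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes, purely for illustration, that a bag is summarized by mean-pooling its region embeddings and that alignment uses an InfoNCE-style contrastive loss. The function names (`bag_alignment_loss`, `mean_pool`) and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_pool(vectors):
    """Average a list of equal-length vectors (assumed bag summary)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bag_alignment_loss(student_bags, teacher_embeddings, temperature=0.1):
    """InfoNCE-style loss between bag embeddings and VLM (teacher) embeddings.

    student_bags: list of bags, each a list of detector region embeddings.
    teacher_embeddings: one VLM embedding per bag, in the same order.
    Bag i is the positive for teacher i; all other teachers are negatives.
    """
    students = [l2_normalize(mean_pool(bag)) for bag in student_bags]
    teachers = [l2_normalize(t) for t in teacher_embeddings]

    total = 0.0
    for i, s in enumerate(students):
        logits = [dot(s, t) / temperature for t in teachers]
        # Numerically stable log-sum-exp over all teachers.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(students)


if __name__ == "__main__":
    # Two toy bags of 2-D region embeddings (made-up numbers).
    bags = [
        [[1.0, 0.0], [0.9, 0.1]],
        [[0.0, 1.0], [0.1, 0.9]],
    ]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched loss:", bag_alignment_loss(bags, matched))
    print("swapped loss:", bag_alignment_loss(bags, swapped))
```

In this toy setup the loss is small when each bag lines up with its own teacher embedding and large when the teachers are swapped. A bag that mixes unrelated regions would pool into an embedding that matches no teacher well, which is one way to read the abstract's point about noisy compositional structures.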

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Softmax · RoIPool · Convolution · Region Proposal Network · Faster R-CNN · Contrastive Language-Image Pre-training