Sampling Bag of Views for Open-Vocabulary Object Detection
Hojun Choi, Junsuk Choe, Hyunjung Shim

TL;DR
This paper introduces a concept-based alignment method using a bag of related concepts for open-vocabulary object detection, improving accuracy and efficiency by better capturing contextual information and reducing computational costs.
Contribution
It proposes a novel concept grouping and scaling approach within a bag of views framework, enhancing compositional structure modeling for open-vocabulary detection.
Findings
Achieves improvements of 2.6 AP50 and 0.5 mask AP on the COCO and LVIS benchmarks, respectively.
Reduces CLIP computation, measured in FLOPs, by 80.3%.
Outperforms previous state-of-the-art models on open-vocabulary detection datasets.
Abstract
Existing open-vocabulary object detection (OVD) methods handle unseen categories at test time by aligning object region embeddings with corresponding VLM features. A recent study leverages the idea that VLMs implicitly learn compositional structures of semantic concepts within the image. Instead of using an individual region embedding, it uses a bag of region embeddings as a new representation to incorporate compositional structures into the OVD task. However, this approach often fails to capture the contextual concepts of each region, leading to noisy compositional structures. This results in only marginal performance improvements and reduced efficiency. To address this, we propose a novel concept-based alignment method that samples a more powerful and efficient compositional structure. Our approach groups contextually related "concepts" into a bag and adjusts the scale of…
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Softmax · RoIPool · Convolution · Region Proposal Network · Faster R-CNN · Contrastive Language-Image Pre-training
