MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Kuo Wang; Lechao Cheng; Weikai Chen; Pingping Zhang; Liang Lin; Fan Zhou; Guanbin Li

arXiv:2407.21465·cs.CV·August 1, 2024

TL;DR

MarvelOVD introduces a novel approach that combines object detection and vision-language models to improve open-vocabulary detection by refining pseudo-labels and addressing bias through online learning and stratified label assignment.

Contribution

The paper proposes MarvelOVD, a new paradigm that enhances pseudo-label quality and mitigates bias in open-vocabulary detection by integrating detector guidance with vision-language models.

Findings

01. Outperforms state-of-the-art methods on COCO and LVIS datasets.

02. Effectively reduces noisy pseudo-labels through Online Mining and Adaptive Reweighting.

03. Addresses the base-novel-conflict problem with stratified label assignments.
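
As a rough illustration of the mining-and-reweighting idea in finding 02, the sketch below filters candidate pseudo-labels and turns the survivors into loss weights. It is not the paper's procedure: the `PseudoBox` structure, the geometric-mean fusion of the two confidences, and the `keep_threshold` value are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PseudoBox:
    """Hypothetical container for one candidate pseudo-label."""
    label: str
    vlm_score: float       # confidence from the vision-language model, in [0, 1]
    detector_score: float  # the detector's own confidence for the same box, in [0, 1]


def fuse(box: PseudoBox) -> float:
    # Assumption: geometric mean of the two scores. The paper may fuse them differently.
    return (box.vlm_score * box.detector_score) ** 0.5


def mine_and_reweight(
    boxes: List[PseudoBox], keep_threshold: float = 0.5
) -> List[Tuple[PseudoBox, float]]:
    """Drop boxes whose fused score is below the threshold, then normalise
    the surviving fused scores into per-box loss weights that sum to 1."""
    kept = [(b, fuse(b)) for b in boxes if fuse(b) >= keep_threshold]
    total = sum(score for _, score in kept)
    return [(b, score / total) for b, score in kept] if kept else []
```

In this toy setup, a box that the VLM scores highly but the detector scores poorly falls below the threshold and is discarded, which is the intuition behind letting the detector guide the VLM's pseudo-labels.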

Abstract

Learning from pseudo-labels generated with Vision-Language Models (VLMs) has been shown in recent studies to be a promising way to assist open-vocabulary detection (OVD). However, because of the domain gap between VLMs and vision-detection tasks, pseudo-labels produced by VLMs are prone to noise, and the training design of the detector further amplifies the bias. In this work, we investigate the root cause of VLMs' biased predictions in the OVD context. Our observations lead to a simple yet effective paradigm, named MarvelOVD, that generates significantly better training targets and optimizes the learning procedure in an online manner by marrying the capability of the detector with the vision-language model. Our key insight is that the detector itself can act as strong auxiliary guidance to compensate for the VLM's inability to understand both the "background" and the…

Code & Models

Repositories

wkfdb/marvelovd
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques