Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

Yifan Xu; Mengdan Zhang; Xiaoshan Yang; Changsheng Xu

arXiv:2308.15846·cs.CV·August 31, 2023

Open Access

TL;DR

This paper introduces MMC-Det, a novel framework that leverages multi-modal contextual knowledge through distillation to improve open-vocabulary object detection, effectively understanding novel categories by integrating visual and language cues.

Contribution

The paper proposes a multi-modal contextual knowledge distillation framework, MMC-Det, which transfers knowledge from a teacher transformer to a student detector for enhanced open-vocabulary detection.

Findings

01. Outperforms recent state-of-the-art methods on various datasets.

02. Effectively incorporates multi-modal contextual knowledge into object detection.

03. Demonstrates significant improvements in detecting novel categories.

Abstract

In this paper, we for the first time explore helpful multi-modal contextual knowledge to understand novel categories for open-vocabulary object detection (OVD). The multi-modal contextual knowledge stands for the joint relationship across regions and words. However, it is challenging to incorporate such multi-modal contextual knowledge into OVD. The reason is that previous detection frameworks fail to jointly model multi-modal contextual knowledge, as object detectors only support vision inputs and no caption description is provided at test time. To this end, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer with diverse multi-modal masked language modeling (D-MLM) to a student detector. The diverse multi-modal masked language modeling is realized by an object divergence…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Knowledge Distillation