Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
Yifan Xu, Mengdan Zhang, Xiaoshan Yang, Changsheng Xu

TL;DR
This paper introduces MMC-Det, a framework that distills multi-modal contextual knowledge into an object detector to improve open-vocabulary object detection, helping the detector recognize novel categories by integrating visual and language cues.
Contribution
The paper proposes MMC-Det, a multi-modal contextual knowledge distillation framework that transfers contextual knowledge from a teacher fusion transformer, trained with diverse multi-modal masked language modeling (D-MLM), to a student detector for open-vocabulary detection.
Findings
Outperforms recent state-of-the-art methods on various datasets.
Effectively incorporates multi-modal contextual knowledge into object detection.
Demonstrates significant improvements in detecting novel categories.
Abstract
In this paper, we for the first time explore helpful multi-modal contextual knowledge to understand novel categories for open-vocabulary object detection (OVD). The multi-modal contextual knowledge stands for the joint relationship across regions and words. However, it is challenging to incorporate such multi-modal contextual knowledge into OVD. The reason is that previous detection frameworks fail to jointly model multi-modal contextual knowledge, as object detectors only support vision inputs and no caption description is provided at test time. To this end, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer with diverse multi-modal masked language modeling (D-MLM) to a student detector. The diverse multi-modal masked language modeling is realized by an object divergence…
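As a rough illustration of the teacher-to-student transfer the abstract describes, the sketch below computes a generic feature-distillation loss between per-region embeddings. It is not the MMC-Det objective: the abstract is truncated before the loss is specified, so the cosine-distance formulation, the function names (`cosine_similarity`, `distillation_loss`), and the plain-list feature format are all assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def distillation_loss(student_feats, teacher_feats):
    """Mean cosine distance between matched student and teacher region embeddings.

    Hypothetical stand-in for a distillation term: each row is one region's
    embedding, and the student is pushed toward the teacher's representation.
    """
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must cover the same regions")
    if not student_feats:
        return 0.0
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    )
    return total / len(student_feats)
```

In this sketch the teacher is only needed to produce target embeddings during training. The student can then run on images alone, which matches the abstract's point that no caption is available at test time.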
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Knowledge Distillation
