Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection
Xiangyu Gao, Yu Dai, Benliu Qiu, Lanxiao Wang, Heqian Qiu, Hongliang Li

TL;DR
This paper introduces VMCNet, a novel two-branch network that combines trainable CNN features with frozen pre-trained ViT representations to improve open-vocabulary object detection, especially for novel categories.
Contribution
VMCNet modulates trainable CNN features with frozen pre-trained ViT representations, with the aim of improving detection of unseen object categories over existing methods.
Findings
Achieves state-of-the-art performance on OV-COCO and OV-LVIS benchmarks.
Significantly improves detection AP for novel categories.
Demonstrates the effectiveness of combining trainable CNNs with frozen ViT features.
Abstract
Owing to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) such as CLIP show superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to use pre-trained VLMs to obtain generalized representations. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, its frozen backbone cannot benefit from the labeled data to strengthen the representation for detection. Therefore, we propose a novel two-branch backbone network, named ViT-Feature-Modulated Multi-Scale Convolutional Network (VMCNet), which consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a VMC module. The trainable CNN branch can be optimized with labeled data, while the frozen pre-trained ViT branch can keep the representation ability…
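To make the two-branch idea in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the VMCNet implementation: each "branch" is reduced to a single scalar weight, and the modulation step is an assumed FiLM-style rescaling chosen only for illustration, since the abstract above is truncated before it describes the VMC module. All function and parameter names (`cnn_branch`, `vit_branch`, `modulate`, `gain`, `train`) are hypothetical. The only point it shows is the training pattern: the CNN-side weight and the modulation parameter receive gradient updates, while the ViT-side weight stays fixed.

```python
def cnn_branch(x, w_cnn):
    # Stand-in for the trainable convolutional branch (one learnable weight).
    return [w_cnn * v for v in x]


def vit_branch(x, w_vit):
    # Stand-in for the frozen pre-trained ViT branch (weight is never updated).
    return [w_vit * v for v in x]


def modulate(cnn_feat, vit_feat, gain):
    # ASSUMPTION: FiLM-style rescaling of CNN features by ViT features.
    # The real VMC module is not specified in the text above.
    return [c * (1.0 + gain * v) for c, v in zip(cnn_feat, vit_feat)]


def loss_and_grads(x, target, w_cnn, w_vit, gain):
    cnn_feat = cnn_branch(x, w_cnn)
    vit_feat = vit_branch(x, w_vit)
    out = modulate(cnn_feat, vit_feat, gain)
    n = len(x)
    # Mean squared error between modulated features and a toy target.
    loss = sum((o - t) ** 2 for o, t in zip(out, target)) / n
    # Analytic gradients for the trainable parameters only.
    g_w = sum(
        2 * (o - t) * xi * (1.0 + gain * v)
        for o, t, xi, v in zip(out, target, x, vit_feat)
    ) / n
    g_gain = sum(
        2 * (o - t) * c * v
        for o, t, c, v in zip(out, target, cnn_feat, vit_feat)
    ) / n
    return loss, g_w, g_gain


def train(x, target, w_cnn=0.1, w_vit=0.5, gain=0.0, lr=0.02, steps=200):
    history = []
    for _ in range(steps):
        loss, g_w, g_gain = loss_and_grads(x, target, w_cnn, w_vit, gain)
        history.append(loss)
        w_cnn -= lr * g_w    # trainable branch is updated
        gain -= lr * g_gain  # modulation parameter is updated
        # w_vit is deliberately left untouched: the ViT branch is frozen.
    return w_cnn, w_vit, gain, history


if __name__ == "__main__":
    w_cnn, w_vit, gain, history = train([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    print("frozen ViT weight unchanged:", w_vit == 0.5)
    print("loss decreased:", history[-1] < history[0])
```

In a real detector the same split would typically be expressed by disabling gradients on the pre-trained encoder's parameters and passing only the remaining parameters to the optimizer; the toy loop above mirrors that by simply never updating `w_vit`.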
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Language-Image Pre-training
