Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection

Xiangyu Gao; Yu Dai; Benliu Qiu; Lanxiao Wang; Heqian Qiu; Hongliang Li

arXiv:2501.16981·cs.CV·March 7, 2025

Open Access

TL;DR

This paper introduces VMCNet, a novel two-branch network that combines trainable CNN features with frozen pre-trained ViT representations to improve open-vocabulary object detection, especially for novel categories.

Contribution

VMCNet modulates trainable CNN features with representations from a frozen pre-trained ViT, improving detection of unseen object categories over existing methods.

Findings

01

Achieves state-of-the-art performance on OV-COCO and OV-LVIS benchmarks.

02

Significantly improves detection AP for novel categories.

03

Demonstrates the effectiveness of combining trainable CNNs with frozen ViT features.

Abstract

Owing to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) like CLIP show superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize pre-trained VLMs to attain generalized representations. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, its frozen backbone does not benefit from the labeled data to strengthen the representation for detection. Therefore, we propose a novel two-branch backbone network, named ViT-Feature-Modulated Multi-Scale Convolutional Network (VMCNet), which consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a VMC module. The trainable CNN branch can be optimized with labeled data, while the frozen pre-trained ViT branch can keep the representation ability…
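The two-branch idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the VMC module combines the two branches, so the scale-and-shift modulation below is an assumption chosen only for illustration, and treating the modulation weights as trainable is likewise an assumption. The class `TwoBranchSketch`, its parameter names, and the use of small random matrices in place of a real CNN and ViT are all hypothetical.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TwoBranchSketch:
    """Toy two-branch backbone: frozen 'ViT' weights plus trainable 'CNN' weights.

    Linear maps stand in for both branches. The modulation rule is an
    assumption (scale and shift), not the paper's VMC module.
    """

    def __init__(self, in_dim, cnn_dims, vit_dim, seed=0):
        rng = random.Random(seed)
        # Frozen branch: stands in for the pre-trained ViT and is never updated.
        self.frozen = {"vit": make_matrix(vit_dim, in_dim, rng)}
        # Trainable parameters: one CNN stand-in and one modulation head per scale.
        self.trainable = {}
        for i, dim in enumerate(cnn_dims):
            self.trainable[f"cnn{i}"] = make_matrix(dim, in_dim, rng)
            self.trainable[f"gamma{i}"] = make_matrix(dim, vit_dim, rng)
            self.trainable[f"beta{i}"] = make_matrix(dim, vit_dim, rng)
        self.num_scales = len(cnn_dims)

    def forward(self, x):
        """Return one modulated feature vector per scale."""
        vit_feat = matvec(self.frozen["vit"], x)
        outputs = []
        for i in range(self.num_scales):
            cnn_feat = matvec(self.trainable[f"cnn{i}"], x)
            # Assumed modulation: the ViT feature predicts a per-channel
            # scale (gamma) and shift (beta) for the CNN feature.
            gamma = matvec(self.trainable[f"gamma{i}"], vit_feat)
            beta = matvec(self.trainable[f"beta{i}"], vit_feat)
            outputs.append(
                [c * (1.0 + g) + b for c, g, b in zip(cnn_feat, gamma, beta)]
            )
        return outputs

    def apply_update(self, grads, lr=0.1):
        """Gradient-descent step that skips every frozen parameter."""
        for name, grad in grads.items():
            if name in self.frozen:
                continue  # the frozen branch ignores gradients
            param = self.trainable[name]
            for r, grad_row in enumerate(grad):
                for c, g in enumerate(grad_row):
                    param[r][c] -= lr * g


# Usage: two scales with 3 and 5 channels, driven by a 4-dimensional input.
model = TwoBranchSketch(in_dim=4, cnn_dims=[3, 5], vit_dim=2)
features = model.forward([1.0, 0.5, -0.5, 2.0])
print([len(f) for f in features])
```

The point of the sketch is the split between the `frozen` and `trainable` dictionaries: labeled data can only change the CNN-side weights, while the stand-in ViT weights stay at their pre-trained values, mirroring the motivation stated in the abstract.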

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Contrastive Language-Image Pre-training