Contrastive Feature Masking Open-Vocabulary Vision Transformer
Dahun Kim, Anelia Angelova, Weicheng Kuo

TL;DR
CFM-ViT introduces a novel image-text pretraining method combining contrastive learning and masked autoencoding in joint embedding space, enhancing open-vocabulary object detection and zero-shot image-text retrieval.
Contribution
Proposes Contrastive Feature Masking (CFM) for ViT, which integrates the MAE objective with contrastive learning, and introduces Positional Embedding Dropout (PED) to improve localization and open-vocabulary detection.
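As a rough illustration of the Positional Embedding Dropout idea named above, the sketch below randomly omits the positional embeddings during training. It is not the authors' implementation: the function name `apply_ped`, the per-sample drop decision, and the `drop_prob` parameter are assumptions made for this example, and NumPy stands in for a real training framework.

```python
import numpy as np

def apply_ped(patch_tokens, pos_embed, drop_prob, rng, training=True):
    """Add positional embeddings to patch tokens, dropping them at random.

    patch_tokens: (batch, num_patches, dim) patch embeddings.
    pos_embed:    (num_patches, dim) learned positional embeddings.
    drop_prob:    probability of dropping the positional embeddings
                  (hypothetical hyperparameter; dropped per sample here).
    """
    if not training:
        # At inference time the positional embeddings are always added.
        return patch_tokens + pos_embed
    batch = patch_tokens.shape[0]
    # keep[i] is 1.0 when sample i keeps its positional embeddings, else 0.0.
    keep = (rng.random(batch) >= drop_prob).astype(patch_tokens.dtype)
    return patch_tokens + keep[:, None, None] * pos_embed[None, :, :]

rng = np.random.default_rng(0)
tokens = np.zeros((4, 9, 8))   # 4 images, 9 patches, 8-dim tokens
pos = np.ones((9, 8))
always_dropped = apply_ped(tokens, pos, drop_prob=1.0, rng=rng)
never_dropped = apply_ped(tokens, pos, drop_prob=0.0, rng=rng)
```

Whether the drop is decided per sample or per batch is a design choice this summary does not specify; the per-sample version is shown only because it is the more general of the two.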
Findings
Achieves 33.9 APr (AP on rare categories) on the LVIS open-vocabulary detection benchmark, surpassing previous methods.
Improves zero-shot detection transfer and image-text retrieval performance.
With PED, a frozen ViT backbone can be used for detection tasks.
Abstract
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach incorporates the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, which reconstructs in pixel space, we perform reconstruction in the joint image-text embedding space, which helps the model learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT…
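The combination of objectives described in the abstract can be sketched as a symmetric image-text contrastive loss plus a reconstruction loss computed on embeddings rather than pixels. This is a minimal sketch under stated assumptions, not the paper's implementation: the cosine-distance reconstruction loss, the `temperature` and `recon_weight` values, and all function names are hypothetical choices for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched image-text pair."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def feature_reconstruction_loss(predicted, target, mask):
    """Mean cosine distance over masked patches only.

    predicted, target: (batch, num_patches, dim) features in the embedding space.
    mask: (batch, num_patches) boolean array, True where a patch was masked.
    """
    cosine = (l2_normalize(predicted) * l2_normalize(target)).sum(axis=-1)
    return ((1.0 - cosine) * mask).sum() / max(mask.sum(), 1)

def combined_loss(image_emb, text_emb, predicted, target, mask, recon_weight=1.0):
    # recon_weight is an assumed balancing factor, not a value from the paper.
    return (contrastive_loss(image_emb, text_emb)
            + recon_weight * feature_reconstruction_loss(predicted, target, mask))
```

The key difference from pixel-space MAE in this sketch is the reconstruction target: `target` holds patch features in the embedding space instead of raw pixel values.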
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Layer Normalization · Adam · Masked autoencoder
