Contrastive Feature Masking Open-Vocabulary Vision Transformer

Dahun Kim; Anelia Angelova; Weicheng Kuo

arXiv:2309.00775·cs.CV·September 6, 2023

TL;DR

CFM-ViT is an image-text pretraining method that combines contrastive learning with masked autoencoding in the joint image-text embedding space, improving open-vocabulary object detection and zero-shot image-text retrieval.

Contribution

The paper proposes Contrastive Feature Masking (CFM) for ViT, which integrates a masked autoencoder (MAE) objective with contrastive learning, and introduces Positional Embedding Dropout (PED) to improve localization and open-vocabulary detection.
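To make the combined objective concrete, here is a minimal pure-Python sketch of a contrastive loss plus a feature-space reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the `recon_weight` parameter, the cosine-distance form of the reconstruction term, and the choice of the contrastive image embedding as the reconstruction target are all hypothetical choices made for this example.

```python
import math

def l2_normalize(v):
    # Small epsilon avoids division by zero for an all-zero vector.
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / n

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(text_to_image))

def feature_reconstruction_loss(reconstructed, target):
    """Cosine distance between a reconstructed feature and its target embedding."""
    return 1.0 - dot(l2_normalize(reconstructed), l2_normalize(target))

def cfm_style_loss(image_embs, text_embs, reconstructed_embs, recon_weight=1.0):
    """Contrastive term plus weighted feature-space reconstruction term.

    Assumption: the reconstruction target is the image embedding produced
    by the contrastive branch (hypothetical choice for this sketch).
    """
    con = contrastive_loss(image_embs, text_embs)
    rec = sum(
        feature_reconstruction_loss(r, t)
        for r, t in zip(reconstructed_embs, image_embs)
    ) / len(image_embs)
    return con + recon_weight * rec
```

The point of the sketch is that the reconstruction term compares embeddings rather than pixels, so both terms operate in the same embedding space and can simply be summed.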

Findings

01 Achieves 33.9 APr (AP on rare categories) on the LVIS open-vocabulary detection benchmark, surpassing previous methods.

02 Improves zero-shot detection transfer and image-text retrieval performance.

03 Enables the use of a frozen ViT backbone for detection.

Abstract

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD). Our approach incorporates the masked autoencoder (MAE) objective into the contrastive learning objective to improve the representation for localization tasks. Unlike the classical MAE method, which reconstructs in pixel space, we perform reconstruction in the joint image-text embedding space, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT…
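The Positional Embedding Dropout step described in the abstract can be sketched in a few lines of pure Python. This is a hedged illustration only: the function name `add_positional_embeddings`, the `drop_prob` default of 0.5, and the choice to drop the positional embeddings for the whole image at once (rather than per token) are assumptions of this sketch, not details confirmed by the text above.

```python
import random

def add_positional_embeddings(patch_tokens, pos_embeddings,
                              drop_prob=0.5, training=True, rng=random):
    """Add positional embeddings to patch tokens, with PED-style dropout.

    During training, with probability drop_prob the positional embeddings
    are skipped entirely for this image (assumed granularity). At inference
    time (training=False) they are always added.
    """
    if training and rng.random() < drop_prob:
        # Dropped: return a copy of the tokens without positional information.
        return [list(tok) for tok in patch_tokens]
    return [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(patch_tokens, pos_embeddings)
    ]
```

For example, calling the function with `drop_prob=1.0` always returns the tokens unchanged, while `training=False` always adds the embeddings. The intuition is that a model which sometimes sees tokens without positional information relies less on the specific grid of positions seen during pretraining, which is how the abstract motivates PED for the scale gap between pretraining and detection finetuning.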

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Layer Normalization · Adam · Masked autoencoder