Attentive Mask CLIP

Yifan Yang; Weiquan Huang; Yixuan Wei; Houwen Peng; Xinyang Jiang; Huiqiang Jiang; Fangyun Wei; Yin Wang; Han Hu; Lili Qiu; Yuqing Yang

arXiv:2212.08653 · cs.CV · October 10, 2023

TL;DR

This paper introduces an attentive token removal method for CLIP training that selectively retains semantically relevant image tokens, improving accuracy and efficiency over random masking and other methods.

Contribution

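The paper proposes attentive masking for CLIP training: image tokens with low semantic correlation to the paired text description are removed, with correlation scores computed online by an EMA version of the visual encoder. This performs better than random token removal while keeping the cost savings of masking.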

Findings

01

Achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification with ViT-B trained on YFCC-15M.

02

Outperforms SLIP on image-text retrieval (Flickr30K and MS COCO).

03

Runs 2.30× faster than SLIP; an efficient variant runs 1.16× faster than plain CLIP with significant accuracy gains.

Abstract

Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this efficient augmentation strategy has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to…
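The two mechanisms named in the abstract, keeping the highest-scoring image tokens and maintaining an EMA copy of the visual encoder, can be sketched as follows. This is a minimal, dependency-free illustration and not the code from the microsoft/a-clip repository: the function names, the `keep_ratio` parameter, the momentum value, and the example scores are all assumptions, and the per-token scores are taken as given rather than computed from a real encoder.

```python
# Illustrative sketch only; names and values are hypothetical,
# not taken from the paper or the official repository.

def attentive_keep_indices(scores, keep_ratio):
    """Return the positions of the highest-scoring image tokens.

    scores: one relevance score per image patch token. Here they are
            assumed to be produced by an EMA copy of the visual encoder.
    keep_ratio: fraction of tokens to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(scores) * keep_ratio))
    # Rank token positions from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Return kept positions in their original spatial order.
    return sorted(ranked[:num_keep])


def ema_update(ema_params, online_params, momentum=0.99):
    """Move each EMA weight a small step toward the online weight."""
    return [momentum * e + (1.0 - momentum) * o
            for e, o in zip(ema_params, online_params)]


# Eight patch tokens with made-up relevance scores; keep half of them.
example_scores = [0.02, 0.30, 0.05, 0.25, 0.01, 0.20, 0.07, 0.10]
kept = attentive_keep_indices(example_scores, keep_ratio=0.5)
print(kept)  # [1, 3, 5, 7]
```

Random token removal would instead sample the kept positions uniformly, which is the baseline the abstract compares against. In a real training loop only the kept tokens would be passed to the online visual encoder, and `ema_update` would run after each optimizer step.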

Code & Models

Repositories

microsoft/a-clip
PyTorch · Official

Videos

Attentive Mask CLIP · YouTube

Taxonomy

Topics: COVID-19 diagnosis using AI · Multimodal Machine Learning Applications · AI in cancer detection

Methods: Contrastive Language-Image Pre-training · Contrastive Learning