Attentive Mask CLIP
Yifan Yang, Weiquan Huang, Yixuan Wei, Houwen Peng, Xinyang Jiang, Huiqiang Jiang, Fangyun Wei, Yin Wang, Han Hu, Lili Qiu, Yuqing Yang

TL;DR
This paper introduces an attentive token removal method for CLIP training that selectively retains semantically relevant image tokens, improving accuracy and efficiency over random masking and other methods.
Contribution
The proposed attentive masking approach enhances CLIP training by selectively removing tokens based on semantic relevance, leading to better performance and efficiency.
Findings
Achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification.
Outperforms previous methods like SLIP in retrieval tasks.
Runs 2.30x faster than SLIP; an efficient variant runs 1.16x faster than plain CLIP with significant accuracy gains.
Abstract
Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this efficient augmentation strategy has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to…
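The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each image token already has a relevance score from an EMA copy of the visual encoder (how that score is derived is not specified here), and it keeps only the highest-scoring tokens. The function names, the `keep_ratio` of 0.5, and the `momentum` of 0.99 are hypothetical choices made for illustration.

```python
import numpy as np

def attentive_keep_indices(scores, keep_ratio=0.5):
    """Pick the indices of the highest-scoring tokens for each image.

    scores: array of shape (batch, num_tokens), one relevance score per token.
    Returns an integer array of shape (batch, k), sorted in ascending order
    so the retained tokens keep their original spatial order.
    keep_ratio is an illustrative value, not one taken from the paper.
    """
    num_tokens = scores.shape[1]
    k = max(1, int(round(num_tokens * keep_ratio)))
    top = np.argsort(-scores, axis=1)[:, :k]  # descending by score
    return np.sort(top, axis=1)

def gather_tokens(tokens, keep_idx):
    """Select retained tokens. tokens has shape (batch, num_tokens, dim)."""
    return np.take_along_axis(tokens, keep_idx[:, :, None], axis=1)

def ema_update(ema_params, online_params, momentum=0.99):
    """Exponential moving average of parameters stored as name -> array dicts."""
    return {
        name: momentum * ema_params[name] + (1.0 - momentum) * online_params[name]
        for name in ema_params
    }

if __name__ == "__main__":
    scores = np.array([[0.1, 0.9, 0.3, 0.7]])
    tokens = np.arange(8, dtype=float).reshape(1, 4, 2)
    idx = attentive_keep_indices(scores, keep_ratio=0.5)
    print(idx)                         # indices of the retained tokens
    print(gather_tokens(tokens, idx))  # the retained token embeddings
```

Random masking would replace the `argsort` step with a random permutation. The attentive variant differs only in ranking tokens by score before truncating, which is why it adds little cost beyond the extra forward pass of the EMA encoder.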
Taxonomy
Topics: COVID-19 diagnosis using AI · Multimodal Machine Learning Applications · AI in cancer detection
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
