Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li

TL;DR
This paper introduces Unmasked Token Alignment (UTA), a method that enhances CLIP-style vision-language models by training a ViT to align its unmasked visual tokens with the corresponding image tokens from a frozen CLIP vision encoder, improving both training efficiency and downstream performance.
Contribution
UTA leverages existing CLIP models to improve vision-language representations without additional training on image-text pairs. Because it uses no [MASK] tokens, it reduces training cost and avoids the training-finetuning inconsistency those tokens introduce.
Findings
UTA outperforms existing MIM methods on various benchmarks.
UTA enhances CLIP models effectively without additional image-text pair training.
UTA is more training-efficient by avoiding [MASK] tokens.
Abstract
Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation…
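The alignment objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-distance loss, the 0.5 mask ratio, the toy token dimensions, and every function name below are assumptions made for illustration, and random vectors stand in for real ViT and CLIP outputs. The sketch assumes the frozen teacher produces a token for every patch position, while the student only processes the positions that were kept (no [MASK] tokens are inserted).

```python
import math
import random


def sample_unmasked_indices(num_tokens, mask_ratio, rng):
    """Choose which token positions stay visible; the rest are simply dropped."""
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def alignment_loss(student_tokens, teacher_tokens, keep_idx):
    """Mean (1 - cosine similarity) over the unmasked positions.

    student_tokens: one vector per kept position, in the order of keep_idx
                    (stand-in for the trainable ViT's outputs).
    teacher_tokens: one vector per position of the full image
                    (stand-in for the frozen CLIP vision encoder's outputs).
    """
    assert len(student_tokens) == len(keep_idx)
    total = 0.0
    for student_vec, pos in zip(student_tokens, keep_idx):
        total += 1.0 - cosine_similarity(student_vec, teacher_tokens[pos])
    return total / len(keep_idx)


# Toy data: 16 patch tokens of dimension 8 (hypothetical sizes).
rng = random.Random(0)
num_tokens, dim = 16, 8
teacher = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
keep_idx = sample_unmasked_indices(num_tokens, mask_ratio=0.5, rng=rng)

# A student that reproduces the teacher exactly versus an untrained one.
perfect_student = [list(teacher[i]) for i in keep_idx]
random_student = [[rng.gauss(0, 1) for _ in range(dim)] for _ in keep_idx]

loss_perfect = alignment_loss(perfect_student, teacher, keep_idx)
loss_random = alignment_loss(random_student, teacher, keep_idx)
```

In this sketch the student handles only half of the positions, which is where the efficiency gain would come from, and the loss is computed only at those positions. A student whose tokens match the teacher's direction drives the loss toward zero regardless of vector scale.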
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · Layer Normalization
