Enhancing Vision-Language Model with Unmasked Token Alignment

Jihao Liu; Jinliang Zheng; Boxiao Liu; Yu Liu; Hongsheng Li

arXiv:2405.19009 · cs.CV · June 17, 2024 · 1 citation


TL;DR

This paper introduces Unmasked Token Alignment (UTA), a method that enhances existing vision-language models such as CLIP. UTA trains a Vision Transformer by aligning its unmasked visual tokens with the corresponding image tokens from a frozen CLIP vision encoder, which improves both training efficiency and downstream performance.

Contribution

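UTA leverages existing CLIP models to improve vision-language representations without additional training on image-text pairs. Because it does not use [MASK] tokens, it is more training-efficient than mask-then-predict approaches and avoids the training-finetuning inconsistency those tokens introduce.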

Findings

01. UTA outperforms existing MIM methods on various benchmarks.

02. UTA enhances CLIP models effectively without additional training on image-text pairs.

03. UTA is more training-efficient because it avoids [MASK] tokens.

Abstract

Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation…
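
The alignment step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the random masking scheme, and the use of a cosine-similarity loss are assumptions made for illustration, and the "encoders" are stand-in callables rather than real ViT or CLIP models. It only shows the data flow: the frozen teacher sees every patch, the student sees only the unmasked patches (no [MASK] tokens are inserted), and the loss compares the two at the unmasked positions.

```python
import numpy as np

def random_keep_indices(num_tokens, mask_ratio, rng):
    """Return sorted indices of the tokens that stay unmasked (hypothetical helper)."""
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    return np.sort(rng.permutation(num_tokens)[:num_keep])

def cosine_alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over aligned token pairs.

    Assumption: cosine similarity is used as the alignment measure here;
    the page above only says the tokens are "aligned".
    """
    s = student_tokens / np.linalg.norm(student_tokens, axis=-1, keepdims=True)
    t = teacher_tokens / np.linalg.norm(teacher_tokens, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def uta_step_loss(patches, student_fn, teacher_fn, mask_ratio, rng):
    """One illustrative alignment step for a single image.

    patches:    (num_tokens, dim) array of patch embeddings
    student_fn: trainable encoder stand-in, maps (k, dim) -> (k, out_dim)
    teacher_fn: frozen encoder stand-in, maps (n, dim) -> (n, out_dim)
    """
    keep = random_keep_indices(patches.shape[0], mask_ratio, rng)
    teacher_out = teacher_fn(patches)        # frozen teacher encodes all patches
    student_out = student_fn(patches[keep])  # student encodes only unmasked patches
    # Compare student tokens with teacher tokens at the same (unmasked) positions.
    return cosine_alignment_loss(student_out, teacher_out[keep]), keep

# Toy usage with linear maps standing in for the two encoders.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 32))         # e.g. a 14 x 14 patch grid
w_teacher = rng.normal(size=(32, 16))
w_student = rng.normal(size=(32, 16))
loss, keep = uta_step_loss(
    patches,
    student_fn=lambda x: x @ w_student,
    teacher_fn=lambda x: x @ w_teacher,
    mask_ratio=0.5,
    rng=rng,
)
print(f"kept {len(keep)} of {patches.shape[0]} tokens, loss = {loss:.4f}")
```

In a real training loop the student would be a ViT updated by gradient descent on this loss while the teacher stays fixed. Because the student's outputs are pulled toward the teacher's token space, they remain comparable with the CLIP text encoder, which is what the abstract means by the ViT being aligned with the text encoder automatically.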


Code & Models

Repositories

jihaonew/uta (PyTorch, official)


Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Absolute Position Encodings · Softmax · Layer Normalization