On the Difference of BERT-style and CLIP-style Text Encoders

Zhihong Chen; Guiming Hardy Chen; Shizhe Diao; Xiang Wan; Benyou Wang

arXiv:2306.03678 · cs.CL · June 7, 2023 · 1 citation

TL;DR

This paper compares BERT-style and CLIP-style text encoders and finds that CLIP-style encoders underperform on general NLP tasks but excel at cross-modal association, an ability the authors call synesthesia.

Contribution

It provides a comprehensive analysis of CLIP-style text encoders, emphasizing their cross-modal capabilities and differences from traditional BERT-style models.

Findings

01. CLIP-style encoders underperform in general text understanding

02. CLIP encoders exhibit synesthesia for cross-modal association

03. BERT encoders outperform in traditional NLP tasks

Abstract

Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, e.g., BERT, one of the representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models that achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to studying the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders from three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones for general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for the cross-modal association, which is more similar to…

Code & Models

Repositories

zhjohnchan/bert-clip-synesthesia (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Speech and dialogue systems

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Linear Layer · Dropout · Linear Warmup With Linear Decay · Adam · Attention Dropout · Layer Normalization