Contrastive Visual Semantic Pretraining Magnifies the Semantics of Natural Language Representations
Robert Wolfe, Aylin Caliskan

TL;DR
Contrastive visual semantic pretraining, as exemplified by CLIP, improves the semantic quality of language representations and reduces their anisotropy, outperforming GPT-2 on multiple semantic benchmarks without fine-tuning.
Contribution
This paper demonstrates that contrastive visual semantic pretraining significantly improves the semantic properties of language models, reducing anisotropy and enhancing performance on semantic tasks.
Findings
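CLIP word embeddings have an intra-layer self-similarity (mean pairwise cosine similarity) under .25 in all layers, compared to greater than .95 in the top layer of GPT-2.
CLIP word embeddings achieve a new corpus-based state of the art on the RG65 evaluation, at .88.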
CLIP's sentence embeddings show decreasing self-similarity with layer depth, indicating richer semantic encoding.
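As a rough illustration of how a score like the RG65 figure above is typically computed, the sketch below rank-correlates model similarities with human ratings for the same word pairs. It assumes the score is Spearman's rho, which this summary states only for the sentence-level result. The function names and the toy numbers are hypothetical and are not taken from the paper or from RG65.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: one human rating and one model cosine
# similarity per word pair. These are NOT real RG65 values.
human_ratings = [3.9, 3.1, 0.4, 1.2]
model_similarities = [0.81, 0.66, 0.12, 0.05]
rho = spearman_rho(human_ratings, model_similarities)
```

Because only ranks matter, the model similarities do not need to be on the same scale as the human ratings.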
Abstract
We examine the effects of contrastive visual semantic pretraining by comparing the geometry and semantic properties of contextualized English language representations formed by GPT-2 and CLIP, a zero-shot multimodal image classifier which adapts the GPT-2 architecture to encode image captions. We find that contrastive visual semantic pretraining significantly mitigates the anisotropy found in contextualized word embeddings from GPT-2, such that the intra-layer self-similarity (mean pairwise cosine similarity) of CLIP word embeddings is under .25 in all layers, compared to greater than .95 in the top layer of GPT-2. CLIP word embeddings outperform GPT-2 on word-level semantic intrinsic evaluation tasks, and achieve a new corpus-based state of the art for the RG65 evaluation, at .88. CLIP also forms fine-grained semantic representations of sentences, and obtains Spearman's rho = .73 on…
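The abstract defines intra-layer self-similarity as the mean pairwise cosine similarity of the embeddings in a layer. A minimal sketch of that computation follows. The function names and the toy vectors are hypothetical, chosen only to show that near-parallel vectors (an anisotropic space) score close to 1 while spread-out vectors score much lower. They are not model outputs.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_layer_self_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Hypothetical toy layers (2-dimensional for readability).
anisotropic_layer = [[1.0, 0.1], [1.0, 0.2], [1.0, -0.1]]  # nearly parallel
spread_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]       # varied directions

high = intra_layer_self_similarity(anisotropic_layer)
low = intra_layer_self_similarity(spread_layer)
```

In practice the embeddings would be one layer's vectors for a sample of words in context, and the number of pairs grows quadratically with the sample size, so a vectorized implementation would be preferable for large samples.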
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Softmax · Dense Connections · Residual Connection
