How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings

Kawin Ethayarajh

arXiv:1909.00512·cs.CL·September 4, 2019

TL;DR

This paper investigates how context-specific BERT, ELMo, and GPT-2 embeddings are. It finds that upper layers produce more context-dependent representations and that a static embedding explains only a small share of the variance in a word's contextualized representations.

Contribution

It provides a detailed geometric analysis of how contextualized embeddings differ across models and layers, highlighting the increasing context-specificity in upper layers.

Findings

1. Upper layers produce more context-specific representations.
2. On average, less than 5% of the variance in a word's contextualized representations is explained by a static embedding for that word.
3. Representations are not isotropic in any layer.
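The variance finding above can be made concrete with a small sketch. It assumes, as an illustration rather than a statement of the paper's exact procedure, that a word's "static embedding" is taken to be the first principal direction of its (uncentered) occurrence vectors, and that "variance explained" is the share of total squared norm captured by that direction. The function name `first_direction_variance_ratio` and the toy vectors are hypothetical; real inputs would be one layer's representations of a single word across many sentences.

```python
import math
import random


def first_direction_variance_ratio(vectors, iters=200, seed=0):
    """Approximate the share of total squared norm captured by the top direction.

    vectors: list of equal-length lists, one per occurrence of the same word.
    Assumption: the occurrence matrix X is NOT mean-centered, so the result is
    sigma_1^2 / sum_i sigma_i^2 for the singular values of X.
    The top direction is found by power iteration on X^T X.
    """
    dim = len(vectors[0])
    total = sum(x * x for v in vectors for x in v)  # equals the sum of sigma_i^2
    if total == 0.0:
        return 0.0

    # Random (but reproducible) start so we are not orthogonal to the top direction.
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    direction = [x / norm for x in direction]

    for _ in range(iters):
        # Apply X^T X to the current direction without forming the matrix.
        proj = [sum(a * b for a, b in zip(v, direction)) for v in vectors]
        nxt = [sum(p * v[j] for p, v in zip(proj, vectors)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        direction = [x / norm for x in nxt]

    # Squared norm of the data projected onto the converged direction.
    proj = [sum(a * b for a, b in zip(v, direction)) for v in vectors]
    return sum(p * p for p in proj) / total


if __name__ == "__main__":
    # Hypothetical 2-d "occurrence vectors" for one word in three contexts.
    toy_occurrences = [[3.0, 0.1], [2.5, -0.2], [0.2, 1.0]]
    print(first_direction_variance_ratio(toy_occurrences))
```

A ratio near 1 would mean a single static vector captures almost everything about the word's occurrences; a low ratio would mean most of the variation across contexts is not captured by one direction.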

Abstract

Replacing static word embeddings with contextualized word representations has yielded significant improvements on many NLP tasks. However, just how contextual are the contextualized representations produced by models such as ELMo and BERT? Are there infinitely many context-specific representations for each word, or are words essentially assigned one of a finite number of word-sense representations? For one, we find that the contextualized representations of all words are not isotropic in any layer of the contextualizing model. While representations of the same word in different contexts still have a greater cosine similarity than those of two different words, this self-similarity is much lower in upper layers. This suggests that upper layers of contextualizing models produce more context-specific representations, much like how upper layers of LSTMs produce more task-specific…
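The comparison the abstract describes, cosine similarity of the same word across contexts versus cosine similarity between two different words, can be sketched as follows. This is a minimal illustration, assuming self-similarity is the mean pairwise cosine similarity over a word's occurrence vectors; the helper names (`cosine`, `self_similarity`, `cross_word_similarity`) and the toy vectors are hypothetical and are not taken from the paper's code.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(occurrences):
    """Mean pairwise cosine similarity over one word's occurrence vectors."""
    pairs = list(combinations(occurrences, 2))
    if not pairs:
        raise ValueError("need at least two occurrences of the word")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def cross_word_similarity(word_a, word_b):
    """Mean cosine similarity between occurrences of two different words."""
    sims = [cosine(u, v) for u in word_a for v in word_b]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Hypothetical occurrence vectors from a single layer (toy 3-d values).
    dog = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1], [0.8, 0.1, 0.2]]
    run = [[0.1, 1.0, 0.3], [0.0, 0.9, 0.4]]
    print("self-similarity (dog):", self_similarity(dog))
    print("cross-word similarity (dog vs. run):", cross_word_similarity(dog, run))
```

Computed per layer, a self-similarity that falls toward the cross-word baseline would indicate more context-specific representations. In an isotropic space the cross-word baseline would sit near zero, which is why it is a useful reference point when reading raw cosine values.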

Code & Models

Repositories

sonsus/albert_paraphrase (PyTorch)

Taxonomy

Methods: Linear Layer · Cosine Annealing · Sigmoid Activation · Tanh Activation · Weight Decay · Residual Connection · Adam · Layer Normalization · Attention Is All You Need · Dropout