TL;DR
This paper measures how context-specific the embeddings produced by BERT, ELMo, and GPT-2 are. It finds that upper layers produce more context-specific representations and that a static embedding explains only a small share of the variance in a word's contextualized representations.
Contribution
It provides a detailed geometric analysis of how contextualized embeddings differ across models and layers, highlighting the increasing context-specificity in upper layers.
Findings
Upper layers produce more context-specific representations.
On average, less than 5% of the variance in a word's contextualized representations is explained by a static embedding for that word.
Representations are not isotropic in any layer.
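The "variance explained by a static embedding" finding can be made concrete with a small sketch. One way to quantify it, assumed here for illustration and not taken from the authors' code, is the fraction of variance in a word's contextualized vectors that is captured by their first principal component. The function name `max_explainable_variance`, the mean-centering step, and the toy inputs are all assumptions of this sketch.

```python
import numpy as np


def max_explainable_variance(occurrences: np.ndarray) -> float:
    """Fraction of variance captured by the first principal component.

    occurrences: array of shape (n_contexts, dim), one row per occurrence
    of the same word in a different context (hypothetical input layout).
    """
    occurrences = np.asarray(occurrences, dtype=float)
    # Assumption of this sketch: mean-center before the SVD, as in standard PCA.
    centered = occurrences - occurrences.mean(axis=0, keepdims=True)
    # Singular values of the centered matrix; their squares are proportional
    # to the variance along each principal direction.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    total = float(np.sum(singular_values ** 2))
    if total == 0.0:
        # All occurrences are identical, so there is no variance to explain.
        return 0.0
    return float(singular_values[0] ** 2 / total)


# Toy data: variance spread evenly over two axes.
spread_evenly = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
# Toy data: every occurrence lies on a single line (rank one).
single_direction = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0])

mev_even = max_explainable_variance(spread_evenly)
mev_line = max_explainable_variance(single_direction)
```

A value near 1 would mean a single direction accounts for almost everything, so a static vector could stand in for the contextualized ones. A low value, as the finding reports, means it could not.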
Abstract
Replacing static word embeddings with contextualized word representations has yielded significant improvements on many NLP tasks. However, just how contextual are the contextualized representations produced by models such as ELMo and BERT? Are there infinitely many context-specific representations for each word, or are words essentially assigned one of a finite number of word-sense representations? For one, we find that the contextualized representations of all words are not isotropic in any layer of the contextualizing model. While representations of the same word in different contexts still have a greater cosine similarity than those of two different words, this self-similarity is much lower in upper layers. This suggests that upper layers of contextualizing models produce more context-specific representations, much like how upper layers of LSTMs produce more task-specific…
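The self-similarity and isotropy measures mentioned in the abstract can be sketched in a few lines. This is a minimal illustration on synthetic vectors, not the authors' implementation: `mean_pairwise_cosine`, the toy data, and the choice to subtract a random-word baseline as a rough correction for anisotropy are all assumptions of this sketch.

```python
import numpy as np


def mean_pairwise_cosine(vectors) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    vectors = np.asarray(vectors, dtype=float)
    n = vectors.shape[0]
    if n < 2:
        raise ValueError("need at least two vectors")
    # Normalize rows so the dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Drop the diagonal (each vector compared with itself).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for one layer's outputs (hypothetical data).
# A shared offset makes the toy space anisotropic: all vectors point
# roughly the same way, so even unrelated words have high cosine similarity.
shared_offset = 3.0 * rng.normal(size=dim)
word_direction = rng.normal(size=dim)

# The same word in 20 contexts: a common word vector plus small context noise.
same_word = shared_offset + word_direction + 0.3 * rng.normal(size=(20, dim))
# 20 occurrences of unrelated words: only the shared offset in common.
random_words = shared_offset + rng.normal(size=(20, dim))

self_similarity = mean_pairwise_cosine(same_word)
anisotropy_baseline = mean_pairwise_cosine(random_words)
# Assumption: subtracting the baseline gives a rough anisotropy-adjusted score.
adjusted_self_similarity = self_similarity - anisotropy_baseline
```

In an isotropic space the baseline would sit near zero. A high baseline, as in this toy setup, means raw self-similarity overstates how much a word's representations have in common.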
Taxonomy
Methods: Linear Layer · Cosine Annealing · Sigmoid Activation · Tanh Activation · Weight Decay · Residual Connection · Adam · Layer Normalization · Attention Is All You Need · Dropout
