BERT Has Uncommon Sense: Similarity Ranking for Word Sense BERTology
Luke Gessler, Nathan Schneider

TL;DR
This paper evaluates how well BERT and similar contextualized word embedding (CWE) models represent different word senses, especially rare ones. It examines embedding neighborhoods directly, without explicit sense supervision, and finds that performance varies considerably among models.
Contribution
It formulates a query-by-example nearest neighbor retrieval task to assess sense representation in CWE models, and examines ranking performance for words and senses in different frequency bands, particularly uncommon senses.
Findings
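On two English sense-annotated corpora, several popular CWE models all outperform a random baseline on sense ranking, even for proportionally rare senses.
Performance varies considerably even among models with similar architectures and pretraining regimes, with especially large differences for rare senses.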
Models differ in their ability to approximate word senses without supervision.
Abstract
An important question concerning contextualized word embedding (CWE) models like BERT is how well they can represent different word senses, especially those in the long tail of uncommon senses. Rather than build a WSD system as in previous work, we investigate contextualized embedding neighborhoods directly, formulating a query-by-example nearest neighbor retrieval task and examining ranking performance for words and senses in different frequency bands. In an evaluation on two English sense-annotated corpora, we find that several popular CWE models all outperform a random baseline even for proportionally rare senses, without explicit sense supervision. However, performance varies considerably even among models with similar architectures and pretraining regimes, with especially large differences for rare word senses, revealing that CWE models are not all created equal when it comes to…
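The query-by-example retrieval idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity, the choice of average precision as the ranking score, the 2-dimensional toy vectors, and the sense labels such as `bank%river` are all assumptions made for illustration. In practice the vectors would come from a CWE model such as BERT.

```python
import numpy as np

def rank_by_cosine(query_vec, pool_vecs):
    """Return pool indices sorted from most to least cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    sims = p @ q
    # Negate so that argsort (ascending) yields descending similarity.
    return np.argsort(-sims, kind="stable")

def average_precision(ranked_labels, target):
    """Average precision of a ranked label list with respect to one target label."""
    hits = 0
    precisions = []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == target:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical toy "embeddings" of five occurrences of the word "bank".
pool_vecs = np.array([
    [1.0, 0.1],  # bank%finance
    [0.9, 0.2],  # bank%finance
    [0.1, 1.0],  # bank%river
    [0.2, 0.9],  # bank%river
    [0.8, 0.3],  # bank%finance
])
pool_labels = [
    "bank%finance", "bank%finance", "bank%river", "bank%river", "bank%finance",
]

# The query is one more occurrence, annotated with the (assumed) river sense.
query_vec = np.array([0.15, 0.95])
query_label = "bank%river"

order = rank_by_cosine(query_vec, pool_vecs)
ranked_labels = [pool_labels[i] for i in order]
ap = average_precision(ranked_labels, query_label)
print(ranked_labels, ap)
```

A model whose neighborhoods separate senses well places same-sense occurrences at the top of the ranking, which raises the score. Averaging such scores separately over frequent and rare senses would give a per-frequency-band comparison of the kind the abstract describes.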
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Multi-Head Attention · Softmax · Linear Warmup With Linear Decay · Dropout · Attention Dropout · Weight Decay
