Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

Kaitlyn Zhou; Kawin Ethayarajh; Dallas Card; Dan Jurafsky

arXiv:2205.05092 · cs.CL · May 12, 2022 · 1 citation

TL;DR

This paper shows that cosine similarity over BERT embeddings systematically underestimates the similarity of high-frequency words relative to human judgments. The authors conjecture that the cause is a difference in the representational geometry of high and low frequency words, which affects NLP tasks and metrics that rely on cosine.

Contribution

It identifies and explains the systematic underestimation of high-frequency word similarities by cosine in BERT embeddings, linking it to frequency-related geometric differences.

Findings

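01. Relative to human judgments, cosine similarity underestimates the similarity of frequent words, both with other instances of the same word and with other words across contexts.

02. The underestimation is linked to frequency-dependent differences in embedding geometry and persists after controlling for polysemy and other factors.

03. A formal geometric argument for the two-dimensional case supports the conjectured explanation.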

Abstract

Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.
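As a rough illustration of the quantity the abstract discusses, the sketch below computes cosine similarity between two vectors and the mean pairwise cosine across several instances of one word. The function names and the toy 2D vectors are assumptions made for illustration; they are not BERT embeddings and this is not the authors' code. It only shows that instances spread over a wider range of directions receive a lower average cosine.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of rows."""
    X = np.asarray(embeddings, dtype=float)
    # Normalize each row to unit length so dot products equal cosines.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    # Keep only the strict upper triangle: each unordered pair once, no self-pairs.
    upper = np.triu_indices(len(X), k=1)
    return float(sims[upper].mean())


# Hypothetical toy data: three "contextual instances" of a word in 2D.
# tight: instances point in nearly the same direction.
# spread: instances cover a wider range of directions.
tight = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]])
spread = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

tight_score = mean_pairwise_cosine(tight)
spread_score = mean_pairwise_cosine(spread)
print(f"tight:  {tight_score:.3f}")
print(f"spread: {spread_score:.3f}")
```

With real data, each row would be a contextual embedding of one occurrence of a word, and the comparison of interest would be between cosine scores and human similarity judgments, which this toy example does not model.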

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Weight Decay · Dropout · WordPiece · Layer Normalization · Softmax · Attention Dropout · Linear Warmup With Linear Decay