Understanding Undesirable Word Embedding Associations
Kawin Ethayarajh, David Duvenaud, and Graeme Hirst

TL;DR
This paper investigates biases in word embeddings, showing that certain debiasing methods are equivalent to training on unbiased data, and introduces a new bias measure revealing that some models amplify stereotypes.
Contribution
It proves that post hoc debiasing via subspace projection is, under certain conditions, equivalent to training on an unbiased corpus, and introduces RIPA, a new measure of word embedding bias.
Findings
Debiasing via subspace projection is theoretically equivalent to unbiased training under certain conditions.
WEAT overestimates bias systematically.
For gender-stereotyped words, SGNS amplifies the gender association already present in the training corpus.
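To make the subspace projection idea concrete, here is a minimal sketch that removes the component of a word vector lying along a single bias direction. It is an illustration under simplifying assumptions, not the procedure from the paper or from Bolukbasi et al. (2016): the bias subspace is taken to be one direction defined by a single word pair, and the 3-dimensional vectors and the names `project_out`, `he`, `she`, and `nurse` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def project_out(w, direction):
    """Return w minus its component along `direction`.

    Assumption: the bias subspace is one-dimensional, so removing it
    amounts to subtracting a single projection.
    """
    b = normalize(direction)
    scale = dot(w, b)
    return [wi - scale * bi for wi, bi in zip(w, b)]


# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
he = [1.0, 0.0, 0.5]
she = [-1.0, 0.0, 0.5]
gender_direction = [a - b for a, b in zip(he, she)]

nurse = [-0.4, 0.8, 0.2]
nurse_debiased = project_out(nurse, gender_direction)
```

After the projection, `nurse_debiased` has no component along `gender_direction`, while its components orthogonal to that direction are unchanged.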
Abstract
Word embeddings are often criticized for capturing undesirable word associations such as gender stereotypes. However, methods for measuring and removing such biases remain poorly understood. We show that for any embedding model that implicitly does matrix factorization, debiasing vectors post hoc using subspace projection (Bolukbasi et al., 2016) is, under certain conditions, equivalent to training on an unbiased corpus. We also prove that WEAT, the most common association test for word embeddings, systematically overestimates bias. Given that the subspace projection method is provably effective, we use it to derive a new measure of association called the relational inner product association (RIPA). Experiments with RIPA reveal that, on average, skipgram with negative sampling (SGNS) does not make most words any more gendered than they are in the training corpus. However, for…
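As a rough illustration of how a RIPA-style score can be computed, the sketch below takes the inner product of a word vector with the unit-normalized difference of a defining word pair. The function name `ripa`, the 3-dimensional toy vectors, and the word choices are hypothetical, and the sketch assumes a single defining pair rather than whatever aggregation over pairs the paper uses.

```python
import math


def ripa(w, x, y):
    """Inner product of w with the unit vector pointing from y to x.

    Positive values mean w leans toward x, negative values toward y.
    Unlike cosine similarity, w itself is not normalized here.
    """
    diff = [a - b for a, b in zip(x, y)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("x and y must be distinct vectors")
    return sum(wi * d / norm for wi, d in zip(w, diff))


# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
man = [1.0, 0.0, 0.5]
woman = [-1.0, 0.0, 0.5]
nurse = [-0.4, 0.8, 0.2]
table = [0.0, 0.3, 0.9]

nurse_score = ripa(nurse, man, woman)
table_score = ripa(table, man, woman)
```

In this toy setup `nurse_score` is negative (the vector leans toward `woman`), while `table_score` is zero because `table` has no component along the `man` minus `woman` direction.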
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
