Understanding the Origins of Bias in Word Embeddings
Marc-Etienne Brunet, Colleen Alkalay-Houlihan, Ashton Anderson, Richard Zemel

TL;DR
This paper introduces a method to trace and understand the origins of bias in word embeddings by analyzing how changes in training data influence bias, aiding in bias mitigation.
Contribution
The authors develop a technique using influence functions to identify how perturbations in training data affect bias in word embeddings, providing insights into bias sources.
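To make the quantity being traced concrete, the sketch below computes a bias score for an embedding and the change in that score when the embedding is perturbed. It is a minimal illustration under stated assumptions, not the authors' implementation: the summary above does not name the bias metric, so a WEAT-style effect size is assumed here; the word lists, the random toy vectors, and the function names (`weat_effect_size`, `differential_bias`, `toy_embedding`) are hypothetical; and the "perturbed" embedding is simulated with added noise rather than obtained by retraining on a modified corpus.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(word, attrs_a, attrs_b, emb):
    """Mean similarity of `word` to attribute set A minus its mean similarity to B."""
    sim_a = np.mean([cosine(emb[word], emb[a]) for a in attrs_a])
    sim_b = np.mean([cosine(emb[word], emb[b]) for b in attrs_b])
    return float(sim_a - sim_b)

def weat_effect_size(targets_s, targets_t, attrs_a, attrs_b, emb):
    """WEAT-style effect size (assumed metric): difference of mean associations
    of the two target sets, divided by the sample std over all target words."""
    s_assoc = [association(w, attrs_a, attrs_b, emb) for w in targets_s]
    t_assoc = [association(w, attrs_a, attrs_b, emb) for w in targets_t]
    pooled_std = np.std(s_assoc + t_assoc, ddof=1)
    return float((np.mean(s_assoc) - np.mean(t_assoc)) / pooled_std)

def differential_bias(emb_full, emb_perturbed, targets_s, targets_t, attrs_a, attrs_b):
    """Bias of the original embedding minus bias of the perturbed embedding."""
    full = weat_effect_size(targets_s, targets_t, attrs_a, attrs_b, emb_full)
    pert = weat_effect_size(targets_s, targets_t, attrs_a, attrs_b, emb_perturbed)
    return full - pert

# Hypothetical word lists, for illustration only.
S = ["science", "physics", "math"]
T = ["art", "poetry", "dance"]
A = ["he", "man", "him"]
B = ["she", "woman", "her"]

def toy_embedding(seed, dim=16):
    """Random toy vectors standing in for a trained embedding."""
    rng = np.random.default_rng(seed)
    return {w: rng.normal(size=dim) for w in S + T + A + B}

emb_full = toy_embedding(seed=0)

# Stand-in for "retrain after removing some documents": small random noise.
# A real experiment would retrain, or approximate the retrained vectors.
noise_rng = np.random.default_rng(1)
emb_perturbed = {w: v + 0.05 * noise_rng.normal(size=v.shape) for w, v in emb_full.items()}

bias_full = weat_effect_size(S, T, A, B, emb_full)
delta = differential_bias(emb_full, emb_perturbed, S, T, A, B)
print(f"bias of full embedding: {bias_full:.3f}")
print(f"differential bias under the toy perturbation: {delta:.3f}")
```

Computing this difference exactly for every candidate perturbation would require retraining the embedding each time. As described above, the paper's technique uses influence functions to approximate the effect of a perturbation instead.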
Findings
Influence function-based approximations of how a training-data perturbation changes embedding bias are highly accurate.
Bias in embeddings can be traced back to specific training documents.
Removing certain documents reduces bias significantly.
Abstract
The power of machine learning systems not only promises great technical progress, but risks societal harm. As a recent example, researchers have shown that popular word embedding algorithms exhibit stereotypical biases, such as gender bias. The widespread use of these algorithms in machine learning systems, from automated translation services to curriculum vitae scanners, can amplify stereotypes in important contexts. Although methods have been developed to measure these biases and alter word embeddings to mitigate their biased representations, there is a lack of understanding in how word embedding bias depends on the training data. In this work, we develop a technique for understanding the origins of bias in word embeddings. Given a word embedding trained on a corpus, our method identifies how perturbing the corpus will affect the bias of the resulting embedding. This can be used to…
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Natural Language Processing Techniques
