What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization
Sahil Manchanda, Pannaga Shivaswamy

TL;DR
This paper identifies and mitigates name bias in text-embeddings by proposing a text anonymization method that removes names during inference, improving the accuracy of semantic similarity assessments.
Contribution
It introduces a simple, training-free anonymization technique to reduce name bias in text-embedding models, enhancing downstream NLP task performance.
Findings
Name bias exists in various text-embedding models.
Anonymization improves semantic similarity accuracy.
The method is training-free, which makes it practical and easy to implement.
Abstract
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of names such as persons, locations, organizations etc. in the text. Our study shows how the presence of name bias in text-embedding models can potentially lead to erroneous conclusions in the assessment of thematic similarity. Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity, or indicate dissimilarity simply because of the names in the text even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose text anonymization during inference, which involves removing references to names, while…
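The idea of removing names before comparing texts can be illustrated with a small sketch. This is not the authors' implementation: it assumes a hand-supplied, hypothetical name list (`NAMES`) in place of a real named-entity recognizer, and a toy bag-of-words vector (`embed`) in place of a real text-embedding model. The function names `anonymize`, `embed`, and `cosine` are chosen here for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical name list; a real pipeline would more likely detect names
# automatically (for example with a named-entity recognizer).
NAMES = ["Alice", "Bob", "Paris", "Acme Corp"]


def anonymize(text, names=NAMES):
    """Remove every listed name from the text (whole-word, case-sensitive)."""
    # Try longer names first so multi-word names win over their parts.
    ordered = sorted(names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    stripped = re.sub(rf"\b(?:{pattern})\b", "", text)
    # Collapse the whitespace left behind by the removed names.
    return re.sub(r"\s+", " ", stripped).strip()


def embed(text):
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity(a, b, anonymized=False):
    """Compare two texts, optionally removing names first."""
    if anonymized:
        a, b = anonymize(a), anonymize(b)
    return cosine(embed(a), embed(b))


# Same content, different names: names push the score down.
same_a = "Alice repaired the old bicycle"
same_b = "Bob repaired the old bicycle"

# Different content, same names: names push the score up.
diff_a = "Alice and Bob argued"
diff_b = "Alice and Bob danced"

for a, b in [(same_a, same_b), (diff_a, diff_b)]:
    print(a, "|", b)
    print("  raw:       ", round(similarity(a, b), 3))
    print("  anonymized:", round(similarity(a, b, anonymized=True), 3))
```

In this toy setting, anonymization raises the score for the pair that differs only in names and lowers it for the pair that shares only names, which mirrors the two failure modes the abstract describes. A real embedding model encodes far more than token overlap, so the size of the effect here says nothing about the paper's measured results.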
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Data Quality and Management
