What is in a name? Mitigating Name Bias in Text Embeddings via Anonymization

Sahil Manchanda; Pannaga Shivaswamy

arXiv:2502.02903·cs.CL·February 6, 2025

TL;DR

This paper identifies name bias in text embeddings and mitigates it with a text anonymization method that removes names at inference time, improving the accuracy of semantic similarity assessments.

Contribution

It introduces a simple, training-free anonymization technique to reduce name bias in text-embedding models, enhancing downstream NLP task performance.

Findings

1. Name bias exists in various text-embedding models.
2. Anonymization improves semantic similarity accuracy.
3. The method is practical and easy to implement.

Abstract

Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of names such as persons, locations, organizations etc. in the text. Our study shows how the presence of name-bias in text-embedding models can potentially lead to erroneous conclusions in assessment of thematic similarity. Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity, or indicate dissimilarity simply because of the names in the text even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose text-anonymization during inference which involves removing references to names, while…
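
The idea of removing names before comparing texts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the name list `KNOWN_NAMES` is hand-written (a real pipeline would detect names automatically), the helper names `anonymize` and `bow_cosine` are hypothetical, and a bag-of-words cosine stands in for a real text-embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical, hand-written name list used only for this illustration.
# A real pipeline would detect names automatically rather than list them.
KNOWN_NAMES = ["Alice Smith", "Paris", "Acme Corp"]


def anonymize(text, names=KNOWN_NAMES):
    """Remove each listed name from the text and tidy leftover whitespace."""
    # Remove longer names first so that a short name never splits a longer one.
    for name in sorted(names, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", "", text)
    # Drop any space left dangling before punctuation, then collapse runs of spaces.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


def bow_cosine(a, b):
    """Cosine similarity of lowercase word counts (toy stand-in for an embedding model)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


# Two texts that share names but describe different events.
doc_a = "Alice Smith visited Paris to negotiate a merger."
doc_b = "Alice Smith visited Paris to watch a tennis final."

raw_sim = bow_cosine(doc_a, doc_b)
anon_sim = bow_cosine(anonymize(doc_a), anonymize(doc_b))
print(f"similarity with names:    {raw_sim:.3f}")
print(f"similarity without names: {anon_sim:.3f}")
```

In this toy setup the shared names account for part of the overlap between the two texts, so the score drops once they are removed. Whether and how much a real embedding model behaves this way is what the paper itself measures; the sketch only shows the shape of the inference-time step.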

Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Data Quality and Management