On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Tarek Naous, Wei Xu

TL;DR
This paper investigates the origins of cultural biases in language models, focusing on how pre-training data and linguistic variations contribute, and introduces a new benchmark to evaluate these biases across Arabic and English.
Contribution
It introduces CAMeL-2, a bilingual benchmark for assessing cultural biases in language models, and analyzes how data representation and tokenization affect bias in Arabic and English.
Findings
Language models show smaller performance gaps between Arab and Western entities when tested in English than when tested in Arabic.
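In Arabic, language models struggle with entities that appear at high frequency in pre-training data, where an entity can carry multiple word senses, and with entities affected by script overlap.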
Frequency-based tokenization exacerbates biases, especially with larger vocabularies.
Abstract
Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that…
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution
