From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

Manuel Tonneau; Diyi Liu; Samuel Fraiberger; Ralph Schroeder; Scott A. Hale; Paul Röttger

arXiv:2404.17874 · cs.CL · May 20, 2025

TL;DR

This paper investigates cultural biases in hate speech datasets across different languages and geographies, revealing overrepresentation of certain countries and providing recommendations for more culturally balanced data collection.

Contribution

The paper introduces a systematic evaluation of cultural bias in hate speech datasets, using language and geographical metadata as cultural proxies, and recommends ways to collect more culturally balanced data.

Findings

01. The English-language bias in hate speech datasets has decreased in recent years.

02. Hate speech datasets for English, Arabic, and Spanish overrepresent specific countries.

03. Significant geo-cultural bias exists: for English, datasets overrepresent the US and the UK.

Abstract

Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as one language may be spoken in different countries home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We conduct a systematic survey of HS datasets in eight languages and confirm past findings on their English-language bias, but also show that this bias has been steadily decreasing in the past few years. For three geographically-widespread languages -- English, Arabic and Spanish -- we then leverage geographical metadata from tweets to approximate geo-cultural contexts by pairing language and country information. We find that HS datasets for these languages exhibit a strong geo-cultural bias, largely…
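The pairing of language and country information described in the abstract can be illustrated with a minimal sketch. It assumes each post carries a language code and an optional country code; the field names `lang` and `country`, the helper `country_shares`, and the toy records are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter, defaultdict

# Toy records for illustration only; NOT data from the paper.
# The field names "lang" and "country" are hypothetical.
posts = [
    {"lang": "en", "country": "US"},
    {"lang": "en", "country": "US"},
    {"lang": "en", "country": "GB"},
    {"lang": "en", "country": "IN"},
    {"lang": "es", "country": "ES"},
    {"lang": "es", "country": "MX"},
    {"lang": "ar", "country": None},  # no geographic metadata available
]


def country_shares(records):
    """Return {language: {country: share}} over records with a known country."""
    counts = defaultdict(Counter)
    for rec in records:
        # Skip posts without geographic metadata.
        if rec.get("country"):
            counts[rec["lang"]][rec["country"]] += 1
    return {
        lang: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for lang, ctr in counts.items()
    }


shares = country_shares(posts)
print(shares["en"])
```

Comparing shares like these against an external reference, such as each country's share of a language's speakers, is one way to flag overrepresentation; the choice of reference is an assumption here, not a detail confirmed by the text above.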

Code & Models

Repositories

manueltonneau/hs-survey-cultural-bias
Official

Taxonomy

Topics: Hate Speech and Cyberbullying Detection