A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers
Valentin Barriere, Sebastian Cifuentes

TL;DR
This study investigates how country names influence affect-related tweet classifier predictions, revealing biases linked to language and training data, especially affecting English and less-resourced languages.
Contribution
It introduces a counterfactual perturbation method for bias detection in classifiers and analyzes the impact of country names on predictions across multiple affect-related tasks.
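The counterfactual perturbation idea can be sketched as follows: take a real tweet that mentions a country, swap in other countries, and measure how far the classifier's score moves from the original. This is a minimal illustration, not the authors' implementation. The `COUNTRIES` list, the function names, and the `classify` callable (any function that maps text to label probabilities) are assumptions introduced here.

```python
import re
from typing import Callable, Dict, List

# Hypothetical entity list; the paper's actual lists are not given in this summary.
COUNTRIES = ["France", "Chile", "Japan", "Nigeria"]

Scores = Dict[str, float]  # label -> probability


def make_counterfactuals(text: str, countries: List[str]) -> Dict[str, str]:
    """Return one variant of `text` per country, replacing every country mention."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, countries)) + r")\b")
    if not pattern.search(text):
        return {}  # nothing to perturb in this tweet
    # A function replacement avoids any backslash interpretation in the new string.
    return {c: pattern.sub(lambda _m: c, text) for c in countries}


def prediction_shift(
    text: str,
    classify: Callable[[str], Scores],
    countries: List[str],
    label: str,
) -> Dict[str, float]:
    """Change in P(label) for each counterfactual, relative to the original text."""
    base = classify(text)[label]
    variants = make_counterfactuals(text, countries)
    return {c: classify(v)[label] - base for c, v in variants.items()}
```

Averaging `prediction_shift` over many tweets per country would give a per-country bias estimate. Because the perturbations are applied to target-domain tweets rather than templates, the rest of each sentence stays natural.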
Findings
Country names significantly affect classifier predictions, by up to 23% in hate speech detection and up to 60% for negative emotions such as anger.
The authors hypothesize that these biases stem from the training data of pre-trained language models (PLMs), especially for English.
Correlations between affect predictions and language likelihoods reveal language-specific biases.
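As a rough illustration of the last finding, a language-likelihood score such as perplexity can be correlated with per-example affect predictions. This summary does not state which likelihood measure or correlation coefficient the authors used, so perplexity (suggested by the paper's title) and Pearson's r are assumptions here, as are the function names.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence from its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

Here `xs` could hold the perplexity of each perturbed tweet under a PLM and `ys` the classifier's probability for a given label on the same tweets. A strong correlation within one language would be consistent with the language-specific bias described above.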
Abstract
In this paper, we apply a method to quantify biases associated with named entities from various countries. We create counterfactual examples with small perturbations on target-domain data instead of relying on templates or specific datasets for bias detection. On widely used classifiers for subjectivity analysis, including sentiment, emotion, hate speech, and offensive text using Twitter data, our results demonstrate positive biases related to the language spoken in a country across all classifiers studied. Notably, the presence of certain country names in a sentence can strongly influence predictions, up to a 23% change in hate speech detection and up to a 60% change in the prediction of negative emotions such as anger. We hypothesize that these biases stem from the training data of pre-trained language models (PLMs) and find correlations between affect predictions and PLMs…
Taxonomy
Topics: Computational and Text Analysis Methods
