Reducing Sentiment Bias in Language Models via Counterfactual Evaluation
Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, Pushmeet Kohli

TL;DR
This paper investigates sentiment bias in language models, quantifies it using fairness metrics, and proposes regularization techniques to reduce bias while maintaining text quality.
Contribution
It introduces a counterfactual evaluation framework for sentiment bias and proposes regularization methods to mitigate bias in large-scale language models.
Findings
Large models exhibit significant sentiment bias.
Regularization improves fairness metrics.
Bias reduction does not significantly affect perplexity or semantic similarity.
Abstract
Advances in language modeling architectures and the availability of large text corpora have driven progress in automatic text generation. While this results in models capable of generating coherent texts, it also prompts models to internalize social biases present in the training corpus. This paper aims to quantify and reduce a particular type of bias exhibited by language models: bias in the sentiment of generated text. Given a conditioning context (e.g., a writing prompt) and a language model, we analyze if (and how) the sentiment of the generated text is affected by changes in values of sensitive attributes (e.g., country names, occupations, genders) in the conditioning context using a form of counterfactual evaluation. We quantify sentiment bias by adopting individual and group fairness metrics from the fair machine learning literature, and demonstrate that large-scale models…
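The counterfactual evaluation described in the abstract can be illustrated with a minimal sketch: fill one prompt template with different values of a sensitive attribute, generate continuations, score their sentiment, and compare the scores across values. Everything named here is an assumption for illustration and does not come from the paper: `toy_generate` stands in for a language model, `toy_sentiment` stands in for a sentiment classifier, and the max-minus-min gap of mean scores is a simple stand-in, not the paper's fairness metric.

```python
from statistics import mean
from typing import Callable, Dict, List


def counterfactual_sentiment(
    template: str,
    attribute_values: List[str],
    generate: Callable[[str], List[str]],
    sentiment: Callable[[str], float],
) -> Dict[str, float]:
    """Mean sentiment of generated continuations for each attribute value.

    Only the attribute value changes between prompts; the rest of the
    template is held fixed (the counterfactual part of the evaluation).
    """
    scores = {}
    for value in attribute_values:
        prompt = template.format(attr=value)
        continuations = generate(prompt)
        scores[value] = mean([sentiment(text) for text in continuations])
    return scores


def max_gap(scores: Dict[str, float]) -> float:
    """Largest difference in mean sentiment between any two attribute values."""
    return max(scores.values()) - min(scores.values())


# Hypothetical stand-in for a language model: deliberately biased so the
# gap is visible. A real evaluation would sample from the model under test.
def toy_generate(prompt: str) -> List[str]:
    if "baker" in prompt:
        return [prompt + " and loved it.", prompt + " and it was great."]
    return [prompt + " and hated it.", prompt + " and it was fine."]


POSITIVE = {"loved", "great"}
NEGATIVE = {"hated"}


# Hypothetical stand-in for a sentiment classifier returning a score in [0, 1].
def toy_sentiment(text: str) -> float:
    words = text.lower().replace(".", "").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.5 + 0.5 * (pos - neg) / max(pos + neg, 1)


scores = counterfactual_sentiment(
    "My friend works as a {attr}", ["baker", "doctor"], toy_generate, toy_sentiment
)
print(scores, max_gap(scores))
```

A gap near zero would suggest that swapping the attribute value leaves the sentiment of generated text roughly unchanged; a large gap flags a prompt template on which the model treats the values differently.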
Taxonomy
Topics: Topic Modeling · Hate Speech and Cyberbullying Detection · Ethics and Social Impacts of AI
