A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia
Lance Calvin Lim Gamboa, Mark Lee

TL;DR
This paper introduces a new information-theoretic metric for attributing bias in language models at the token level, demonstrating its effectiveness on multilingual Southeast Asian models and revealing prevalent biases linked to specific topics.
Contribution
The paper proposes the bias attribution score, a novel metric for explaining bias in language models, and applies it to uncover biases in multilingual Southeast Asian PLMs.
Findings
Southeast Asian PLMs exhibit sexist and homophobic biases.
Bias is strongly associated with words related to crime, intimate relationships, and helping.
The new metric effectively identifies token-level bias contributions.
Abstract
Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and explainability. We propose a novel metric, the bias attribution score, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it to multilingual PLMs, including models from Southeast Asia which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
