A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia

Lance Calvin Lim Gamboa; Mark Lee

arXiv:2410.15464·cs.CL·June 10, 2025

TL;DR

This paper introduces a new information-theoretic metric for attributing bias in language models at the token level, demonstrating its effectiveness on multilingual Southeast Asian models and revealing prevalent biases linked to specific topics.

Contribution

The paper proposes the bias attribution score, a novel metric for explaining bias in language models at the token level, and applies it to uncover biases in multilingual PLMs from Southeast Asia.
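This summary does not give the formula for the bias attribution score, so the following is only an illustrative sketch of what an information-theoretic, token-level comparison can look like. It assumes, hypothetically, that a model yields one probability distribution per shared token for each sentence in a minimal pair, and it uses the Jensen-Shannon divergence as a stand-in measure. The function names, the three-word vocabulary, and all probabilities are invented for illustration and are not taken from the paper or its repository.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def token_divergences(tokens, dists_a, dists_b):
    """Pair each shared token with the divergence between its two distributions."""
    return [
        (tok, js_divergence(pa, pb))
        for tok, pa, pb in zip(tokens, dists_a, dists_b)
    ]


# Hypothetical data: one distribution per shared token, per sentence variant.
tokens = ["always", "helps", "others"]
dists_a = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.5, 0.3, 0.2]]
dists_b = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]

scores = token_divergences(tokens, dists_a, dists_b)
for tok, score in sorted(scores, key=lambda item: item[1], reverse=True):
    print(f"{tok}: {score:.4f}")
```

In this toy setup, tokens whose distributions shift most between the two sentence variants rank highest, which is the kind of per-token ranking an attribution metric is meant to provide. How the paper actually defines and signs its score should be checked against the paper and the linked repository.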

Findings

01 Southeast Asian PLMs exhibit sexist and homophobic biases.

02 Bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories.

03 The new metric identifies token-level contributions to biased behavior.

Abstract

Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and explainability. We propose a novel metric, the bias attribution score, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it to multilingual PLMs, including models from Southeast Asia which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping, among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where…

Code & Models

Repositories

gamboalance/bias_attribution_scores (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling