Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory

Yongxin Deng (1), Xihe Qiu (1), Xiaoyu Tan (2), Jing Pan (3), Chen Jue (1), Zhijun Fang (4), Yinghui Xu (5), Wei Chu (2), Yuan Qi (5) ((1) Shanghai University of Engineering Science, (2) INF Technology (Shanghai) Co., Ltd., (3) Monash University, (4) Donghua University, (5) Fudan University)

arXiv:2408.10608 · cs.CL · August 21, 2024 · 2 citations

TL;DR

This paper introduces a Bayesian theory-based framework called BTBR to identify and remove implicit biases in large language models, addressing subtle biases that evade existing mitigation techniques.

Contribution

The paper presents a novel bias removal method using Bayesian likelihood ratio screening and model editing, specifically targeting implicit biases in LLMs.
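This summary names model editing but does not say which editing rule BTBR uses. As a hedged illustration only, the sketch below shows a generic rank-one weight edit of the kind used by locate-then-edit methods such as ROME, simplified by dropping ROME's key-covariance term. A "key" vector `k` (the internal representation that triggers a biased association) is remapped to a new "value" vector `v`. The helper names `rank_one_edit` and `matvec` and the toy 2×2 matrix are hypothetical and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product W @ x using nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rank_one_edit(W: Matrix, k: Vector, v: Vector) -> Matrix:
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k == v.

    Only the part of W acting along the key direction k changes:
    for any input x orthogonal to k, W' x == W x.
    """
    kk = sum(ki * ki for ki in k)
    if kk == 0:
        raise ValueError("key vector must be non-zero")
    # Residual between the desired output and the current output for k.
    residual = [vi - wki for vi, wki in zip(v, matvec(W, k))]
    # Add the outer product residual * k^T / (k^T k) row by row.
    return [[w + r * kj / kk for w, kj in zip(row, k)]
            for row, r in zip(W, residual)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # toy "layer" weights (hypothetical)
    k = [1.0, 0.0]                 # key to remap (hypothetical)
    v = [0.0, 2.0]                 # desired new output for k (hypothetical)
    W_new = rank_one_edit(W, k, v)
    print(W_new)                   # [[0.0, 0.0], [2.0, 1.0]]
    print(matvec(W_new, k))        # [0.0, 2.0], now equals v
    print(matvec(W_new, [0.0, 1.0]))  # [0.0, 1.0], unchanged for orthogonal input
```

The design point is locality: a rank-one update rewrites the model's response to one key direction while leaving inputs orthogonal to that key untouched, which is why edits of this kind are attractive for removing a specific association without retraining.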

Findings

1. Confirmed the presence of implicit biases in LLMs

2. Demonstrated BTBR's effectiveness in bias removal

3. Improved fairness in LLM outputs

Abstract

Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model's weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we have formally defined the implicit bias problem and developed an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs…
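The abstract describes likelihood ratio screening only at a high level. The sketch below is a hypothetical illustration, not the authors' implementation. It assumes access to a log-probability scorer for the target LLM and for a reference model that did not see the biased data (both assumptions). Each dataset entry is scored by the log likelihood ratio log p_target(x) − log p_reference(x), and entries above a threshold (an assumed parameter) are flagged. The log ratio is then turned into a posterior using Bayes' rule in odds form: posterior odds = prior odds × likelihood ratio. Toy unigram tables stand in for real models so the example runs without dependencies. The names `unigram_scorer`, `likelihood_ratio_screen`, and `posterior_prob` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# A scorer maps a text to its total log-probability under some model.
LogProbFn = Callable[[str], float]


def unigram_scorer(table: Dict[str, float], floor: float = -10.0) -> LogProbFn:
    """Toy stand-in for a language model: sums per-token log-probabilities.

    Tokens missing from the table receive the `floor` log-probability.
    """
    def score(text: str) -> float:
        return sum(table.get(tok, floor) for tok in text.lower().split())
    return score


def posterior_prob(log_lr: float, prior: float) -> float:
    """Bayes' rule in odds form: log posterior odds = log prior odds + log LR."""
    log_odds = math.log(prior / (1.0 - prior)) + log_lr
    return 1.0 / (1.0 + math.exp(-log_odds))


def likelihood_ratio_screen(
    entries: List[str],
    target: LogProbFn,
    reference: LogProbFn,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Return (entry, log LR) pairs whose log LR exceeds `threshold`.

    log LR = log p_target(x) - log p_reference(x). A large positive value
    means the target model finds the entry much more likely than the
    reference does, which this sketch assumes signals an absorbed bias.
    """
    flagged = []
    for text in entries:
        log_lr = target(text) - reference(text)
        if log_lr > threshold:
            flagged.append((text, log_lr))
    flagged.sort(key=lambda pair: pair[1], reverse=True)  # strongest evidence first
    return flagged


if __name__ == "__main__":
    reference = unigram_scorer({"nurses": -3.0, "engineers": -3.0, "are": -1.0,
                                "women": -4.0, "men": -4.0})
    target = unigram_scorer({"nurses": -3.0, "engineers": -3.0, "are": -1.0,
                             "women": -2.0, "men": -5.0})
    hits = likelihood_ratio_screen(["Nurses are women", "Engineers are men"],
                                   target, reference)
    for text, log_lr in hits:
        # Prints: Nurses are women 2.0 0.881
        print(text, log_lr, round(posterior_prob(log_lr, prior=0.5), 3))
```

With a neutral prior of 0.5, a log likelihood ratio of 2.0 moves the posterior to about 0.88. "Engineers are men" is not flagged because the toy target model finds it less likely than the reference does. In practice the scorers would sum token log-probabilities from real models, and the threshold and prior would need calibration.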

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Employee Welfare and Language Studies · Natural Language Processing Techniques