Identifying and Mitigating Social Bias Knowledge in Language Models
Ruizhe Chen, Yichen Li, Jianfei Yang, Joey Tianyi Zhou, Jian Wu, Zuozhu Liu

TL;DR
This paper introduces BiaScope, a new benchmark for assessing social bias in language models, and proposes FAST, a fine-grained debiasing method that effectively reduces bias without sacrificing knowledge or accuracy.
Contribution
The paper presents a novel bias mitigation benchmark and a fine-grained debiasing approach that improves fairness while preserving knowledge in language models.
Findings
FAST outperforms existing debiasing methods in bias reduction.
FAST maintains knowledge retention and downstream task performance.
BiaScope provides a comprehensive evaluation of social bias mitigation.
Abstract
Generating fair and accurate predictions plays a pivotal role in deploying large language models (LLMs) in the real world. However, existing debiasing methods inevitably generate unfair or incorrect predictions as they are designed and evaluated to achieve parity across different social groups but leave aside individual commonsense facts, resulting in modified knowledge that elicits unreasonable or undesired predictions. In this paper, we first establish a new bias mitigation benchmark, BiaScope, which systematically assesses performance by leveraging newly constructed datasets and metrics on knowledge retention and generalization. Then, we propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST identifies the decisive layer responsible for storing social biases and then calibrates its outputs by integrating a…
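The distinction the abstract draws, between parity across social groups and retention of individual facts, can be illustrated with a small sketch. This is not BiaScope's actual metric suite: the function names `parity_gap` and `retention_rate`, their definitions, and all numbers below are hypothetical and chosen only to show why a method can score well on parity while still altering knowledge.

```python
from statistics import mean


def parity_gap(pairs):
    """Mean absolute difference between the probabilities a model assigns
    to two group-swapped versions of the same prompt (hypothetical metric)."""
    return mean(abs(p_a - p_b) for p_a, p_b in pairs)


def retention_rate(before, after):
    """Fraction of factual predictions left unchanged by debiasing
    (hypothetical metric)."""
    if not before or len(before) != len(after):
        raise ValueError("need two non-empty prediction lists of equal length")
    return sum(b == a for b, a in zip(before, after)) / len(before)


# Made-up toy values, for illustration only.
biased_pairs = [(0.80, 0.20), (0.70, 0.30)]    # large gap between groups
debiased_pairs = [(0.55, 0.45), (0.50, 0.50)]  # gap mostly closed

facts_before = ["she", "he", "her", "his"]   # predictions on factual prompts
facts_after = ["she", "he", "they", "his"]   # one fact changed by debiasing

print("parity gap before:", parity_gap(biased_pairs))
print("parity gap after: ", parity_gap(debiased_pairs))
print("knowledge retained:", retention_rate(facts_before, facts_after))
```

In this toy setup the parity gap shrinks while one factual prediction changes, which is the kind of side effect a parity-only evaluation would not surface.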
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning
