Red-Teaming for Inducing Societal Bias in Large Language Models
Chu Fei Luo, Ahmad Ghawanmeh, Bharat Bhimshetty, Kashyap Murali, Murli Jadhav, Xiaodan Zhu, Faiza Khan Khattak

TL;DR
This paper introduces two bias-specific red-teaming methods, Emotional Bias Probe (EBP) and BiasKG, to evaluate and induce societal bias in large language models, and finds that safety measures often fail to prevent bias.
Contribution
The paper presents novel bias-focused red-teaming strategies, EBP and BiasKG, specifically designed to evaluate and induce social bias in LLMs, highlighting gaps in current safety guardrails.
Findings
Under the proposed attacks, bias increases in every model tested, including models with safety guardrails.
Bias induction methods effectively reveal societal biases in LLMs.
The evaluation underscores the need for improved bias mitigation in AI safety measures.
Abstract
Ensuring the safe deployment of AI systems is critical in industry settings where biased outputs can lead to significant operational, reputational, and regulatory risks. Thorough evaluation before deployment is essential to prevent these hazards. Red-teaming addresses this need by employing adversarial attacks to develop guardrails that detect and reject biased or harmful queries, enabling models to be retrained or steered away from harmful outputs. However, most red-teaming efforts focus on harmful or unethical instructions rather than addressing social bias, leaving this critical area under-explored despite its significant real-world impact, especially in customer-facing systems. We propose two bias-specific red-teaming methods, Emotional Bias Probe (EBP) and BiasKG, to evaluate how standard safety measures for harmful content affect bias. For BiasKG, we refactor natural language…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques
