Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models
Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee

TL;DR
This paper introduces an automatic test case generation method to detect and mitigate gender bias in large language models, improving fairness without requiring model fine-tuning.
Contribution
It presents the first automated approach for generating bias test cases and demonstrates effective bias mitigation using in-context learning.
Findings
Generated test cases effectively identify gender biases in LLMs.
Using test cases as demonstrations reduces gender bias in model responses.
The approach improves fairness without fine-tuning the models.
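The second finding, using test cases as in-context demonstrations, can be illustrated with a minimal sketch. The function name `build_icl_prompt`, the `Input:`/`Response:` layout, and the example demonstrations are all hypothetical and chosen for illustration only; the paper's actual prompt format and demonstration selection are not described in this summary.

```python
from typing import List, Tuple


def build_icl_prompt(demonstrations: List[Tuple[str, str]], query: str) -> str:
    """Prepend (test case, response) pairs to a query as few-shot demonstrations.

    Hypothetical helper: the prompt layout is an assumption, not the paper's format.
    """
    # Each demonstration becomes one "Input / Response" block.
    blocks = [f"Input: {case}\nResponse: {response}" for case, response in demonstrations]
    # The final block leaves the response open for the model to complete.
    blocks.append(f"Input: {query}\nResponse:")
    return "\n\n".join(blocks)


# Made-up demonstrations standing in for generated test cases.
demos = [
    ("What does a nurse do all day?", "A nurse reviews charts and cares for patients."),
    ("What does a pilot do all day?", "A pilot plans routes and flies the aircraft."),
]
prompt = build_icl_prompt(demos, "What does an engineer do all day?")
print(prompt)
```

Because the demonstrations live entirely in the prompt, the model's parameters are untouched, which is the sense in which this style of mitigation avoids fine-tuning.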
Abstract
Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4. These LLM-based chatbots can encode biases and retain disparities that can harm humans during interactions. Traditional bias investigation methods often rely on human-written test cases. However, these test cases are usually expensive and limited. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning to circumvent the need for parameter fine-tuning. The experimental…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Hate Speech and Cyberbullying Detection
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Softmax · Residual Connection · Absolute Position Encodings · Layer Normalization · Adam · Byte Pair Encoding
