Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Hsuan Su; Cheng-Chu Cheng; Hua Farn; Shachi H Kumar; Saurav Sahay; Shang-Tse Chen; Hung-yi Lee

arXiv:2310.11079·cs.CL·October 18, 2023·1 cites

Open Access

TL;DR

This paper introduces an automatic test case generation method to detect and mitigate gender bias in large language models, improving fairness without requiring model fine-tuning.

Contribution

It presents the first automated approach for generating bias test cases and demonstrates effective bias mitigation using in-context learning.

Findings

01. Generated test cases effectively identify gender biases in LLMs.

02. Using test cases as demonstrations reduces gender bias in model responses.

03. The approach improves fairness without fine-tuning the models.

Abstract

Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4. These LLM-based chatbots can encode biases and retain disparities that can harm humans during interactions. Traditional bias investigation methods often rely on human-written test cases. However, these test cases are usually expensive and limited. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning, circumventing the need for parameter fine-tuning. The experimental…
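
The mitigation idea in the abstract, using generated test cases as in-context demonstrations instead of fine-tuning, can be illustrated with a minimal sketch. The abstract does not give the prompt template or say what responses accompany each demonstration, so everything below is an assumption for illustration: the function name `build_icl_prompt`, the `User:`/`Assistant:` turn format, and the placeholder strings are all hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def build_icl_prompt(
    demonstrations: List[Tuple[str, str]],
    user_input: str,
) -> str:
    """Prepend (test case, response) demonstrations to a new user input.

    Hypothetical helper: the turn format is an assumption, not the
    paper's actual template.
    """
    parts = []
    for test_case, response in demonstrations:
        parts.append(f"User: {test_case}\nAssistant: {response}")
    # The final turn is left open for the model to complete.
    parts.append(f"User: {user_input}\nAssistant:")
    return "\n\n".join(parts)


# Placeholder demonstrations; these are not data from the paper.
demos = [
    ("<generated test case 1>", "<desired response 1>"),
    ("<generated test case 2>", "<desired response 2>"),
]
prompt = build_icl_prompt(demos, "<new user input>")
print(prompt)
```

Because the demonstrations live only in the prompt, the model's parameters are untouched, which is the property the abstract highlights. How demonstrations are selected and how bias is measured are not covered by this sketch.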

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Hate Speech and Cyberbullying Detection

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Softmax · Residual Connection · Absolute Position Encodings · Layer Normalization · Adam · Byte Pair Encoding