FairMonitor: A Dual-framework for Detecting Stereotypes and Biases in Large Language Models

Yanhong Bai; Jiabao Zhao; Jinxin Shi; Zhentao Xie; Xingjiao Wu; Liang He

arXiv:2405.03098·cs.CL·May 7, 2024

Open Access

TL;DR

FairMonitor introduces a dual-framework combining static and dynamic detection methods to comprehensively identify explicit and implicit stereotypes and biases in Large Language Models across diverse scenarios.

Contribution

This work presents a novel dual-framework that effectively detects nuanced stereotypes and biases in LLMs using static tests and multi-agent dynamic scenarios.

Findings

01. Static tests evaluate explicit and implicit biases with 10,262 open-ended questions.

02. Dynamic scenarios reveal subtle biases through multi-agent interactions.

03. Combined methods improve bias detection accuracy in LLMs.

Abstract

Detecting stereotypes and biases in Large Language Models (LLMs) is crucial for enhancing fairness and reducing adverse impacts on individuals or groups when these models are applied. Traditional methods, which rely on embedding spaces or probability metrics, fall short in revealing the nuanced and implicit biases present in various contexts. To address this challenge, we propose the FairMonitor framework and adopt a static-dynamic detection method for a comprehensive evaluation of stereotypes and biases in LLMs. The static component consists of a direct inquiry test, an implicit association test, and an unknown situation test, comprising 10,262 open-ended questions that cover 9 sensitive factors and 26 educational scenarios; it is effective for evaluating both explicit and implicit biases. Moreover, we use a multi-agent system to construct the dynamic scenarios for…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection