Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection
Yachao Zhao, Bo Wang, Yan Wang, Dongming Zhao, Ruifang He, Yuexian Hou

TL;DR
This paper introduces a social psychology-inspired framework to systematically compare explicit and implicit biases in large language models, revealing significant differences and underlying factors affecting bias manifestation.
Contribution
It presents a novel self-reflection-based evaluation method for implicit bias and uncovers the contrasting behaviors of explicit and implicit biases in LLMs.
Findings
Implicit bias is stronger and more persistent than explicit bias.
As training data and model size grow, explicit bias decreases while implicit bias increases.
Alignment techniques reduce explicit bias but have limited impact on implicit bias.
Abstract
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. While extensive research has investigated biases in LLMs, prior work has predominantly focused on explicit bias, with minimal attention to implicit bias and the relation between these two forms of bias. This paper presents a systematic framework grounded in social psychology theories to investigate and compare explicit and implicit biases in LLMs. We propose a novel self-reflection-based evaluation framework that operates in two phases: first measuring implicit bias through simulated psychological assessment methods, then evaluating explicit bias by prompting LLMs to analyze their own generated content. Through extensive experiments on advanced LLMs across multiple social dimensions, we demonstrate that LLMs exhibit a substantial inconsistency between explicit and implicit…
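The two-phase procedure described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation. The prompt templates, the function names `two_phase_probe` and `recognized_rate`, the placeholder groups and attributes, and the scoring rule are all assumptions made for illustration, and `stub_model` is a hypothetical stand-in for a real LLM call.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical prompt templates; the paper's actual prompts are not given here.
ASSOCIATION_PROMPT = (
    "Assign the word '{attribute}' to one of: {groups}. Reply with one option only."
)
REFLECTION_PROMPT = (
    "Statement: the word '{attribute}' was assigned to '{group}'. "
    "Does this assignment reflect a social stereotype? Reply yes or no."
)


def two_phase_probe(
    model: Callable[[str], str], groups: List[str], attributes: List[str]
) -> List[Dict[str, str]]:
    """Run an association task (phase 1), then ask the model to judge its own answer (phase 2)."""
    records = []
    for attribute in attributes:
        # Phase 1 (implicit): an association-style task loosely modeled on psychological tests.
        choice = model(
            ASSOCIATION_PROMPT.format(attribute=attribute, groups=", ".join(groups))
        ).strip()
        # Phase 2 (explicit): the model reflects on the content it just produced.
        verdict = model(
            REFLECTION_PROMPT.format(attribute=attribute, group=choice)
        ).strip().lower()
        records.append(
            {"attribute": attribute, "phase1_choice": choice, "phase2_verdict": verdict}
        )
    return records


def recognized_rate(
    records: List[Dict[str, str]], stereotyped_pairs: Set[Tuple[str, str]]
) -> float:
    """Assumed metric: share of stereotyped phase-1 choices that phase 2 labels as stereotyped."""
    stereotyped = [
        r for r in records if (r["attribute"], r["phase1_choice"]) in stereotyped_pairs
    ]
    if not stereotyped:
        return 0.0
    flagged = sum(1 for r in stereotyped if r["phase2_verdict"].startswith("yes"))
    return flagged / len(stereotyped)


def stub_model(prompt: str) -> str:
    """Deterministic placeholder for an LLM; replace with a real model call."""
    if prompt.startswith("Assign"):
        return "group A" if "'careful'" in prompt else "group B"
    return "yes"


records = two_phase_probe(stub_model, ["group A", "group B"], ["careful", "bold"])
print(recognized_rate(records, {("careful", "group A")}))
```

A high value of this assumed metric would mean the model produces an association in phase 1 that it then identifies as stereotyped in phase 2, which is one way to picture the inconsistency between implicit and explicit bias that the abstract reports.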
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Hate Speech and Cyberbullying Detection
