Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala, and Georg Groh

arXiv:2411.16527·cs.CL·January 14, 2025

TL;DR

This paper introduces a method for profiling biases in large language models by analyzing stereotype dimensions in contextual embeddings, providing intuitive visualizations to better understand and communicate model biases.

Contribution

It proposes bias profiles based on social psychology dictionaries to systematically analyze and visualize gender bias across multiple LLMs and contexts.

Findings

01

Gender bias varies across layers and contexts in LLMs.

02

Bias profiles effectively reveal and compare model stereotypes.

03

Profiles facilitate understanding and mitigation of biases.

Abstract

Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI); however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuitiveness and their use for exposing and visualizing bias.
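The abstract does not spell out how a profile is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each stereotype dimension is represented by two small sets of dictionary-word embeddings (a low pole and a high pole), that a dimension's direction is the normalized difference of the pole means, and that a profile is the difference in mean cosine projection between two groups of term embeddings. The function names, the dimension names `warmth` and `competence`, and the 3-dimensional toy vectors are all hypothetical placeholders; in practice the vectors would be contextual embeddings taken from a chosen model layer.

```python
import numpy as np

def dimension_direction(low_vecs, high_vecs):
    """Unit vector from the mean low-pole embedding to the mean high-pole embedding."""
    d = np.mean(high_vecs, axis=0) - np.mean(low_vecs, axis=0)
    return d / np.linalg.norm(d)

def project(vecs, direction):
    """Cosine similarity between each embedding and a unit-length direction."""
    vecs = np.asarray(vecs, dtype=float)
    return vecs @ direction / np.linalg.norm(vecs, axis=1)

def bias_profile(group_a, group_b, dimensions):
    """Per dimension: mean projection of group_a minus mean projection of group_b."""
    profile = {}
    for name, (low, high) in dimensions.items():
        direction = dimension_direction(low, high)
        diff = project(group_a, direction).mean() - project(group_b, direction).mean()
        profile[name] = float(diff)
    return profile

# Toy stand-ins for dictionary-word embeddings (assumption: real ones come from an LLM layer).
dimensions = {
    "warmth": (
        np.array([[-1.0, 0.0, 0.0], [-0.8, 0.1, 0.0]]),   # low-pole words
        np.array([[1.0, 0.0, 0.0], [0.9, -0.1, 0.0]]),    # high-pole words
    ),
    "competence": (
        np.array([[0.0, -1.0, 0.0]]),
        np.array([[0.0, 1.0, 0.0]]),
    ),
}

# Toy stand-ins for embeddings of two groups of gendered terms.
group_a = np.array([[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]])
group_b = np.array([[0.2, 1.0, 0.0], [0.1, 0.9, 0.1]])

profile = bias_profile(group_a, group_b, dimensions)
for name, score in profile.items():
    print(f"{name}: {score:+.3f}")
```

A positive score means group_a lies closer to the high pole of that dimension than group_b does; plotting the scores per dimension (and repeating the computation per layer or per context) would give one possible form of the visual profile the abstract describes.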

Code & Models

Repositories

carolinmschuster/profiling-bias-in-llms
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Hate Speech and Cyberbullying Detection