Can We Locate and Prevent Stereotypes in LLMs?

Alex D'Souza

arXiv:2604.19764·cs.CL·April 23, 2026

TL;DR

This paper examines the internals of GPT-2 Small and Llama 3.2 to locate stereotype-related activations, with the goal of understanding and mitigating harmful biases in large language models.

Contribution

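It explores two approaches for locating stereotypes inside the models: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Together these offer initial insights for bias mitigation in LLMs.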

Findings

01

Identified contrastive neurons encoding stereotypes.

02

Detected attention heads contributing to biased outputs.

03

Mapped 'bias fingerprints' within neural networks.

Abstract

Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of these models, little is known about where these biases reside in the neural network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.
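To make the first approach concrete, here is a minimal sketch of one way contrastive neuron activations could be ranked. It is not the paper's procedure, which the abstract does not specify. It assumes per-prompt activations have already been extracted into two arrays (stereotype prompts and matched neutral prompts) and scores each neuron by a standardized mean difference. The function names, the scoring rule, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def contrastive_neuron_scores(stereo_acts, neutral_acts, eps=1e-8):
    """Standardized mean activation difference per neuron.

    Both inputs are (n_prompts, n_neurons) arrays. The scoring rule is an
    assumption made for illustration, not the paper's method.
    """
    stereo_acts = np.asarray(stereo_acts, dtype=float)
    neutral_acts = np.asarray(neutral_acts, dtype=float)
    diff = stereo_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    pooled_std = np.sqrt(
        0.5 * (stereo_acts.var(axis=0) + neutral_acts.var(axis=0))
    )
    return diff / (pooled_std + eps)


def top_contrastive_neurons(stereo_acts, neutral_acts, k=5):
    """Return the k neurons with the largest absolute contrast score."""
    scores = contrastive_neuron_scores(stereo_acts, neutral_acts)
    order = np.argsort(-np.abs(scores))[:k]
    return [(int(i), float(scores[i])) for i in order]


# Synthetic stand-in for real model activations (assumption: in practice
# these would be read from a chosen layer of the model for each prompt).
rng = np.random.default_rng(0)
n_prompts, n_neurons = 200, 64
neutral = rng.normal(size=(n_prompts, n_neurons))
stereo = rng.normal(size=(n_prompts, n_neurons))
stereo[:, 7] += 3.0   # planted neuron that fires more on stereotype prompts
stereo[:, 21] -= 2.0  # planted neuron that fires less on stereotype prompts

top = top_contrastive_neurons(stereo, neutral, k=2)
print(top)
```

On this synthetic data the two planted neurons should surface at the top of the ranking. With a real model, the same scoring could be applied layer by layer, and the resulting per-layer rankings would be one possible reading of the "bias fingerprints" the abstract mentions.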
