
TL;DR
This paper investigates the internal neural mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations, with the aim of understanding and mitigating harmful biases in large language models.
Contribution
It explores two methods for identifying the specific neurons and attention heads associated with stereotypes, providing initial insights for bias mitigation in LLMs.
Findings
Identified contrastive neurons encoding stereotypes.
Detected attention heads contributing to biased outputs.
Mapped 'bias fingerprints' within neural networks.
Abstract
Stereotypes in large language models (LLMs) can perpetuate harmful societal biases. Despite the widespread use of these models, little is known about where these biases reside in the neural network. This study investigates the internal mechanisms of GPT-2 Small and Llama 3.2 to locate stereotype-related activations. We explore two approaches: identifying individual contrastive neuron activations that encode stereotypes, and detecting attention heads that contribute heavily to biased outputs. Our experiments aim to map these "bias fingerprints" and provide initial insights for mitigating stereotypes.
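The first approach in the abstract, contrastive neuron activations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `rank_contrastive_neurons`, the scoring rule (mean paired activation gap divided by its standard deviation), and the synthetic activations are all assumptions made for illustration. In practice the activation rows would be recorded from a model layer on paired stereotyped and neutral prompts.

```python
import random
from statistics import mean, pstdev


def rank_contrastive_neurons(stereo_acts, neutral_acts, top_k=5):
    """Rank neurons by how consistently they differ across prompt pairs.

    stereo_acts[i] and neutral_acts[i] are activation vectors (one value per
    neuron) for the stereotyped and neutral version of prompt pair i.
    The score is a hypothetical choice: mean gap / std of the gap.
    """
    if len(stereo_acts) != len(neutral_acts):
        raise ValueError("activation lists must contain the same number of pairs")
    n_neurons = len(stereo_acts[0])
    scores = []
    for j in range(n_neurons):
        # Paired differences for neuron j across all prompt pairs.
        diffs = [s[j] - n[j] for s, n in zip(stereo_acts, neutral_acts)]
        # Small epsilon avoids division by zero for constant neurons.
        scores.append(mean(diffs) / (pstdev(diffs) + 1e-8))
    # Largest absolute score first; the sign shows the direction of the gap.
    order = sorted(range(n_neurons), key=lambda j: abs(scores[j]), reverse=True)
    return [(j, scores[j]) for j in order[:top_k]]


# Synthetic stand-in data (assumption): 200 prompt pairs, 16 neurons.
random.seed(0)
n_pairs, n_neurons = 200, 16
neutral = [[random.gauss(0, 1) for _ in range(n_neurons)] for _ in range(n_pairs)]
stereo = [[x + random.gauss(0, 0.1) for x in row] for row in neutral]
for row in stereo:
    row[7] += 2.0   # planted neuron that fires more on stereotyped prompts
    row[12] -= 1.5  # planted neuron that fires less on stereotyped prompts

top = rank_contrastive_neurons(stereo, neutral, top_k=2)
for idx, score in top:
    print(f"neuron {idx}: score {score:+.2f}")
```

On this synthetic data the two planted neurons rank first, which is only a sanity check of the scoring rule. An analogous ranking could in principle be applied per attention head for the second approach, but that is likewise an assumption rather than something the summary specifies.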
