Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
Bhavik Chandna, Zubair Bashir, Procheta Sen

TL;DR
This paper uses mechanistic interpretability to analyze how biases are structurally embedded in large language models like GPT-2 and Llama2, revealing localized bias components and their impact on multiple NLP tasks.
Contribution
It introduces a systematic approach to identify and analyze bias-related components within LLMs, showing their localization, variability, and influence on other NLP tasks.
Findings
Bias-related computations are highly localized, often concentrated in a small subset of layers.
Removing bias components reduces biased outputs but also impacts other NLP tasks.
The identified bias components change across fine-tuning settings, including fine-tuning on tasks unrelated to bias.
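
The ablation idea behind these findings can be illustrated with a toy sketch. This is not the authors' method or code: the component names, the additive bias metric, and every number below are hypothetical, and real models such as GPT-2 or Llama2 do not decompose this cleanly. The sketch only shows the general pattern of ablating one component at a time, ranking components by the change in a bias metric, and then removing the top-ranked ones.

```python
from typing import Dict, FrozenSet, List, Tuple

Component = Tuple[int, int]  # hypothetical (layer, head) identifier

# Made-up contributions of each component to a bias metric, for example
# logit("he") - logit("she") after a profession prompt. Purely illustrative.
CONTRIBUTIONS: Dict[Component, float] = {
    (0, 1): 0.02,
    (3, 4): 0.05,
    (9, 6): 1.10,
    (10, 0): 0.85,
    (11, 2): -0.03,
}


def bias_score(ablated: FrozenSet[Component] = frozenset()) -> float:
    """Toy bias metric with the given components zero-ablated."""
    return sum(v for c, v in CONTRIBUTIONS.items() if c not in ablated)


def rank_by_ablation() -> List[Tuple[Component, float]]:
    """Rank components by how much ablating each one alone shifts the metric."""
    base = bias_score()
    effects = [(c, base - bias_score(frozenset([c]))) for c in CONTRIBUTIONS]
    return sorted(effects, key=lambda e: abs(e[1]), reverse=True)


def score_after_removing_top_k(k: int) -> float:
    """Bias metric after ablating the k components with the largest effect."""
    top = frozenset(c for c, _ in rank_by_ablation()[:k])
    return bias_score(top)


if __name__ == "__main__":
    print("baseline bias score:", bias_score())
    for component, effect in rank_by_ablation():
        print(component, round(effect, 3))
    print("after removing top 2:", score_after_removing_top_k(2))
```

In this toy setup most of the metric comes from two late-layer components, which mirrors the localization claim in spirit only. In a real model the same components may also serve other tasks, which is why removing them can affect performance elsewhere.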
Abstract
Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased…
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
Methods: Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning · Byte Pair Encoding · Softmax · Linear Layer · Dropout
