Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models
Gianmario Voria, Moses Openja, Foutse Khomh, Gemma Catolino, Fabio Palomba

TL;DR
This paper investigates how social stereotypes are encoded in transformer models, identifies biased neurons, and demonstrates that suppressing these neurons can reduce bias with minimal impact on model performance.
Contribution
It introduces a method to trace and suppress biased neurons in pre-trained transformers, advancing interpretability and fairness in AI models for software engineering.
Findings
Biased knowledge is localized in small neuron subsets.
Suppressing biased neurons reduces stereotypes significantly.
Minimal performance loss occurs after bias suppression.
Abstract
The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons, neurons that encode factual information, we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models.…
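The trace-then-suppress idea in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation: it assumes an integrated-gradients style attribution (one common neuron attribution strategy; the exact procedure is not given in this summary) and applies it to a hand-made four-neuron hidden layer instead of BERT. All names and values (`HIDDEN`, `W_OUT`, `attribution`, `suppress`) are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(hidden, w_out):
    # w_out[c][i] is the weight from hidden neuron i to output class c.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return softmax(logits)

def grad_wrt_neuron(hidden, w_out, target, i):
    # Analytic derivative of P(target) with respect to hidden activation i.
    probs = class_probs(hidden, w_out)
    expected_w = sum(p * row[i] for p, row in zip(probs, w_out))
    return probs[target] * (w_out[target][i] - expected_w)

def attribution(hidden, w_out, target, i, steps=50):
    # Midpoint Riemann approximation of
    #   h_i * integral_0^1 dP(target)/dh_i evaluated at alpha * h_i,
    # scaling only neuron i while the other neurons keep their values.
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps
        scaled = list(hidden)
        scaled[i] = alpha * hidden[i]
        total += grad_wrt_neuron(scaled, w_out, target, i)
    return hidden[i] * total / steps

def suppress(hidden, neuron_ids):
    # Suppression here means zeroing the selected activations.
    return [0.0 if i in neuron_ids else h for i, h in enumerate(hidden)]

# Hypothetical toy values (assumption): class 0 stands in for a
# stereotypical completion, class 1 for a neutral one.
HIDDEN = [1.2, 0.1, 0.8, 0.05]
W_OUT = [
    [2.0, 0.1, -0.3, 0.0],
    [-1.0, 0.2, 0.5, 0.1],
]
TARGET = 0

scores = [attribution(HIDDEN, W_OUT, TARGET, i) for i in range(len(HIDDEN))]
top = max(range(len(scores)), key=lambda i: scores[i])
p_before = class_probs(HIDDEN, W_OUT)[TARGET]
p_after = class_probs(suppress(HIDDEN, {top}), W_OUT)[TARGET]

print("attribution scores:", scores)
print("most attributed neuron:", top)
print("P(target) before and after suppression:", p_before, p_after)
```

In this toy setting the attribution of a neuron approximates the drop in target probability when that single neuron is zeroed, which is why the highest-scoring neuron is the natural candidate for suppression. Applying the same idea to a real transformer would additionally require hooking into the model's feed-forward activations, which this sketch does not attempt.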
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
