Race, Ethnicity and Their Implication on Bias in Large Language Models
Shiyue Hu, Ruizhe Li, Yanjun Gao

TL;DR
This study investigates how large language models encode race and ethnicity internally. It finds that these representations vary across models and that targeted interventions only partially reduce bias, which points to the need for more systematic mitigation strategies.
Contribution
It provides a mechanistic analysis of how LLMs represent race and ethnicity, combining probing, neuron-level attribution, and targeted intervention to show where this information is encoded internally and how interventions affect bias.
Findings
Demographic information is distributed across internal units, with substantial variation across models.
Some units encode stereotypes from pretraining.
Interventions reduce bias but leave residual effects.
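
To make the findings above concrete, here is a minimal sketch of why ablating a few units can leave residual signal when an attribute is spread across many units. It is not the authors' pipeline: the activations are synthetic, the least-squares probe stands in for whatever probe the paper uses, mean-ablation is an assumed form of intervention, and every name (make_activations, fit_probe, ablate, run_demo) is hypothetical.

```python
import numpy as np


def make_activations(n=2000, d=32, informative=8, seed=0):
    """Synthetic 'hidden activations' (an assumption, not real LLM states).

    A binary attribute shifts the first `informative` units slightly,
    so the signal is distributed rather than held by one unit.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, :informative] += 0.8 * (2 * y[:, None] - 1)
    return X, y


def fit_probe(X, y):
    """Least-squares linear probe on +/-1 targets, with a bias column."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w


def probe_accuracy(w, X, y):
    """Fraction of examples where the probe's sign matches the label."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((A @ w > 0).astype(int) == y))


def ablate(X, units):
    """Mean-ablation: overwrite the chosen units with their mean over X."""
    Xa = X.copy()
    Xa[:, units] = X[:, units].mean(axis=0)
    return Xa


def run_demo(k=4, informative=8):
    X, y = make_activations(informative=informative)
    half = len(X) // 2
    X_tr, y_tr, X_te, y_te = X[:half], y[:half], X[half:], y[half:]

    w = fit_probe(X_tr, y_tr)
    # Rank units by absolute probe weight (bias term excluded).
    top_k = np.argsort(-np.abs(w[:-1]))[:k]

    return {
        "full": probe_accuracy(w, X_te, y_te),
        "partial": probe_accuracy(w, ablate(X_te, top_k), y_te),
        "all_informative_ablated": probe_accuracy(
            w, ablate(X_te, np.arange(informative)), y_te
        ),
    }


if __name__ == "__main__":
    for name, acc in run_demo().items():
        print(f"{name}: {acc:.3f}")
```

In this toy setup, the probe stays well above chance after the top-ranked units are ablated, and falls to roughly chance only once every informative unit is removed. That mirrors the qualitative pattern in the findings, but it says nothing about the magnitudes reported in the paper.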
Abstract
Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare
