Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, Byron C. Wallace

TL;DR
This paper uses mechanistic interpretability to identify and manipulate sociodemographic biases in healthcare-related LLMs, revealing localized gender encoding and more distributed race representations, with implications for clinical fairness.
Contribution
To the authors' knowledge, this is the first application of mechanistic interpretability to LLMs in a healthcare setting. It examines how sociodemographic information is encoded and how it can be manipulated at the layer and neuron level.
Findings
Gender information is highly localized in MLP layers.
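Patching these activations at inference time can alter generated clinical vignettes for specific conditions and shift downstream clinical predictions that correlate with gender, such as patient risk of depression.
Race information is somewhat more distributed, but can still be manipulated to a degree.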
Abstract
We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of…
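The abstract mentions manipulating representations "at inference time via patching". As a rough illustration of what activation patching of MLP outputs means in general, here is a minimal sketch on a toy residual network in NumPy. It is not the authors' code or model: the network, its sizes, and every name (`forward`, `x_source`, `x_target`, `effects`) are hypothetical, and the random input vectors merely stand in for prompts that differ in a patient attribute.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_LAYERS = 8, 16, 4  # hypothetical toy sizes, not taken from the paper

# Toy residual network: every layer adds an MLP output to the residual stream.
W_in = [rng.normal(scale=0.5, size=(D, H)) for _ in range(N_LAYERS)]
W_out = [rng.normal(scale=0.5, size=(H, D)) for _ in range(N_LAYERS)]
w_readout = rng.normal(size=D)  # maps the final state to a scalar "prediction"


def forward(x, patch=None, cache=None):
    """Run the toy model on input vector x.

    patch: optional {layer_index: mlp_output} that overrides a layer's MLP output.
    cache: optional dict that is filled with each layer's (possibly patched) MLP output.
    """
    h = x.copy()
    for layer in range(N_LAYERS):
        mlp_out = np.maximum(h @ W_in[layer], 0.0) @ W_out[layer]  # ReLU MLP
        if patch is not None and layer in patch:
            mlp_out = patch[layer]  # the intervention: swap in a stored activation
        if cache is not None:
            cache[layer] = mlp_out
        h = h + mlp_out  # residual connection
    return float(h @ w_readout)


# Two inputs standing in for prompts that differ in a patient attribute.
x_source = rng.normal(size=D)
x_target = rng.normal(size=D)

# 1) Record MLP outputs on the source run.
source_cache = {}
y_source = forward(x_source, cache=source_cache)

# 2) Baseline prediction on the target run.
y_target = forward(x_target)

# 3) Patch one layer at a time and measure how far the prediction moves.
effects = {}
for layer in range(N_LAYERS):
    y_patched = forward(x_target, patch={layer: source_cache[layer]})
    effects[layer] = y_patched - y_target

for layer, delta in effects.items():
    print(f"layer {layer}: change in prediction = {delta:+.3f}")
```

In a sweep like this, layers whose patched activations move the output most are candidates for where the attribute is encoded. With a real LLM the same idea would typically be applied through framework hooks on specific MLP modules and token positions, which this sketch does not attempt to reproduce.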
Taxonomy
Topics: Healthcare Policy and Management
Methods: ADaptive gradient method with the OPTimal convergence rate
