Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Hiba Ahsan, Byron C. Wallace

TL;DR
This paper investigates how Sparse Autoencoders (SAEs) can identify and influence racial biases in large language models (LLMs) applied to healthcare tasks, revealing potential for bias detection but limited effectiveness in mitigation for complex tasks.
Contribution
The study demonstrates that SAEs can uncover racial associations in LLMs and steer outputs, but their utility in bias mitigation is limited in realistic clinical scenarios.
Findings
SAEs identify race-related latent features in models.
Steering models via SAEs can influence racial bias in outputs.
Bias mitigation via SAE steering is less effective in complex tasks.
Abstract
LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts. We first identify SAE latents in Gemma-2 models which appear to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., "African American") but also problematic words like "incarceration". We then show that we can use this latent to steer models to generate outputs about Black patients, and further that this can induce problematic…
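To make the two operations named in the abstract concrete (reading a latent's activation, then steering along it), here is a minimal NumPy sketch. It is an illustration under loud assumptions, not the authors' code: the weights are random stand-ins rather than real Gemma-2 SAE parameters, the encoder is tied to the decoder and uses a plain ReLU for simplicity, and `latent_idx`, `alpha`, `steer`, and `ablate` are hypothetical names. Ablation by projection is shown only as one common way to suppress a direction; the source does not say it is the mitigation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, seq_len = 16, 64, 5

# Toy SAE weights: random stand-ins, NOT real Gemma-2 SAE parameters.
W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
W_enc = W_dec.T.copy()   # assumption: tied encoder, purely for illustration
b_enc = np.zeros(n_latents)


def sae_encode(h):
    """Map activations (seq_len, d_model) to latent activations (seq_len, n_latents)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)  # plain ReLU, a simplification


def steer(h, latent_idx, alpha):
    """Add alpha times the latent's decoder direction at every position."""
    return h + alpha * W_dec[latent_idx]


def ablate(h, latent_idx):
    """Remove the component of h along the latent's decoder direction."""
    d = W_dec[latent_idx]
    return h - (h @ d)[..., None] * d


# Hypothetical usage: random vectors stand in for residual-stream activations.
h = rng.normal(size=(seq_len, d_model))
latent_idx, alpha = 3, 8.0   # hypothetical latent index and steering strength

acts_before = sae_encode(h)[:, latent_idx]
h_steered = steer(h, latent_idx, alpha)
acts_after = sae_encode(h_steered)[:, latent_idx]
h_ablated = ablate(h, latent_idx)
```

In a real setup, `h` would come from a hook on a chosen model layer, and the steered or ablated activations would be written back before the forward pass continues. The toy version only shows the arithmetic: steering shifts every position by `alpha` along one direction, and ablation zeroes the projection onto that direction.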
Peer Reviews
Decision: ICLR 2026 Poster
I really like the use of race-correlated latents to investigate bias and also appreciate that the authors show that CoT is not as useful in this regard. I also appreciate the steering experiments, which show some notion of causality here.
- One major limitation is using just two models from the same family. In addition, I am curious why they used the Gemma family instead of the MedGemma family, which was specifically trained for medical tasks.
- It would be nice to show some examples of the CoT which failed to catch the bias in the appendix.
- While there is an ethics section, emphasize how latent-steering tools could be misused (e.g., malicious “race injection” in model inputs) and how to guard against that.
- Some of the effect siz
1. The authors tackle the problem of bias in clinical LLMs through a mechanistic interpretability perspective, which is an important real-world problem.
2. The authors conduct a fairly thorough set of tests to probe the "Black latent" that they discover.
1. The paper is essentially a case study on one specific type of bias (stereotypes against Black patients) in two specific open-source LLMs. It is unclear whether these findings would translate to biases against other demographic groups, or whether there would be a corresponding latent for all such biases. Further, it is unclear for what categories of demographics and clinical concepts the latents are disentangled.
2. I'm not convinced by the authors' argument in 4.2.1 that all of the biases pr
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling
