Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Adam Karvonen, Samuel Marks

TL;DR
This paper demonstrates that internal bias mitigation, which identifies and neutralizes sensitive-attribute directions in LLM activations, significantly improves fairness in realistic hiring scenarios and outperforms simple prompt-based methods.
Contribution
The study introduces an internal bias mitigation technique that neutralizes demographic biases in LLMs by affine concept editing, ensuring robustness across diverse realistic contexts.
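As a rough illustration of the idea, the sketch below applies an affine edit to toy activations with NumPy. It is not the authors' implementation: the difference-of-means direction, the midpoint reference point, the synthetic data, and the helper names `attribute_direction` and `affine_edit` are all assumptions made for this example.

```python
import numpy as np

def attribute_direction(acts_a, acts_b):
    """Unit difference-of-means direction between two groups (an assumed estimator)."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

def affine_edit(acts, direction, reference):
    """Set each activation's component along `direction` to that of `reference`.

    Components orthogonal to `direction` are left unchanged.
    """
    coeff = acts @ direction        # per-example projection, shape (n,)
    target = reference @ direction  # scalar projection of the reference point
    return acts + np.outer(target - coeff, direction)

# Hypothetical toy activations: two groups separated along one axis.
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 3.0
acts_a = rng.normal(size=(200, d)) + shift
acts_b = rng.normal(size=(200, d)) - shift

v = attribute_direction(acts_a, acts_b)
# Assumed reference point: midpoint of the two group means.
ref = 0.5 * (acts_a.mean(axis=0) + acts_b.mean(axis=0))

edited_a = affine_edit(acts_a, v, ref)
edited_b = affine_edit(acts_b, v, ref)

# Gap between group means along the direction, before and after editing.
gap_before = float((acts_a.mean(axis=0) - acts_b.mean(axis=0)) @ v)
gap_after = float((edited_a.mean(axis=0) - edited_b.mean(axis=0)) @ v)
```

After the edit, every activation has the same component along `v`, so the two groups can no longer be told apart along that direction, while the rest of each activation is untouched. In a real model the edit would be applied to hidden states during the forward pass; that wiring is omitted here.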
Findings
Biases up to 12% in interview rates due to contextual factors
Biases favor Black over White and female over male candidates
Interventions reduce bias to below 2.5% while maintaining performance
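To make the percentages above concrete, the sketch below assumes bias is measured as the difference in interview rates between two demographic groups. This summary does not state the paper's exact metric, so the metric, the toy decisions, and the helper names `interview_rate` and `rate_gap` are illustrative assumptions.

```python
def interview_rate(decisions):
    """Fraction of candidates given an interview (1 = interview, 0 = reject)."""
    return sum(decisions) / len(decisions)

def rate_gap(decisions_by_group, group_a, group_b):
    """Interview-rate difference, group_a minus group_b (an assumed bias metric)."""
    return interview_rate(decisions_by_group[group_a]) - interview_rate(
        decisions_by_group[group_b]
    )

# Hypothetical toy decisions, not data from the paper.
toy = {
    "female": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    "male": [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
}

gap = rate_gap(toy, "female", "male")
print(f"interview-rate gap: {gap:.1%}")
```

A positive gap means the first group is interviewed more often. Under this reading, the findings describe gaps of up to 12 percentage points before intervention and below 2.5 afterwards.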
Abstract
Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces…
Peer Reviews
Decision · Submitted to ICLR 2026
- The call to action to use more realistic and challenging evaluations for bias is very important.
- Raising awareness about bias in LLMs, especially when they are employed for important decisions such as hiring, is also very important.
- The authors examine a good set of models, both commercial and open-source.
- The proposed intervention seemed effective on individual demographic axes.
The paper mentions that biases favor Black over White and female over male in their setting — this is counter to many fairness concerns (which typically focus on disadvantage to historically marginalized groups). When such strong claims are presented the evaluation protocol needs to be extremely solid, clearly explained and results need to be thoroughly analyzed. While this should be true always, in this specific context, it is even more important. Unfortunately I find the paper lacking in all t
1. Open-source contribution: The authors release their codebase, data and method, which supports transparency and allows for reproducibility and future extensions.
2. Problem relevance: Bias in LLM-based hiring systems is an important and timely issue. The focus on robustness of debiasing under varying context complexity is conceptually interesting.
1. Overstatement of realism: The paper overclaims the “real-world” nature of its simulated hiring settings. Adding elements like company names, cultural descriptions, or hiring constraints (e.g., “hire the top 10% of candidates”) adds contextual richness, but it does not necessarily make the task realistic. The authors provide no evidence (e.g., comparisons to real hiring data or expert validation) to support that these additions meaningfully increase realism.
2. Prompt debiasing claims conflic
- Important real-world problem: The paper addresses a critical issue as LLMs are increasingly deployed in high-stakes hiring applications with direct impact on people's livelihoods.
- Strong empirical findings: The demonstration that prompt-based mitigations become brittle under realistic conditions is well-documented across multiple models and scenarios.
- Robust internal intervention: The proposed affine concept editing approach shows consistent effectiveness across different contexts.
- Co
- Dataset quality issues: The authors acknowledge in Appendix E that 22% of resumes contained unintended demographic indicators, though they claim minimal impact on results.
- Mechanistic clarity and design choices: Directions are estimated from synthetic data and applied at all layers/tokens. The paper doesn’t ablate which layers matter, how many directions per attribute are needed, or compare ACE against other linear methods (e.g., LEACE, NOP, DAS) in this setting. Gemma-3 sensitivity indicat
Taxonomy
Topics: Names, Identity, and Discrimination Research · Ethics and Social Impacts of AI · Authorship Attribution and Profiling
Methods: ADaptive gradient method with the OPTimal convergence rate
