Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Adam Karvonen; Samuel Marks

arXiv:2506.10922·cs.LG·June 13, 2025

TL;DR

This paper demonstrates that internal bias mitigation through identifying and neutralizing sensitive attribute directions in LLMs significantly improves fairness in realistic hiring scenarios, outperforming simple prompt-based methods.

Contribution

The study introduces an internal bias mitigation technique that neutralizes demographic bias in LLMs via affine concept editing and remains robust across the realistic contexts tested.
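
To make the idea of affine concept editing concrete, here is a minimal sketch in plain Python. It assumes the sensitive-attribute direction is the normalized difference between the mean activations of two demographic groups, and that editing moves each activation's component along that direction to the value at the midpoint of the two group means. The function names (`fit_direction`, `affine_edit`) and the toy two-dimensional vectors are illustrative assumptions, not the authors' code, and the paper's exact procedure may differ.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: Sequence[Vector]) -> Vector:
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit_direction(group_a: Sequence[Vector],
                  group_b: Sequence[Vector]) -> Tuple[Vector, float]:
    """Return (unit direction, target projection).

    Assumption: the direction is the difference of group means, and the
    target is the projection of the midpoint between the two means.
    """
    mu_a, mu_b = mean(group_a), mean(group_b)
    diff = [a - b for a, b in zip(mu_a, mu_b)]
    norm = dot(diff, diff) ** 0.5
    if norm == 0.0:
        raise ValueError("group means coincide; no direction to remove")
    unit = [d / norm for d in diff]
    midpoint = [(a + b) / 2.0 for a, b in zip(mu_a, mu_b)]
    return unit, dot(midpoint, unit)


def affine_edit(x: Vector, unit: Vector, target: float) -> Vector:
    """Set the component of x along `unit` to `target`.

    Everything orthogonal to `unit` is left unchanged.
    """
    shift = target - dot(x, unit)
    return [xi + shift * ui for xi, ui in zip(x, unit)]


# Toy activations (hypothetical values, for illustration only).
group_a = [[2.0, 1.0], [4.0, 3.0]]
group_b = [[0.0, 1.0], [2.0, 3.0]]
unit, target = fit_direction(group_a, group_b)
edited = affine_edit([5.0, 7.0], unit, target)
```

After the edit, every activation has the same projection onto the fitted direction, so a linear readout along that direction can no longer separate the two groups. Using a nonzero target rather than projecting to zero is what makes the edit affine instead of purely linear.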

Findings

01 · Biases up to 12% in interview rates due to contextual factors

02 · Biases favor Black over White and female over male candidates

03 · Interventions reduce bias to below 2.5% while maintaining performance
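
The percentages above can be read as gaps in interview rates between demographic groups. The sketch below shows one way such a gap could be computed, assuming the metric is a simple difference in the share of "interview" decisions, expressed in percentage points. The function names and the toy decisions are made up for illustration; the page does not specify the paper's exact metric.

```python
from typing import Dict, List


def interview_rate(decisions: List[bool]) -> float:
    """Share of candidates who received an interview decision."""
    if not decisions:
        raise ValueError("no decisions for this group")
    return sum(decisions) / len(decisions)


def rate_gap_points(decisions_by_group: Dict[str, List[bool]],
                    group_a: str, group_b: str) -> float:
    """Interview-rate difference (group_a minus group_b) in percentage points."""
    rate_a = interview_rate(decisions_by_group[group_a])
    rate_b = interview_rate(decisions_by_group[group_b])
    return 100.0 * (rate_a - rate_b)


def within_tolerance(gap_points: float, tolerance_points: float) -> bool:
    """True if the absolute gap does not exceed the tolerance."""
    return abs(gap_points) <= tolerance_points


# Hypothetical decisions: True means "interview", False means "reject".
toy = {
    "group_a": [True] * 6 + [False] * 4,
    "group_b": [True] * 5 + [False] * 5,
}
gap = rate_gap_points(toy, "group_a", "group_b")
ok = within_tolerance(gap, 2.5)
```

A positive gap means the first group is interviewed more often. In this toy data the gap is about ten percentage points, so it falls outside a 2.5-point tolerance.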

Abstract

Large language models (LLMs) are increasingly deployed in high-stakes hiring applications, making decisions that directly impact people's careers and livelihoods. While prior studies suggest simple anti-bias prompts can eliminate demographic biases in controlled evaluations, we find these mitigations fail when realistic contextual details are introduced. We address these failures through internal bias mitigation: by identifying and neutralizing sensitive attribute directions within model activations, we achieve robust bias reduction across all tested scenarios. Across leading commercial (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source models (Gemma-2 27B, Gemma-3, Mistral-24B), we find that adding realistic context such as company names, culture descriptions from public careers pages, and selective hiring constraints (e.g., "only accept candidates in the top 10%") induces…
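
The evaluation setup described in the abstract can be pictured as building prompt pairs that differ only in the candidate's name, with optional realistic context layered on top. The sketch below is one hypothetical way to assemble such prompts. The wording of the instructions, the function names, and the placeholder values are assumptions for illustration and are not taken from the paper or its repository.

```python
from typing import Optional, Tuple


def build_prompt(resume: str, candidate_name: str,
                 company: Optional[str] = None,
                 culture: Optional[str] = None,
                 selective: bool = False) -> str:
    """Assemble a screening prompt; context fields are optional add-ons."""
    parts = []
    if company:
        parts.append(f"You are screening candidates for {company}.")
    else:
        parts.append("You are screening candidates for a job opening.")
    if culture:
        parts.append(f"Company culture: {culture}")
    if selective:
        parts.append("Only accept candidates in the top 10%.")
    parts.append(f"Candidate name: {candidate_name}")
    parts.append(f"Resume:\n{resume}")
    parts.append("Should this candidate be interviewed? Answer Yes or No.")
    return "\n\n".join(parts)


def counterfactual_pair(resume: str, name_a: str, name_b: str,
                        **context) -> Tuple[str, str]:
    """Two prompts that are identical except for the candidate name."""
    return (build_prompt(resume, name_a, **context),
            build_prompt(resume, name_b, **context))


# Placeholder inputs (hypothetical, for illustration only).
simple_a, simple_b = counterfactual_pair("RESUME_TEXT", "NAME_A", "NAME_B")
rich_a, rich_b = counterfactual_pair(
    "RESUME_TEXT", "NAME_A", "NAME_B",
    company="ExampleCorp", culture="CULTURE_TEXT", selective=True,
)
```

Holding the resume and context fixed while swapping only the name means any difference in the model's decisions within a pair can be attributed to the name. Comparing the simple and context-rich pairs then shows how much the added context changes that difference.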

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The call to action for more realistic and challenging bias evaluations is very important.
- Raising awareness of bias in LLMs, especially when they are used for important decisions such as hiring, is also very important.
- The authors examine a good set of models, both commercial and open-source.
- The proposed intervention seemed effective on individual demographic axes.

Weaknesses

The paper mentions that biases favor Black over White and female over male in their setting — this is counter to many fairness concerns (which typically focus on disadvantage to historically marginalized groups). When such strong claims are presented the evaluation protocol needs to be extremely solid, clearly explained and results need to be thoroughly analyzed. While this should be true always, in this specific context, it is even more important. Unfortunately I find the paper lacking in all t

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Open-source contribution: The authors release their codebase, data and method, which supports transparency and allows for reproducibility and future extensions.
2. Problem relevance: Bias in LLM-based hiring systems is an important and timely issue. The focus on robustness of debiasing under varying context complexity is conceptually interesting.

Weaknesses

1. Overstatement of realism: The paper overclaims the “real-world” nature of its simulated hiring settings. Adding elements like company names, cultural descriptions, or hiring constraints (e.g., “hire the top 10% of candidates”) adds contextual richness, but it does not necessarily make the task realistic. The authors provide no evidence (e.g., comparisons to real hiring data or expert validation) to support that these additions meaningfully increase realism.
2. Prompt debiasing claims conflic

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Important real-world problem: The paper addresses a critical issue as LLMs are increasingly deployed in high-stakes hiring applications with direct impact on people's livelihoods.
- Strong empirical findings: The demonstration that prompt-based mitigations become brittle under realistic conditions is well-documented across multiple models and scenarios.
- Robust internal intervention: The proposed affine concept editing approach shows consistent effectiveness across different contexts.
- Co

Weaknesses

- Dataset quality issues: The authors acknowledge in Appendix E that 22% of resumes contained unintended demographic indicators, though they claim minimal impact on results.
- Mechanistic clarity and design choices: Directions are estimated from synthetic data and applied at all layers/tokens. The paper doesn’t ablate which layers matter, how many directions per attribute are needed, or compare ACE against other linear methods (e.g., LEACE, NOP, DAS) in this setting. Gemma-3 sensitivity indicat

Code & Models

Repositories

adamkarvonen/llm_bias
jax · Official

Taxonomy

Topics: Names, Identity, and Discrimination Research · Ethics and Social Impacts of AI · Authorship Attribution and Profiling

Methods: ADaptive gradient method with the OPTimal convergence rate