Fair Outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
Jagdish Tripathy, Marcus Buckmann

TL;DR
Instruction-tuned language models can produce fair outputs while retaining internal demographic biases. When those internal representations are manipulated, they significantly change high-stakes decisions, which highlights the need to analyse bias inside the model and not only in its outputs.
Contribution
The paper shows that latent biases in language models are causally potent and asymmetric, and demonstrates that internal representations can be manipulated to alter decisions.
Findings
Models retain and amplify demographic representations internally.
Reinjecting latent bias at critical layers can reverse decisions.
Latent bias effects are asymmetric and vulnerable to adversarial prompts.
Abstract
Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions…
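For readers unfamiliar with activation steering, the general idea can be sketched on synthetic data. This is a minimal illustration and not the authors' implementation. It assumes a difference-of-means steering direction, which is one common choice, and uses random arrays in place of real model activations. The function names, the hidden size, and the steering strength `alpha` are all hypothetical, and the paper's cross-layer interventions are not reproduced here.

```python
import numpy as np

def steering_vector(acts_a, acts_b):
    """Difference of mean activations between two groups of prompts.

    acts_a, acts_b: arrays of shape (n_prompts, hidden_dim), e.g. activations
    at one layer for matched inputs that differ only in one attribute.
    """
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)

def steer(hidden, direction, alpha):
    """Add alpha times the unit-norm direction to every hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# Synthetic stand-ins for layer activations (assumption: hidden_dim = 8).
rng = np.random.default_rng(0)
hidden_dim = 8
offset = np.zeros(hidden_dim)
offset[0] = 2.0  # group A differs from group B along the first coordinate
acts_a = rng.normal(size=(50, hidden_dim)) + offset
acts_b = rng.normal(size=(50, hidden_dim))

direction = steering_vector(acts_a, acts_b)
unit = direction / np.linalg.norm(direction)

# Steer three hypothetical hidden states with strength alpha.
alpha = 4.0
hidden = rng.normal(size=(3, hidden_dim))
steered = steer(hidden, direction, alpha)

# Each state's projection onto the steering direction shifts by alpha.
shift = (steered - hidden) @ unit
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass, and the effect would be measured on the model's decision, not on a projection.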
