Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study
Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki

TL;DR
This study introduces a unified framework for analyzing gender bias in LLMs. It finds that alignment reduces expressed bias but leaves the underlying encoded bias intact, and that this encoded bias can be reactivated under certain conditions.
Contribution
The paper presents a novel unified protocol for comparing internal representations and output bias in LLMs, and evaluates the effects of alignment on both.
Findings
Alignment reduces expressed gender bias in generated outputs.
Internal gender-related associations persist despite debiasing efforts.
Debiasing effects on benchmarks may not generalize to real-world scenarios.
Abstract
During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified…
