Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs

Hyejun Jeong; Shiqing Ma; Amir Houmansadr

arXiv:2410.12010·cs.LG·September 26, 2025

Open Access 3 Reviews

TL;DR

This paper introduces Bias Similarity Measurement (BSM), a new method for comparing fairness across large language models by analyzing their biases as relational properties, enabling systematic auditing of their social bias behaviors.

Contribution

The paper presents BSM, a unified framework that measures bias similarity across models, revealing insights into how different models and tuning methods influence fairness and behavior.

Findings

01

Instruction tuning mainly enforces abstention rather than changing internal biases.

02

Small models gain little accuracy and may become less fair with forced choices.

03

Open-weight models can match or surpass proprietary systems in fairness.

Abstract

Large Language Models (LLMs) reproduce social biases, yet prevailing evaluations score models in isolation, obscuring how biases persist across families and releases. We introduce Bias Similarity Measurement (BSM), which treats fairness as a relational property between models, unifying scalar, distributional, behavioral, and representational signals into a single similarity space. Evaluating 30 LLMs on 1M+ prompts, we find that instruction tuning primarily enforces abstention rather than altering internal representations; small models gain little accuracy and can become less fair under forced choice; and open-weight models can match or exceed proprietary systems. Family signatures diverge: Gemma favors refusal, LLaMA 3.1 approaches neutrality with fewer refusals, and converges toward abstention-heavy behavior overall. Counterintuitively, Gemma 3 Instruct matches GPT-4-level fairness at…
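To make the "single similarity space" idea concrete, here is a minimal sketch of three pairwise signals the abstract and reviews name: an abstention rate (behavioral), cosine similarity between answer distributions (distributional), and linear CKA between hidden-state matrices (representational). This is not the authors' implementation. The function names, the `"unknown"` abstention label, and the toy data are all assumptions for illustration, and the paper's exact metric definitions may differ.

```python
import math
from collections import Counter


def answer_distribution(answers, options):
    """Fraction of answers falling on each option, in the order of `options`."""
    counts = Counter(answers)
    return [counts[o] / len(answers) for o in options]


def abstention_rate(answers, abstain_label="unknown"):
    """Share of answers equal to the (assumed) abstention label."""
    return sum(a == abstain_label for a in answers) / len(answers)


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def _center(X):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def _cross_fro_sq(A, B):
    """Squared Frobenius norm of A^T B for row-major matrices A and B."""
    return sum(
        sum(a * b for a, b in zip(col_a, col_b)) ** 2
        for col_a in zip(*A)
        for col_b in zip(*B)
    )


def linear_cka(X, Y):
    """Linear CKA between two (n_examples x n_features) activation matrices."""
    Xc, Yc = _center(X), _center(Y)
    numerator = _cross_fro_sq(Yc, Xc)
    denominator = math.sqrt(_cross_fro_sq(Xc, Xc) * _cross_fro_sq(Yc, Yc))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical answers from two models on the same four prompts.
    options = ["a", "b", "unknown"]
    model_1 = ["a", "a", "unknown", "b"]
    model_2 = ["unknown", "unknown", "unknown", "b"]
    p = answer_distribution(model_1, options)
    q = answer_distribution(model_2, options)
    print("abstention:", abstention_rate(model_1), abstention_rate(model_2))
    print("distribution cosine:", cosine_similarity(p, q))

    # Hypothetical hidden states: 3 prompts x 2 features per model.
    h1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
    h2 = [[2.0, 1.0], [4.0, 3.0], [7.0, 5.0]]
    print("linear CKA:", linear_cka(h1, h2))
```

Computing each signal for every model pair yields one similarity matrix per signal. How those matrices are combined into a single space is not specified on this page, so that step is left out of the sketch.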

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Novel Integration: First systematic fairness-specific relational framework combining behavioral, distributional, and representational signals with practical auditing workflows.
- Abstention Analysis: Important distinction between fairness-through-caution and fairness-through-representation, demonstrating that high abstention conceals directional bias.
- Counterintuitive Findings: Discovery that small models (LLaMA 3.2 3B, Gemma 3 4B) worsen with tuning; family-specific strategies (Gemma's refu

Weaknesses

- Limited Component Novelty: Applies established techniques [1,4]; large-scale comparative studies exist [2]; confirms known representation preservation [3] rather than discovering new phenomena.
- No Solutions: Purely diagnostic framework; provides no debiasing methods unlike comparable work [8,9].
- Dataset and Coverage Limitations: English-only; limited to 4 dimensions; high failure rates (up to 85%) in open-ended generation; missing newer multi-modal [5] and multi-turn [6] benchmarks.
- Inco

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- A good number of models are tested to give a sense of bias similarity as a function of model family, model size, open models vs. closed models, and so on.
- The components of the similarity score cover the range of dimensions along which a model might be biased (e.g., accuracy for functionality, cosine similarity for relative preferences, CKA for structural similarities in the representations, and so on). This helps to give an indication of why biased behaviors in models might be similar.
- T

Weaknesses

- The main weakness is that bias is considered only with respect to individual traits. Intersectional bias is especially problematic.
- The paper sometimes provides an observation, e.g., a generational bias trend where older models retain stereotypes whereas new models do not. There is no insight into why this is the case, though. Sometimes this may not be possible, e.g., models might be closed, but in other instances it may be possible. How do we learn to prevent these observations?
- In

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* *Comprehensive empirical scope*: The authors conduct meticulous and large-scale evaluations, including a wide range of model families, fine-tuning settings, and bias datasets. This level of breadth is impressive and rare in fairness auditing work.
* *Conceptual clarity*: The decomposition of bias analysis into categorical, distributional, behavioral, and representational dimensions provides a useful organizing lens. This taxonomy helps structure a complex and often fragmented area of research

Weaknesses

* Limited novelty: Beyond the proposed four-part taxonomy, the framework largely repurposes existing metrics (e.g., BBQ bias scores, cosine distance, abstention rates, CKA). While the structure is neat, the underlying analyses could arguably be achieved with standard bias-evaluation tools. The contribution may therefore be more organizational than methodological.
* Utility evaluation is limited: The paper reports accuracy only on the disambiguated BBQ benchmark. Without comparison to broader be

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: ALIGN