Certifying Counterfactual Bias in LLMs

Isha Chaudhary; Qian Hu; Manoj Kumar; Morteza Ziyadi; Rahul Gupta; Gagandeep Singh

arXiv:2405.18780·cs.AI·April 23, 2025·3 cites

TL;DR

This paper introduces LLMCert-B, a novel framework that certifies large language models for counterfactual bias by providing high-confidence bounds on unbiased responses across diverse demographic prompts.

Contribution

It is the first framework to certify LLMs for counterfactual bias with guarantees over distributions of prompts, addressing scalability and reliability issues of prior bias evaluations.

Findings

01 Generated certificates for SOTA LLMs exposing vulnerabilities.

02 Applied certification to various prompt distributions including jailbreaks.

03 Demonstrated the framework's ability to identify biases in large models.

Abstract

Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to large numbers of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B, that certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts differing by demographic groups) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting random…
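The certificate described in the abstract is a pair of high-confidence bounds on a probability, and Reviewer 01 notes that the method applies Clopper-Pearson intervals. The following is a minimal sketch of that idea, not the authors' implementation. The names `clopper_pearson`, `certify`, and `sample_is_unbiased` are hypothetical, and the sample count (50) and confidence level (95%) are illustrative assumptions. `sample_is_unbiased` stands in for the whole pipeline of sampling a prefix, building one set of counterfactual prompts, querying the model, and judging whether the responses are unbiased.

```python
import math
import random


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, target, increasing, iters=60):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        # Move the lower end up whenever the root lies to the right of mid.
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided (1 - alpha) Clopper-Pearson interval for k successes in n trials."""
    if not 0 <= k <= n or n == 0:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False
    )
    return lower, upper


def certify(sample_is_unbiased, n=50, alpha=0.05, seed=0):
    """Bound the probability of an unbiased response from n independent samples.

    sample_is_unbiased(rng) is a hypothetical callable returning True when one
    sampled set of counterfactual prompts yields unbiased responses.
    """
    rng = random.Random(seed)
    k = sum(bool(sample_is_unbiased(rng)) for _ in range(n))
    return clopper_pearson(k, n, alpha)


if __name__ == "__main__":
    # Toy stand-in for a model that is unbiased on about 90% of samples.
    low, high = certify(lambda rng: rng.random() < 0.9)
    print(f"P(unbiased) lies in [{low:.3f}, {high:.3f}] with 95% confidence")
```

The interval is exact in the sense that it inverts the binomial tail probabilities directly instead of relying on a normal approximation, which matters when the observed count is close to 0 or to `n`.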

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

Strengths of this paper include:
- Quantification of bias in LLMs is an important issue that is of interest to the community.
- The strategy of encoding information about counterfactual prompts into a prompt distribution is interesting.
- The experiments raise valuable insights about the sort of prompts for which current LLMs are likely to generate biased responses.

Weaknesses

Weaknesses of this paper include:
- From a statistical methodology standpoint, there is not much novelty since the method is a direct application of Clopper-Pearson intervals.
- The writing in Section 3 could be improved in several ways.
  1. Specifying what spaces the different quantities (e.g. $\mathcal{G}$, $\mathcal{A}$) live in, and what operations (e.g. string concatenation) can be applied to them.
  2. Clarifying Definition 1, especially part (3) as it is not directly obvious how an unbia

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The paper is very well-written, organized, and clear.
- As I understood the paper, the approach is straightforward: for a certain set of prompts that may be susceptible to biased model responses, the authors ask, "can we provide confidence intervals bounding how often the model will produce those unwanted/biased responses by essentially generating lots of responses under perturbations/different conditions?" I wouldn’t be surprised if others called this simplicity out as a weakness, but in fact

Weaknesses

- I felt the use of “certificates” was not well-motivated, and even after reading the paper, I’m not sure why one would prefer certificates over just “confidence intervals.” I am unfamiliar with the literature on certificates, and I would recommend the authors better motivate this beyond talking about how others have restricted certificates to local specifications (Lines 246-255).
- The authors might consider adding a section to connect with practitioners who want to deploy their tool, i.e. how

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The authors propose a new approach to provide certifications of unbiasedness (with respect to counterfactual prompts), which has not previously been studied. This is a well-motivated problem setting, especially in the case of black-box models.

Weaknesses

The bias metric seems to match human judgments 76% of the time (as noted in the Appendix). This seems relatively low and brings into question the usability of this metric. I also believe that this information in the Appendix is quite important and should be included in the main text of the paper.

Code & Models

Repositories

uiuc-focal-lab/quacer-b (Official)

Taxonomy

Topics: Computational and Text Analysis Methods

Methods: Sparse Evolutionary Training