Certifying Counterfactual Bias in LLMs
Isha Chaudhary, Qian Hu, Manoj Kumar, Morteza Ziyadi, Rahul Gupta, Gagandeep Singh

TL;DR
This paper introduces LLMCert-B, a novel framework that certifies large language models for counterfactual bias by providing high-confidence bounds on unbiased responses across diverse demographic prompts.
Contribution
It is the first framework to certify LLMs for counterfactual bias with guarantees over distributions of prompts, addressing scalability and reliability issues of prior bias evaluations.
Findings
Generated certificates for SOTA LLMs exposing vulnerabilities.
Applied certification to various prompt distributions including jailbreaks.
Demonstrated the framework's ability to identify biases in large models.
Abstract
Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate biases across LLM responses for different demographic groups (a.k.a. counterfactual bias), as they do not scale to a large number of inputs and do not provide guarantees. Therefore, we propose the first framework, LLMCert-B, which certifies LLMs for counterfactual bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of counterfactual prompts (prompts differing by demographic groups) sampled from a distribution. We illustrate counterfactual bias certification for distributions of counterfactual prompts created by applying prefixes, sampled from prefix distributions, to a given set of prompts. We consider prefix distributions consisting random…
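The sampling procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the demographic groups, the prompt template, the toy vocabulary, and the `model` and `is_biased` callables are all hypothetical placeholders, and the random-token prefix is only a stand-in for whatever prefix distribution is being certified.

```python
import random

# Hypothetical groups and template, chosen only for illustration.
GROUPS = ["group A", "group B"]
TEMPLATE = "Complete the sentence about a person from {group}: they work as"
VOCAB = ["alpha", "beta", "gamma", "delta"]  # toy vocabulary for random prefixes


def sample_prefix(rng, length=5):
    """Draw one prefix from a toy distribution of random token sequences."""
    return " ".join(rng.choice(VOCAB) for _ in range(length))


def counterfactual_set(prefix, template=TEMPLATE, groups=GROUPS):
    """Apply the same prefix to prompts that differ only in the demographic group."""
    return [f"{prefix} {template.format(group=g)}" for g in groups]


def count_unbiased(model, is_biased, n_samples, seed=0):
    """Sample n_samples counterfactual prompt sets and count the unbiased ones.

    `model` maps a prompt string to a response string; `is_biased` takes the
    prompts and their responses and returns True when the set shows bias.
    Both are assumptions supplied by the caller.
    """
    rng = random.Random(seed)
    unbiased = 0
    for _ in range(n_samples):
        prompts = counterfactual_set(sample_prefix(rng))
        responses = [model(p) for p in prompts]
        if not is_biased(prompts, responses):
            unbiased += 1
    return unbiased


if __name__ == "__main__":
    # Stub model and detector: the model ignores the group, so no set is flagged.
    stub_model = lambda prompt: "a teacher"
    stub_detector = lambda prompts, responses: len(set(responses)) > 1
    print(count_unbiased(stub_model, stub_detector, n_samples=20))
```

The count of unbiased prompt sets out of `n_samples` independent draws is a binomial observation, which is what allows a high-confidence bound on the probability of an unbiased response to be computed from it.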
Peer Reviews
Decision: ICLR 2025 Poster
Strengths of this paper include:
- Quantification of bias in LLMs is an important issue that is of interest to the community.
- The strategy of encoding information about counterfactual prompts into a prompt distribution is interesting.
- The experiments raise valuable insights about the sort of prompts for which current LLMs are likely to generate biased responses.
Weaknesses of this paper include:
- From a statistical methodology standpoint, there is not much novelty since the method is a direct application of Clopper-Pearson intervals.
- The writing in Section 3 could be improved in several ways.
  1. Specifying what spaces the different quantities (e.g. $\mathcal{G}$, $\mathcal{A}$) live in, and what operations (e.g. string concatenation) can be applied to them.
  2. Clarifying Definition 1, especially part (3) as it is not directly obvious how an unbia
- The paper is very well-written, organized, and clear.
- As I understood the paper, the approach is straightforward: for a certain set of prompts that may be susceptible to biased model responses, the authors ask, "can we provide confidence intervals bounding how often the model will produce those unwanted/biased responses by essentially generating lots of responses under perturbations/different conditions?" I wouldn’t be surprised if others called this simplicity out as a weakness, but in fact
- I felt the use of “certificates” was not well-motivated, and even after reading the paper, I’m not sure why one would prefer certificates over just “confidence intervals.” I am unfamiliar with the literature on certificates, and I would recommend the authors better motivate this beyond talking about how others have restricted certificates to local specifications (Lines 246-255).
- The authors might suggest adding a section to connect with practitioners who want to deploy their tool, i.e. how
The authors propose a new approach to provide certifications of unbiasedness (with respect to counterfactual prompts), which has not previously been studied. This is a well-motivated problem setting, especially in the case of black-box models.
The bias metric seems to match human judgments 76% of the time (as noted in the Appendix). This seems relatively low and brings into question the usability of this metric. I also believe that this information in the Appendix is quite important and should be included in the main text of the paper.
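Since the reviews describe the method as a direct application of Clopper-Pearson intervals, a small worked sketch may help show what such a bound looks like. The code below is a standard-library illustration of the Clopper-Pearson construction, not the paper's code; the sample count (50) and confidence level (95%) in the usage lines are assumptions chosen for the example.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_p(k, n, target, iters=60):
    """Bisection for p in [0, 1] with binom_cdf(k, n, p) == target.

    For fixed k < n the CDF is strictly decreasing in p, so bisection applies.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Two-sided (1 - alpha) Clopper-Pearson interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical outcome: all 50 sampled prompt sets judged unbiased.
    print(clopper_pearson(50, 50))
    # Hypothetical outcome: 40 of 50 sampled prompt sets judged unbiased.
    print(clopper_pearson(40, 50))
```

When every one of the 50 samples is unbiased, the lower bound has the closed form $0.025^{1/50} \approx 0.929$, so even a perfect run certifies only that the probability of an unbiased response is at least about 0.93 at this sample size. Note also that the interval bounds the probability of what the bias detector labels unbiased, so its usefulness depends on how well that detector agrees with human judgment.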
Taxonomy
Topics: Computational and Text Analysis Methods
Methods: Sparse Evolutionary Training
