Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing

Ruicheng Xian; Yuxuan Wan; Han Zhao

arXiv:2508.11258·cs.LG·August 18, 2025


3 Reviews

TL;DR

This paper introduces a post-processing framework that enables group fairness in predictions from closed large language models by extracting features via prompting and applying fair algorithms, addressing fairness in high-stakes applications.

Contribution

It proposes a novel prompting-based approach to derive fair classifiers from closed LLMs, bypassing the need for fine-tuning or head-tuning.
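To make the idea of post-processing on top of a closed model concrete, here is a minimal sketch. It is not the paper's algorithm: it swaps in a simple, generic technique (per-group thresholds that equalize positive prediction rates, i.e. demographic parity) and replaces real LLM outputs with synthetic scores. The names `group_thresholds`, `predict_fair`, and `dp_gap`, the target rate of 0.4, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def group_thresholds(scores, groups, target_rate):
    """Pick one threshold per group so that roughly target_rate of it is predicted positive."""
    thresholds = {}
    for g in np.unique(groups):
        s = scores[groups == g]
        # The (1 - target_rate) quantile leaves about target_rate of the group above it.
        thresholds[g] = np.quantile(s, 1.0 - target_rate)
    return thresholds

def predict_fair(scores, groups, thresholds):
    """Apply the group-specific threshold to every example."""
    cut = np.array([thresholds[g] for g in groups])
    return (scores > cut).astype(int)

def dp_gap(pred, groups):
    """Largest difference in positive prediction rate between groups."""
    rates = [pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

# Synthetic stand-in for probabilities obtained by prompting a closed LLM
# (assumption: no real model is queried here). Scores are shifted for group 1
# to mimic a model whose outputs differ systematically across groups.
rng = np.random.default_rng(0)
n = 2000
groups = rng.integers(0, 2, size=n)
scores = np.clip(rng.beta(2, 2, size=n) + 0.15 * groups, 0.0, 1.0)

naive = (scores > 0.5).astype(int)  # one shared threshold
thr = group_thresholds(scores, groups, target_rate=0.4)
fair = predict_fair(scores, groups, thr)  # group-aware thresholds

print("gap with shared threshold:", dp_gap(naive, groups))
print("gap with group thresholds:", dp_gap(fair, groups))
```

The point of the sketch is only that the fairness adjustment touches the model's outputs, not its weights, which is why this family of methods remains usable when only an API is available.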

Findings

01

Strong accuracy-fairness tradeoffs achieved on multiple datasets

02

Framework outperforms fine-tuning and training-from-scratch methods

03

Data-efficient approach suitable for commercial closed LLMs

Abstract

Instruction fine-tuned large language models (LLMs) enable a simple zero-shot or few-shot prompting paradigm, also known as in-context learning, for building prediction models. This convenience, combined with continued advances in LLM capability, has the potential to drive their adoption across a broad range of domains, including high-stakes applications where group fairness -- preventing disparate impacts across demographic groups -- is essential. The majority of existing approaches to enforcing group fairness on LLM-based classifiers rely on traditional fair algorithms applied via model fine-tuning or head-tuning on final-layer embeddings, but they are no longer applicable to closed-weight LLMs under the in-context learning setting, which include some of the most capable commercial models today, such as GPT-4, Gemini, and Claude. In this paper, we propose a framework for deriving fair…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

1. The paper is clearly written, accessible, and easy to follow.
2. It addresses the important and timely issue of achieving group fairness for closed-source LLMs.
3. The experimental evaluation is comprehensive and convincing, spanning five diverse datasets (including multi-class and overlapping-group cases), four LLMs (open and closed), and three fairness algorithms.

Weaknesses

1. Because this framework extracts P(A | Y, X) from the LLM, the extracted features inherit the model's own biases toward sensitive attributes. The "sufficient statistics" are therefore not objective measurements but potentially biased surrogate indicators, so downstream calibration/post-processing may only mitigate discrepancies measured on these surrogates without truly correcting for bias.
2. The probability extraction schemes typically require K+1 API calls
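The review text is cut off here, so the exact extraction scheme is not visible. One reading consistent with a count of K+1 calls, and it is only an assumption, is one call for the class distribution P(Y | X) plus one call per class k for P(A | Y = k, X), combined by the product rule P(Y = k, A = a | X) = P(Y = k | X) · P(A = a | Y = k, X). The sketch below illustrates that arithmetic with made-up numbers; `joint_from_calls` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

def joint_from_calls(p_y, p_a_given_y):
    """Combine P(Y | x) and P(A | Y=k, x) into the joint P(Y, A | x).

    p_y:          shape (K,),   entry k is P(Y=k | x)       (assumed: 1 call)
    p_a_given_y:  shape (K, G), row k is P(A | Y=k, x)      (assumed: K calls)
    Returns an array of shape (K, G) whose entries sum to 1.
    """
    p_y = np.asarray(p_y, dtype=float)
    p_a_given_y = np.asarray(p_a_given_y, dtype=float)
    # Renormalize, since probabilities read from an API may not sum exactly to 1.
    p_y = p_y / p_y.sum()
    p_a_given_y = p_a_given_y / p_a_given_y.sum(axis=1, keepdims=True)
    # Product rule, broadcast over the group axis.
    return p_y[:, None] * p_a_given_y

# Made-up numbers for K = 2 classes and G = 2 groups.
p_y = [0.7, 0.3]
p_a_given_y = [[0.6, 0.4],
               [0.2, 0.8]]

joint = joint_from_calls(p_y, p_a_given_y)
p_a = joint.sum(axis=0)  # marginal P(A | x), obtained by summing out Y
print(joint)
print(p_a)
```

Under this reading, the reviewer's first concern also becomes visible: every entry of the joint is built from the model's own estimate of the sensitive attribute, so any bias in those estimates carries into the features.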

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The problem has significant practical value: fairness for closed-source LLMs is an issue of considerable practical significance, especially under the realistic constraint that model weights are inaccessible.
2. The framework is general: the method is compatible with various fairness algorithms and fairness definitions, and can be applied to different types of tasks (text and tabular data).
3. The framework has strong universality: The metho

Weaknesses

1. The innovation is limited. The method essentially combines existing post-processing fairness algorithms with LLM outputs, which adds little novelty, especially considering the existing work [1, 2].
2. Insufficient depth of analysis. The paper mainly focuses on verifying the "feasibility of the post-processing framework", but does not provide an in-depth theoretical analysis of the mechanism by which fairness improves. In addition, there is a lack of statistical significance testing for the improveme

Reviewer 03·Rating 6·Confidence 3

Strengths

The algorithm itself is clearly presented, particularly in Figure 1, which does an excellent job of outlining the technique. For the low-data regime, this work presents an interesting solution to mitigating bias in LLM-based classifiers where you don’t have access to the weights or embeddings of the model. This is a valuable contribution to the community. The technique the authors use to extract the prior and conditional likelihoods does a good job of eliciting the inherent bias in the model

Weaknesses

The abstract and introduction aren’t as clear as they could be about what the algorithm is doing. In lines 024-025, where you mention probabilistic predictions, you could mention that you are using prompting to elicit the prior distribution over classes and the conditional likelihoods to model the inherent bias in the model, which then form the features for the lightweight fair classifier. Also, in the abstract you claim that your algorithm outperforms training from scratch on raw tabular features, bu
