Conformal Prediction Sets Can Cause Disparate Impact
Jesse C. Cresswell, Bhargava Kumar, Yi Sui, Mouloud Belbahri

TL;DR
This paper reveals that conformal prediction sets, while useful for uncertainty quantification, can unintentionally cause fairness issues, and proposes a new approach to mitigate disparate impact by equalizing set sizes.
Contribution
The paper demonstrates that equalized coverage can increase disparate impact and introduces set size equalization as a more effective fairness criterion.
Findings
Providing prediction sets can lead to disparate impact.
Equalized coverage may increase disparities.
Equalizing set sizes reduces disparate impact.
Abstract
Conformal prediction is a statistically rigorous method for quantifying uncertainty in models by having them output sets of predictions, with larger sets indicating more uncertainty. However, prediction sets are not inherently actionable; many applications require a single output to act on, not several. To overcome this limitation, prediction sets can be provided to a human who then makes an informed decision. In any such system it is crucial to ensure the fairness of outcomes across protected groups, and researchers have proposed that Equalized Coverage be used as the standard for fairness. By conducting experiments with human participants, we demonstrate that providing prediction sets can lead to disparate impact in decisions. Disquietingly, we find that providing sets that satisfy Equalized Coverage actually increases disparate impact compared to marginal coverage. Instead of…
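The difference between marginal coverage and Equalized Coverage can be made concrete with a minimal sketch of split conformal prediction. This is not the authors' code: the toy classifier (`toy_probs`), the two groups "A" and "B", the score function (one minus the probability assigned to a class), and all constants are assumptions chosen only for illustration. Marginal calibration uses one threshold for everyone; group-wise (Mondrian-style) calibration fits one threshold per group, which equalizes coverage but lets set sizes differ between groups.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points: keep every class
    return sorted(scores)[k - 1]

def calibrate(probs, labels, groups, alpha, per_group=False):
    """Return {group: threshold}. per_group=True fits one threshold per group."""
    scores = [1.0 - p[y] for p, y in zip(probs, labels)]
    names = sorted(set(groups))
    if not per_group:
        qhat = conformal_quantile(scores, alpha)
        return {g: qhat for g in names}
    return {
        g: conformal_quantile([s for s, h in zip(scores, groups) if h == g], alpha)
        for g in names
    }

def prediction_set(p, qhat):
    """All classes whose score 1 - p(class) is at most the threshold."""
    return [c for c, pc in enumerate(p) if 1.0 - pc <= qhat]

def summarize(probs, labels, groups, qhats):
    """Return {group: (empirical coverage, average set size)}."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(p, y) for p, y, h in zip(probs, labels, groups) if h == g]
        sets = [prediction_set(p, qhats[g]) for p, _ in rows]
        coverage = sum(y in s for s, (_, y) in zip(sets, rows)) / len(rows)
        avg_size = sum(len(s) for s in sets) / len(rows)
        out[g] = (coverage, avg_size)
    return out

def toy_probs(rng, label, n_classes, noise):
    """Hypothetical classifier output: softmax of noisy logits (assumption)."""
    logits = [rng.gauss(0.0, noise) for _ in range(n_classes)]
    logits[label] += 2.0  # the true class gets a boost
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Synthetic data: the toy model is made noisier on group "B" (assumption).
rng = random.Random(0)
n_classes, alpha = 5, 0.1
groups = ["A" if i % 2 == 0 else "B" for i in range(400)]
labels = [rng.randrange(n_classes) for _ in groups]
probs = [
    toy_probs(rng, y, n_classes, 0.5 if g == "A" else 2.0)
    for y, g in zip(labels, groups)
]

marginal = calibrate(probs, labels, groups, alpha, per_group=False)
mondrian = calibrate(probs, labels, groups, alpha, per_group=True)
stats_marginal = summarize(probs, labels, groups, marginal)
stats_mondrian = summarize(probs, labels, groups, mondrian)

for name, stats in [("marginal", stats_marginal), ("group-wise", stats_mondrian)]:
    for g, (cov, size) in stats.items():
        print(f"{name:10s} group {g}: coverage={cov:.3f}, avg set size={size:.2f}")
```

To keep the sketch short, the statistics are computed on the calibration data itself; a real evaluation would use held-out test data. Comparing the per-group average set sizes under the two calibration modes is one simple way to see the quantity that the paper proposes to equalize.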
Peer Reviews
Decision: ICLR 2025 Spotlight
The authors rightly identify "equal coverage" as a limited notion of fairness for prediction sets, in the sense that the end-goal of providing prediction sets is ultimately to improve a downstream task -- i.e., equalized coverage in itself ought to be desirable only to the extent that it improves the utility of predictions in general. Evaluation via human study is also an important perspective; I see this work as playing a similar role as the work that conducted human evaluations of explainabili
* The main weakness of this work to me is that its hypotheses are a priori unsurprising. If some subgroup A requires larger set sizes for the same coverage level, it is tautologically true that there is higher uncertainty for subgroup A and that any class in the prediction set is less likely to be the ground-truth label. Even if a person relied 100% on the prediction set and simply picked uniformly at random among the predictions, one would expect performance to be worse on groups with bigger se
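The reviewer's random-picking argument can be written out as a short calculation. This is a sketch under simplifying assumptions that are not in the source: every prediction set within a group $g$ has the same size $k_g$, coverage is exactly $1-\alpha$ in every group, and the decision-maker picks uniformly at random from the set $C(x)$.

```latex
\Pr(\hat{y} = y)
  = \mathbb{E}\!\left[\frac{\mathbf{1}\{y \in C(x)\}}{|C(x)|}\right]
  = \frac{\Pr\big(y \in C(x)\big)}{k_g}
  = \frac{1-\alpha}{k_g}
```

For example, at 90% coverage a group whose sets have size 2 would see accuracy $0.9 / 2 = 0.45$ under this rule, while a group whose sets have size 4 would see $0.9 / 4 = 0.225$, even though both groups have identical coverage.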
* The background on conformal prediction provided by the authors is well written, as are the justifications for the models picked at inference time.
* The authors pre-registered the hypothesis they tested.
* I think the findings are generally novel and interesting. As Figure 5 indicates, the fact that utilizing Mondrian CP can reduce accuracy on certain groups has important implications for fairness practices when prediction sets are required.
* Generally, this feels like a paper that could s
* The authors' primary focus is to study disparate impact as measured by improvement *gain* from a given CP treatment (as compared to a control). This is fine, but I would argue that if a CP treatment improves accuracy on each sub-group (even if by a different amount) then this isn't necessarily a bad thing; for example, minimax fairness would be improved. In this sense, I actually found Figure 5 the most interesting observation, as it indicates how Mondrian CP can actually leave the worst-cas
As mentioned, I find that such connections between metrics on the set and end-behavior are needed. The paper is quite easy to read, and its claims are well supported by evidence.
The abstract could be a bit clearer, particularly for readers less familiar with the subject.
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Sparse Evolutionary Training
