Conformal Prediction for Long-Tailed Classification
Tiffany Ding, Jean-Baptiste Fermanian, Joseph Salmon

TL;DR
This paper introduces new conformal prediction methods tailored for long-tailed classification problems, balancing class-conditional coverage and set size, demonstrated on large-scale image datasets.
Contribution
It proposes a prevalence-adjusted softmax score and an interpolation procedure to improve conformal prediction in long-tailed settings.
Findings
Achieved better class-conditional coverage with smaller sets
Demonstrated effectiveness on large-scale datasets
Provided flexible trade-offs between coverage and set size
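The quantities these findings refer to (marginal coverage, class-conditional coverage, and set size) can be computed from a boolean matrix of prediction sets. The following is a minimal sketch, not the authors' evaluation code; the function names and the toy data are hypothetical, and "macro coverage" is assumed here to mean the unweighted average of per-class coverage over the classes present in the test set.

```python
import numpy as np

def marginal_coverage(sets, labels):
    """Fraction of test points whose true label is in its prediction set."""
    return sets[np.arange(len(labels)), labels].mean()

def class_conditional_coverage(sets, labels, num_classes):
    """Per-class coverage; NaN for classes with no test examples."""
    covered = sets[np.arange(len(labels)), labels]
    cov = np.full(num_classes, np.nan)
    for y in range(num_classes):
        mask = labels == y
        if mask.any():
            cov[y] = covered[mask].mean()
    return cov

def macro_coverage(sets, labels, num_classes):
    """Unweighted mean of per-class coverage over observed classes."""
    return np.nanmean(class_conditional_coverage(sets, labels, num_classes))

def average_set_size(sets):
    """Mean number of labels per prediction set."""
    return sets.sum(axis=1).mean()

# Toy data (hypothetical): sets[i, y] is True when label y is in the set for point i.
sets = np.array([
    [True, False, False],
    [True, True, False],
    [True, False, False],
    [True, False, False],
])
labels = np.array([0, 0, 0, 1])  # class 1 is rare, class 2 is absent

print(marginal_coverage(sets, labels))              # 0.75
print(class_conditional_coverage(sets, labels, 3))  # [ 1.  0. nan]
print(macro_coverage(sets, labels, 3))              # 0.5
print(average_set_size(sets))                       # 1.25
```

In this toy case the marginal coverage looks acceptable while the rare class is never covered, which is the failure mode that class-conditional coverage is meant to expose.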
Abstract
Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets that have very good class-conditional coverage but are extremely large. We propose methods with marginal coverage guarantees that smoothly trade off set size and class-conditional coverage. First, we introduce a new conformal score function called prevalence-adjusted softmax that optimizes for…
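To illustrate the idea, here is a minimal sketch of a prevalence-adjusted softmax score plugged into standard split conformal prediction. It assumes the score is the softmax probability divided by the estimated class prevalence (the form described in the reviews), negated so that lower scores mean better conformity; the sign convention, the function names, and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pas_score(probs, prevalence):
    """Negated prevalence-adjusted softmax: -p_hat(y|x) / p_hat(y).

    probs: (n, K) softmax outputs; prevalence: (K,) estimated class frequencies.
    Lower score = label conforms better (sign convention assumed here).
    """
    return -np.asarray(probs, dtype=float) / np.asarray(prevalence, dtype=float)

def conformal_quantile(scores, alpha):
    """ceil((n + 1) * (1 - alpha))-th smallest calibration score, or inf if out of range."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1] if k <= n else np.inf

def prediction_sets(probs, prevalence, qhat):
    """Boolean (n, K) matrix: label y is included when its score is at most qhat."""
    return pas_score(probs, prevalence) <= qhat

# Toy long-tailed example (hypothetical numbers): class 2 is rare.
prevalence = np.array([0.7, 0.2, 0.1])
cal_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.7, 0.2, 0.1],
])
cal_labels = np.array([0, 1, 2, 0])

# Score of the true label for each calibration point.
cal_scores = pas_score(cal_probs, prevalence)[np.arange(len(cal_labels)), cal_labels]
qhat = conformal_quantile(cal_scores, alpha=0.25)

test_probs = np.array([[0.6, 0.1, 0.3], [0.9, 0.05, 0.05]])
sets = prediction_sets(test_probs, prevalence, qhat)
print(qhat)  # -1.0
print(sets)  # [[False False  True], [ True False False]]
```

In the first test point the rare class 2 enters the set even though class 0 has the largest softmax probability, because 0.3 / 0.1 exceeds 0.6 / 0.7 after dividing by prevalence. A real calibration set would be far larger than the four points used here.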
Peer Reviews
Decision: ICLR 2026 Poster
1. Addresses the practically relevant problem of conformal prediction under long-tailed label distributions.
2. The proposed PAS/WPAS scores are simple and easy to implement.
3. The experiments offer preliminary evidence that the proposed method improves coverage fairness in the long-tailed setting.
1. Truncated dataset setup: The test datasets are balanced by truncating rare classes and retaining those with more than 100 samples per class. The choice of this threshold is not explained. Why 100? In more realistic scenarios where both the calibration and test sets are long-tailed, how would the proposed method perform?
2. Motivation: The new non-conformity score (PAS) is motivated by an oracle analysis showing that the optimal set depends on p(y|x)/p(y), but this only characterizes an ideal sol
Strengths
a) The main paper is well motivated, mostly clear, and easy to follow.
b) It tackles conformal prediction in the extreme long-tailed scenario, which is practically important.
c) The trade-off between class coverage and prediction set size seems novel as a problem formulation. The two proposed approaches also appear reasonably original.
d) The empirical studies are convincing, and the human decision-maker simulation experiment seems interesting to me.
Weaknesses
a) I could follow how PAS/WPAS and INTERP-Q work, but I could not find a clear motivation for using either or both of them. The two methods seem to address different parts of the pipeline, but the paper does not give practical guidance on when a practitioner should pick PAS/WPAS, INTERP-Q, or both together.
b) The experimental details in the appendix mention utilizing a truncated version with n-core filtering with n = 101. I am curious: doesn't this c
Targeting macro coverage with a simple change to the score is a neat idea. It connects the oracle form of the optimal set for macro coverage to a practical score based on p̂(y|x) divided by the estimated prevalence. The weighted version lets users push coverage toward special subsets like at-risk species. The paper is easy to follow. The problem is well motivated with plant identification. The two approaches are separated and labeled. Table 1 is a good map of methods and guarantees.
PAS relies on p̂(y|x) and an estimate of label prevalence. In real systems there is often label shift between train, calibration, and test. The paper does not test robustness under such shift, even though label shift directly changes the p(y) term that PAS divides by. The 1 − 2α lower bound is likely conservative, as the authors note, but the paper does not quantify the realized marginal coverage gap across settings or give a simple correction to hit a target level. Most
Taxonomy
Topics: Neural Networks and Applications
