Conformal Prediction for Long-Tailed Classification

Tiffany Ding; Jean-Baptiste Fermanian; Joseph Salmon

arXiv:2507.06867·stat.ML·March 2, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces new conformal prediction methods tailored for long-tailed classification problems, balancing class-conditional coverage and set size, demonstrated on large-scale image datasets.

Contribution

It proposes a prevalence-adjusted softmax score and an interpolation procedure to improve conformal prediction in long-tailed settings.

Findings

01. Achieved better class-conditional coverage with smaller sets
02. Demonstrated effectiveness on large-scale datasets
03. Provided flexible trade-offs between coverage and set size
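One way to picture a trade-off of this kind is to interpolate between a single pooled conformal threshold and per-class thresholds. The sketch below is an illustration under assumptions, not the paper's procedure: the linear form, the parameter name `tau`, and the function names `conformal_quantile` and `interpolated_thresholds` are hypothetical, and no coverage guarantee is claimed for it here.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile; +inf when there are too few scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def interpolated_thresholds(cal_scores, cal_labels, num_classes, alpha, tau):
    """Per-class thresholds between pooled (tau=0) and classwise (tau=1).

    Assumed form (hypothetical): tau * q_classwise + (1 - tau) * q_pooled.
    cal_scores[i] is the score of the true label of calibration point i.
    """
    q_pooled = conformal_quantile(cal_scores, alpha)
    q_classwise = np.array([
        conformal_quantile(cal_scores[cal_labels == y], alpha)
        for y in range(num_classes)
    ])
    # Handle the endpoints explicitly so that 0 * inf never produces NaN.
    if tau == 0.0:
        return np.full(num_classes, q_pooled)
    if tau == 1.0:
        return q_classwise
    return tau * q_classwise + (1.0 - tau) * q_pooled
```

With these thresholds, label y would enter the prediction set for x whenever its score is at most `thresholds[y]`. Classes with too few calibration points get an infinite classwise quantile, so for any `tau > 0` they are always included, which is the behaviour that makes purely classwise sets very large in long-tailed data.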

Abstract

Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets that have very good class-conditional coverage but are extremely large. We propose methods with marginal coverage guarantees that smoothly trade off set size and class-conditional coverage. First, we introduce a new conformal score function called prevalence-adjusted softmax that optimizes for…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Addresses the practically relevant problem of conformal prediction under long-tailed label distributions.
2. The proposed PAS/WPAS scores are simple and easy to implement.
3. The experiments offer preliminary evidence that the proposed method improves coverage fairness in the long-tailed setting.

Weaknesses

1. Truncated dataset setup: The test datasets are balanced by truncating rare classes and retaining those with more than 100 samples per class. The choice of this threshold is not explained. Why 100? In more realistic scenarios where both calibration and test sets are long-tailed, how would the proposed method perform?
2. Motivation: The new non-conformity score (PAS) is motivated by an oracle analysis showing that the optimal set depends on p(y|x)/p(y), but this only characterizes an ideal sol

Reviewer 02 · Rating 6 · Confidence 3

Strengths

a) The main paper is well motivated, mostly clear, and easy to follow.
b) It tackles conformal prediction in the extreme long-tailed scenario, which is practically important.
c) The class coverage vs. prediction set size trade-off as a problem formulation itself seems novel. The two proposed approaches also appear reasonably original.
d) The empirical studies are convincing, and the human decision-maker simulation experiment seems interesting to me.

Weaknesses

a) I could understand how PAS/WPAS and INTERP-Q work, but I couldn't clearly find the motivation for using either or both of them. The two methods seem to address different parts of the pipeline, but the paper does not give a clear practical guideline for when a practitioner should pick PAS/WPAS, INTERP-Q, or use them together.
b) The experimental details in the appendix mention utilizing a truncated version with n-core filtering with n = 101. I am curious: doesn't this c

Reviewer 03 · Rating 6 · Confidence 4

Strengths

Targeting macro coverage with a simple change to the score is a neat idea. It connects the oracle form of the optimal set for macro coverage to a practical score based on p̂(y|x) divided by the estimated prevalence. The weighted version lets users push coverage toward special subsets like at-risk species. The paper is easy to follow. The problem is well motivated with plant identification. The two approaches are separated and labeled. Table 1 is a good map of methods and guarantees.
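The score described in this review can be sketched in a few lines of split conformal code. This is a minimal illustration under assumptions rather than the authors' implementation: the sign convention (negative ratio, so lower means more conforming), estimating prevalence from label frequencies, and the function names `pas_scores`, `conformal_quantile`, and `prediction_sets` are all choices made here for the example.

```python
import numpy as np

def pas_scores(probs, prevalence):
    """Prevalence-adjusted score: -p_hat(y|x) / p_hat(y) (assumed sign convention).

    probs: (n, K) softmax outputs; prevalence: (K,) estimated class frequencies.
    """
    return -probs / prevalence  # broadcasts the (K,) prevalence over rows

def conformal_quantile(scores, alpha):
    """Finite-sample conformal quantile; +inf when there are too few scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def prediction_sets(probs, prevalence, qhat):
    """Boolean (n, K) mask: label y is in the set when its score is <= qhat."""
    return pas_scores(probs, prevalence) <= qhat

# Toy usage on synthetic long-tailed data (all numbers are made up).
rng = np.random.default_rng(0)
n, K = 500, 5
prevalence = np.array([0.6, 0.2, 0.1, 0.07, 0.03])
labels = rng.choice(K, size=n, p=prevalence)
logits = rng.normal(size=(n, K))
logits[np.arange(n), labels] += 2.0  # make the true class more likely
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Calibrate a single threshold on the true-label scores.
cal_scores = pas_scores(probs, prevalence)[np.arange(n), labels]
qhat = conformal_quantile(cal_scores, alpha=0.1)
sets = prediction_sets(probs, prevalence, qhat)
```

Dividing by prevalence lowers the score of rare classes relative to plain softmax, so a single pooled threshold admits them more readily. The example calibrates and evaluates on the same points only for brevity; a real split would hold out separate calibration and test data.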

Weaknesses

PAS relies on p̂(y|x) and an estimate of label prevalence. In real systems there is often label shift between train, calibration, and test. The paper does not test robustness under such shift, even though label shift directly changes the p(y) term that PAS divides by. The 1 − 2α lower bound is likely conservative, as the authors note, but the paper does not quantify the realized marginal coverage gap across settings or give a simple correction to hit a target level. Most


Taxonomy

Topics: Neural Networks and Applications