Evaluating Black-Box Classifiers via Stable Adaptive Two-Sample Inference

Yuchen Chen; Jing Lei

arXiv:2604.05470·stat.ME·April 8, 2026

TL;DR

This paper introduces a new statistical testing methodology for evaluating black-box multi-class classifiers by reducing the problem to two-sample distribution testing using auxiliary classifiers and conformal inference techniques.

Contribution

It proposes an approach that combines ideas from algorithmic fairness, the Neyman-Pearson lemma, and conformal p-values to test classifier performance rigorously, without assumptions on the training algorithm.

Findings

01

The method provides an asymptotically valid test under stability conditions.

02

It leverages a distinguisher trained via cross-validation to compare distributions.

03

The approach is applicable to black-box classifiers with minimal assumptions.

Abstract

We consider the problem of evaluating black-box multi-class classifiers. In the standard setup, we observe class labels $Y \in \{0, 1, \dots, M - 1\}$ generated according to the conditional distribution $Y \mid X \sim \mathrm{Multinom}(\eta(X))$, where $X$ denotes the features and $\eta$ maps from the feature space to the $(M - 1)$-dimensional simplex. A black-box classifier is an estimate $\hat{\eta}$ for which we make no assumptions about the training algorithm. Given holdout data, our goal is to evaluate the performance of the classifier $\hat{\eta}$. Recent work suggests treating this as a goodness-of-fit problem by testing the hypothesis $H_{0} : \rho((X, Y), (X', Y')) \leq \delta$, where $\rho$ is some metric between two distributions, and $(X', Y') \sim P_{X} \times \mathrm{Multinom}(\hat{\eta}(X))$. Combining ideas from algorithmic fairness, Neyman-Pearson lemma, and conformal p-values, we…
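To make the two-sample reduction concrete, here is a minimal sketch of how the two samples in $H_{0}$ could be built from holdout data: one sample keeps the observed labels $Y$, and the other draws synthetic labels $Y' \sim \mathrm{Multinom}(\hat{\eta}(X))$. The function names `sample_labels` and `make_two_samples`, the callable `eta_hat`, and the choice to split the holdout set into two disjoint halves (so the samples do not share feature rows) are assumptions made for illustration; they are not taken from the paper, and the sketch stops before any test statistic is computed.

```python
import numpy as np


def sample_labels(probs, rng):
    """Draw one label per row from the categorical distribution in that row.

    probs: array of shape (n, M); each row is a point on the simplex.
    Uses inverse-CDF sampling: label k is returned when cum[k-1] <= u < cum[k].
    """
    cum = np.cumsum(probs, axis=1)
    u = rng.random((probs.shape[0], 1))
    labels = (u >= cum).sum(axis=1)
    # Guard against floating-point rows whose cumulative sum ends just below 1.
    return np.minimum(labels, probs.shape[1] - 1)


def make_two_samples(X, Y, eta_hat, rng):
    """Split holdout data into a 'real' sample and a 'synthetic' sample.

    eta_hat: hypothetical callable mapping features (n, d) to probabilities (n, M).
    Assumption: a random half keeps its observed labels; the other half
    gets labels drawn from Multinom(eta_hat(X)).
    """
    n = X.shape[0]
    perm = rng.permutation(n)
    real_idx, synth_idx = perm[: n // 2], perm[n // 2:]
    X_synth = X[synth_idx]
    Y_synth = sample_labels(eta_hat(X_synth), rng)
    return (X[real_idx], Y[real_idx]), (X_synth, Y_synth)
```

Any two-sample procedure, such as a distinguisher trained to tell the two samples apart, would then take the pair of samples as input. If $\hat{\eta}$ equals $\eta$, the two samples have the same distribution, which is the case the null hypothesis with small $\delta$ describes.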
