Evaluating Black-Box Classifiers via Stable Adaptive Two-Sample Inference
Yuchen Chen, Jing Lei

TL;DR
This paper introduces a new statistical testing methodology for evaluating black-box multi-class classifiers by reducing the problem to two-sample distribution testing using auxiliary classifiers and conformal inference techniques.
Contribution
It combines ideas from algorithmic fairness, the Neyman-Pearson lemma, and conformal p-values to test classifier performance rigorously, with no assumptions on the training algorithm.
Findings
The method provides an asymptotically valid test under stability conditions.
It leverages a distinguisher trained via cross-validation to compare distributions.
The approach is applicable to black-box classifiers with minimal assumptions.
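To make the idea of a cross-validated distinguisher concrete, here is a minimal Python sketch of a generic classifier two-sample test. It is not the paper's procedure. The nearest-centroid distinguisher, the function name `cross_fit_two_sample_test`, the fold scheme, and the normal approximation (which treats held-out predictions as independent) are all assumptions made for illustration. `sample_p` and `sample_q` stand in for whatever two samples the reduction produces.

```python
import math
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_fit_two_sample_test(sample_p, sample_q, n_folds=5, seed=0):
    """Hypothetical helper: classifier two-sample test with cross-fitting.

    Labels sample_p as 0 and sample_q as 1, fits a nearest-centroid
    distinguisher on all folds but one, and scores it on the held-out fold.
    Returns (held-out accuracy, one-sided p-value for H0: same distribution).
    """
    rng = random.Random(seed)
    data = [(x, 0) for x in sample_p] + [(x, 1) for x in sample_q]
    rng.shuffle(data)
    n = len(data)
    correct = 0
    for k in range(n_folds):
        test = data[k::n_folds]
        train = [d for i, d in enumerate(data) if i % n_folds != k]
        c0 = centroid([x for x, lab in train if lab == 0])
        c1 = centroid([x for x, lab in train if lab == 1])
        for x, lab in test:
            pred = 1 if sq_dist(x, c1) < sq_dist(x, c0) else 0
            correct += int(pred == lab)
    acc = correct / n
    # Simplifying assumption: under H0 the accuracy is roughly N(0.5, 0.25 / n).
    z = (acc - 0.5) / math.sqrt(0.25 / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return acc, p_value


# Synthetic one-dimensional data, for illustration only.
data_rng = random.Random(1)
same_a = [[data_rng.gauss(0.0, 1.0)] for _ in range(200)]
same_b = [[data_rng.gauss(0.0, 1.0)] for _ in range(200)]
shifted = [[data_rng.gauss(1.5, 1.0)] for _ in range(200)]

acc_null, p_null = cross_fit_two_sample_test(same_a, same_b)
acc_alt, p_alt = cross_fit_two_sample_test(same_a, shifted)
print(acc_null, p_null)
print(acc_alt, p_alt)
```

When the two samples come from the same distribution, the held-out accuracy should sit near 0.5. A clear shift between them pushes the accuracy up and the p-value toward zero.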
Abstract
We consider the problem of evaluating black-box multi-class classifiers. In the standard setup, we observe class labels generated according to a conditional distribution given the features, determined by a function that maps from the feature space to the probability simplex over the classes. A black-box classifier is an estimate of this function for which we make no assumptions about the training algorithm. Given holdout data, our goal is to evaluate the performance of the classifier. Recent work suggests treating this as a goodness-of-fit problem by testing a hypothesis stated in terms of some metric between two distributions. Combining ideas from algorithmic fairness, the Neyman-Pearson lemma, and conformal p-values, we…
