Evaluating Model Performance Under Worst-case Subpopulations

Mike Li; Daksh Mittal; Hongseok Namkoong; Shangzhou Xia

arXiv:2407.01316·cs.LG·December 9, 2025·6 cites

TL;DR

This paper introduces a scalable method to evaluate the worst-case performance of machine learning models across subpopulations defined by core attributes, addressing distributional robustness and intersectionality.

Contribution

The paper develops a two-stage estimation procedure, with finite-sample guarantees, that assesses model robustness over complex subpopulations, including those defined by continuous attributes and by intersections of groups.

Findings

01 The method certifies model robustness on real datasets.

02 The procedure provides finite-sample convergence guarantees.

03 Evaluation error depends on the attribute dimension only through out-of-sample performance.

Abstract

The performance of ML models degrades when the training population is different from that seen under operation. Towards assessing distributional robustness, we study the worst-case performance of a model over all subpopulations of a given size, defined with respect to core attributes Z. This notion of robustness can consider arbitrary (continuous) attributes Z, and automatically accounts for complex intersectionality in disadvantaged groups. We develop a scalable yet principled two-stage estimation procedure that can evaluate the robustness of state-of-the-art models. We prove that our procedure enjoys several finite-sample convergence guarantees, including dimension-free convergence. Instead of overly conservative notions based on Rademacher complexities, our evaluation error depends on the dimension of Z only through the out-of-sample error in estimating the performance conditional on…
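To make the idea of "worst-case performance over all subpopulations of a given size" concrete, here is a minimal sketch of a two-stage plug-in estimate. It is an illustration under stated assumptions, not the authors' implementation: the function names `worst_case_mean` and `two_stage_worst_case`, the 50/50 sample split, and the linear least-squares fit of the loss on Z are all hypothetical choices made for brevity. Stage one fits a model of the loss conditional on Z on one half of the data; stage two averages the largest alpha-fraction of the predicted conditional losses on the other half.

```python
import numpy as np


def worst_case_mean(values, alpha):
    """Mean of the largest alpha-fraction of `values` (empirical tail average)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    v = np.sort(np.asarray(values, dtype=float))[::-1]  # largest losses first
    k = alpha * len(v)            # possibly fractional number of points in the tail
    full = int(np.floor(k))       # points that receive full weight
    total = v[:full].sum()
    if full < len(v):
        total += (k - full) * v[full]  # fractional weight on the boundary point
    return total / k


def two_stage_worst_case(Z, losses, alpha, seed=0):
    """Hypothetical two-stage plug-in estimate of worst-case subpopulation loss.

    Stage 1: fit a model of the loss given Z on one half of the sample
             (assumption: ordinary least squares with an intercept).
    Stage 2: on the held-out half, average the worst alpha-fraction of the
             predicted conditional losses.
    """
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    losses = np.asarray(losses, dtype=float)

    # Random 50/50 split into a fitting half and an evaluation half.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(losses))
    half = len(idx) // 2
    fit_idx, eval_idx = idx[:half], idx[half:]

    def design(A):
        # Design matrix with an intercept column.
        return np.column_stack([np.ones(len(A)), A])

    # Stage 1: least-squares regression of the loss on Z.
    coef, *_ = np.linalg.lstsq(design(Z[fit_idx]), losses[fit_idx], rcond=None)

    # Stage 2: tail average of the predicted conditional losses.
    mu_hat = design(Z[eval_idx]) @ coef
    return worst_case_mean(mu_hat, alpha)
```

With `alpha = 1.0` the estimate reduces to the average predicted loss on the held-out half, and it can only grow as `alpha` shrinks, which matches the intuition that smaller subpopulations can be worse off. Any regression model could replace the least-squares fit in stage one; the linear model here is only a placeholder.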

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Simulation Techniques and Applications