Is this model reliable for everyone? Testing for strong calibration

Jean Feng; Alexej Gossmann; Romain Pirracchio; Nicholas Petrick; Gene Pennello; Berkman Sahiner

arXiv:2307.15247 · cs.LG · July 31, 2023 · 1 citation

TL;DR

This paper introduces a changepoint-detection-based test for strong calibration in risk prediction models. It improves power over existing approaches, especially when the miscalibration signal is weak or the poorly calibrated subgroup is small.

Contribution

The authors develop a new calibration testing procedure based on residual reordering and changepoint detection, with an adaptive CUSUM test that incorporates cross-validation for enhanced power.
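As a rough illustration of the idea of reordering residuals and looking for a changepoint, the sketch below computes a basic CUSUM-type statistic (the largest absolute partial sum of residuals) and a Monte Carlo p-value under the null that outcomes follow the predicted probabilities. It is not the authors' adaptive procedure and omits their cross-validation step. The names `cusum_statistic`, `cusum_calibration_test`, and the `score` argument are hypothetical; `score` is assumed to be fixed in advance or fitted on separate data, so that it does not depend on the outcomes being tested.

```python
import random


def cusum_statistic(residuals):
    """Largest absolute partial sum of the residuals, scaled by sqrt(n)."""
    running, best = 0.0, 0.0
    for r in residuals:
        running += r
        best = max(best, abs(running))
    return best / len(residuals) ** 0.5


def cusum_calibration_test(y, p, score, n_sim=500, seed=0):
    """Monte Carlo p-value for H0: y_i ~ Bernoulli(p_i).

    `score` is a hypothetical per-observation ranking of suspected
    miscalibration; it must not be derived from `y` itself.
    """
    rng = random.Random(seed)
    # Put the observations suspected to be worst calibrated first.
    order = sorted(range(len(y)), key=lambda i: -score[i])
    p_ord = [p[i] for i in order]
    observed = cusum_statistic([y[i] - p[i] for i in order])

    exceed = 0
    for _ in range(n_sim):
        # Simulate outcomes from the model's own predicted probabilities.
        y_sim = [1 if rng.random() < q else 0 for q in p_ord]
        stat = cusum_statistic([ys - q for ys, q in zip(y_sim, p_ord)])
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_sim + 1)
```

If the ordering concentrates a miscalibrated subgroup at the front, the partial sums drift away from zero early and the statistic becomes large relative to its simulated null distribution.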

Findings

01 Higher power in simulation studies compared to existing methods

02 More than doubled power in mortality risk model auditing

03 Effective detection of poorly calibrated subgroups

Abstract

In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their…
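The calibration notion in the abstract's first sentence can be made concrete for a single predefined subgroup: compare the average predicted probability with the observed event rate among its members. The helper below is a minimal sketch of that comparison only, not the paper's test; the function name `calibration_gap` and the boolean `members` mask are assumptions made for illustration.

```python
def calibration_gap(y, p, members):
    """Mean predicted probability minus observed event rate in a subgroup.

    y: observed binary outcomes (0 or 1)
    p: predicted probabilities
    members: booleans marking which observations belong to the subgroup
    """
    idx = [i for i, m in enumerate(members) if m]
    if not idx:
        raise ValueError("subgroup is empty")
    mean_pred = sum(p[i] for i in idx) / len(idx)
    event_rate = sum(y[i] for i in idx) / len(idx)
    return mean_pred - event_rate
```

A gap near zero for every subgroup is what strong calibration asks for; checking only a few predefined masks, as in common practice, leaves all other subgroups unexamined.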

Code & Models

Repositories

jjfeng/testing_strong_calibration · Official

Taxonomy

Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques · Statistical Methods in Clinical Trials