Is this model reliable for everyone? Testing for strong calibration
Jean Feng, Alexej Gossmann, Romain Pirracchio, Nicholas Petrick, Gene Pennello, Berkman Sahiner

TL;DR
This paper introduces a novel changepoint detection method for testing strong calibration in risk prediction models, improving power over existing approaches, especially when the miscalibration signal is weak or the poorly calibrated subgroup is small.
Contribution
The authors develop a new calibration testing procedure based on residual reordering and changepoint detection, with an adaptive CUSUM test that incorporates cross-validation for enhanced power.
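To make the idea of testing residuals in a chosen order concrete, here is a minimal sketch of a CUSUM-style calibration check. It is not the authors' procedure: the ordering score is assumed to be supplied by the user rather than learned adaptively with cross-validation, and the names `cusum_statistic`, `cusum_calibration_test`, and `order_score` are hypothetical. The null distribution is simulated using the fact that, if the model is calibrated, each outcome is a Bernoulli draw with the predicted probability.

```python
import numpy as np


def cusum_statistic(y, p, order_score):
    """Largest absolute partial sum of residuals, scaled by sqrt(n).

    Observations are sorted so that those with the highest order_score
    (hypothetical: a guess at where miscalibration is worst) come first.
    """
    idx = np.argsort(-np.asarray(order_score, dtype=float))
    resid = (np.asarray(y, dtype=float) - np.asarray(p, dtype=float))[idx]
    return np.max(np.abs(np.cumsum(resid))) / np.sqrt(len(resid))


def cusum_calibration_test(y, p, order_score, n_sim=500, seed=0):
    """Monte Carlo p-value for H0: y_i ~ Bernoulli(p_i) for every i.

    Assumption of this sketch: order_score is fixed in advance and does
    not depend on y, so the simulated null distribution is valid.
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    observed = cusum_statistic(y, p, order_score)
    null_stats = np.array([
        cusum_statistic(rng.binomial(1, p), p, order_score)
        for _ in range(n_sim)
    ])
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_sim)
    return observed, p_value


# Illustrative data: the model underestimates risk by 0.2 when x > 0.8.
rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(size=n)
p_hat = rng.uniform(0.2, 0.6, size=n)
y = rng.binomial(1, p_hat + 0.2 * (x > 0.8))
stat, p_value = cusum_calibration_test(y, p_hat, order_score=x)
print(f"statistic={stat:.2f}, p-value={p_value:.4f}")
```

If the ordering places the poorly calibrated observations first, their residuals accumulate before the well-calibrated ones dilute them, which is why a good ordering matters for power. With an uninformative ordering the same statistic would behave much like a random walk.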
Findings
Higher power in simulation studies compared to existing methods
More than doubled power in mortality risk model auditing
Effective detection of poorly calibrated subgroups
Abstract
In a well-calibrated risk prediction model, the average predicted probability is close to the true event rate for any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, the task of auditing a model for strong calibration is well-known to be difficult -- particularly for machine learning (ML) algorithms -- due to the sheer number of potential subgroups. As such, common practice is to only assess calibration with respect to a few predefined subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signal or where the poorly calibrated subgroup is small, as they either overly subdivide the data or fail to divide the data at all. We introduce a new testing procedure based on the following insight: if we can reorder observations by their…
Taxonomy
Topics: Statistical Methods and Inference · Advanced Causal Inference Techniques · Statistical Methods in Clinical Trials
