Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare
Elizabeth W. Miller, Jeffrey D. Blume

TL;DR
This paper introduces diagnostics to measure and evaluate individual-level prediction stability in healthcare machine learning models, highlighting the importance of stability for clinical trust.
Contribution
It proposes a framework with two diagnostics, empirical prediction interval width and decision flip rate, to quantify individual prediction variability.
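As a rough illustration of how the two diagnostics could be computed, the sketch below takes a matrix of risk estimates for the same patients from repeated fits that differ only in random seed. The function names, the 95% level, the 0.5 threshold, and the definition of flip rate as the share of runs that disagree with the majority decision are all assumptions made for this example; the paper's exact definitions may differ.

```python
import numpy as np

def empirical_interval_width(preds, level=0.95):
    """Per-patient width of the empirical prediction interval.

    preds: array of shape (n_runs, n_patients) holding the predicted risk
    for each patient from each repeated fit (hypothetical input layout).
    """
    alpha = 1.0 - level
    lower = np.quantile(preds, alpha / 2, axis=0)
    upper = np.quantile(preds, 1 - alpha / 2, axis=0)
    return upper - lower

def decision_flip_rate(preds, threshold=0.5):
    """Per-patient share of runs whose decision differs from the majority.

    Assumed definition: threshold each run's risk, then report the fraction
    of runs in the minority. 0 means every run agrees; 0.5 is an even split.
    """
    treat = preds >= threshold          # boolean decisions per run
    share_treat = treat.mean(axis=0)    # fraction of runs recommending treatment
    return np.minimum(share_treat, 1.0 - share_treat)

# Simulated stand-in for repeated training runs (not real model output):
# each patient has a base risk, and each run perturbs it with seed noise.
rng = np.random.default_rng(0)
base_risk = rng.uniform(0.05, 0.95, size=5)
run_noise = rng.normal(0.0, 0.05, size=(20, 5))
preds = np.clip(base_risk + run_noise, 0.0, 1.0)

print("interval width:", np.round(empirical_interval_width(preds), 3))
print("flip rate:     ", np.round(decision_flip_rate(preds), 3))
```

In this setup, patients whose base risk sits close to the threshold tend to show nonzero flip rates even when their interval width is similar to everyone else's, which is why the two diagnostics are reported side by side.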
Findings
Randomness from optimization alone can cause variability in individual predictions comparable to that caused by resampling the training data.
Neural networks show greater instability than logistic regression.
Instability near decision thresholds can change treatment recommendations.
Abstract
In healthcare, predictive models increasingly inform patient-level decisions, yet little attention is paid to the variability in individual risk estimates and its impact on treatment decisions. For overparameterized models, now standard in machine learning, a substantial source of variability often goes undetected. Even when the data and model architecture are held fixed, randomness introduced by optimization and initialization can lead to materially different risk estimates for the same patient. This problem is largely obscured by standard evaluation practices, which rely on aggregate performance metrics (e.g., log-loss, accuracy) that are agnostic to individual-level stability. As a result, models with indistinguishable aggregate performance can nonetheless exhibit substantial procedural arbitrariness, which can undermine clinical trust. We propose an evaluation framework that…
