TL;DR
This paper introduces D3M, a label-free monitoring method that detects post-deployment performance deterioration by analyzing disagreement between predictive models, and raises an alert when retraining is needed in dynamic environments.
Contribution
It formalizes the post-deployment deterioration detection problem and proposes D3M, a novel, practical algorithm with theoretical guarantees and empirical validation.
Findings
Low false positive rates under non-deteriorating shifts
High true positive rates under deteriorating shifts, with sample complexity bounds
Effective on benchmark and real-world datasets
Abstract
The distribution of data changes over time; models operating in dynamic environments need retraining. But knowing when to retrain, without access to labels, is an open challenge since some, but not all, shifts degrade model performance. This paper formalizes and addresses the problem of post-deployment deterioration (PDD) monitoring. We propose D3M, a practical and efficient monitoring algorithm based on the disagreement of predictive models. It achieves low false positive rates under non-deteriorating shifts and provides sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmarks and a real-world large-scale internal medicine dataset demonstrate the effectiveness of the framework and highlight its viability as an alert mechanism for high-stakes machine learning pipelines.
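The general idea of label-free, disagreement-based monitoring can be illustrated with a short sketch. This is not the D3M algorithm itself, whose details the abstract does not give. It assumes two already-trained classifiers whose hard predictions are available as arrays, calibrates an alert threshold by bootstrapping disagreement rates on held-out in-distribution data, and flags a deployment batch whose disagreement rate exceeds that threshold. The function names and the `window`, `n_rounds`, and `alpha` parameters are all hypothetical choices made for illustration.

```python
import numpy as np


def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models predict different labels."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))


def calibrate_threshold(preds_a, preds_b, window=200, n_rounds=500,
                        alpha=0.05, seed=0):
    """Estimate an alert threshold from in-distribution predictions.

    Assumption (not from the paper): the threshold is the (1 - alpha)
    quantile of disagreement rates over bootstrap windows, so roughly
    alpha of in-distribution windows would trigger a false alert.
    """
    rng = np.random.default_rng(seed)
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    n = len(preds_a)
    rates = np.empty(n_rounds)
    for i in range(n_rounds):
        # Sample a window of indices with replacement.
        idx = rng.integers(0, n, size=window)
        rates[i] = disagreement_rate(preds_a[idx], preds_b[idx])
    return float(np.quantile(rates, 1.0 - alpha))


def should_alert(preds_a, preds_b, threshold):
    """Flag a deployment batch whose disagreement exceeds the threshold."""
    return disagreement_rate(preds_a, preds_b) > threshold
```

No labels are needed at deployment time: only the two models' predictions on the incoming batch are compared. A lower `alpha` reduces false alerts on harmless shifts at the cost of needing a larger disagreement before an alert fires.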
