Uncovering Drift in Textual Data: An Unsupervised Method for Detecting and Mitigating Drift in Machine Learning Models
Saeed Khaki, Akhouri Abhinav Aditya, Zohar Karnin, Lan Ma, Olivia Pan, Samarth Marudheri Chandrashekar

TL;DR
This paper introduces an unsupervised method for detecting and mitigating data drift in machine learning models. The method relies on a kernel-based statistical test and improves model performance without requiring human annotations.
Contribution
The paper presents a novel unsupervised approach employing maximum mean discrepancy to detect drift and identify its root causes, enabling proactive model maintenance.
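For reference, the standard definition of the squared maximum mean discrepancy between a reference distribution P and a target distribution Q is given below. The symbols P, Q, and the kernel k are notation chosen here for illustration; this summary does not state which kernel or which estimator the paper uses.

```latex
\mathrm{MMD}^2(P, Q)
  = \mathbb{E}_{x, x' \sim P}\big[k(x, x')\big]
  - 2\,\mathbb{E}_{x \sim P,\; y \sim Q}\big[k(x, y)\big]
  + \mathbb{E}_{y, y' \sim Q}\big[k(y, y')\big]
```

For a characteristic kernel such as the Gaussian kernel, this quantity is zero exactly when P and Q are the same distribution, which is what makes it usable as a test statistic for drift.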
Findings
Effective drift detection via MMD-based statistical test
Identified high-drift data subsets improve retrained model performance
Reduces reliance on human annotation for drift detection
Abstract
Drift in machine learning refers to the phenomenon where the statistical properties of the data, or of the context in which the model operates, change over time, leading to a decrease in model performance. Continuous monitoring of model performance is therefore crucial for proactively preventing performance regression. However, supervised drift detection methods require human annotation, which lengthens the time needed to detect and mitigate drift. Our proposed unsupervised drift detection method follows a two-step process. In the first step, we encode a sample of production data as the target distribution and the model training data as the reference distribution. In the second step, we employ a kernel-based statistical test that uses the maximum mean discrepancy (MMD) distance metric to compare the reference and…
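The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder is left out (random vectors stand in for text embeddings), and the Gaussian kernel, the median bandwidth heuristic, the biased MMD estimate, and the permutation-based p-value are all choices made here for the example. The function name `mmd_permutation_test` is hypothetical.

```python
import numpy as np


def mmd_permutation_test(reference, target, n_permutations=200, seed=0):
    """Return (biased MMD^2 estimate, permutation p-value).

    `reference` and `target` are 2-D arrays of embeddings (rows = samples).
    Kernel, bandwidth rule, and test procedure are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([reference, target])
    n = len(reference)

    # Pairwise squared Euclidean distances over the pooled sample.
    sq_norms = (pooled ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * pooled @ pooled.T
    np.maximum(sq_dists, 0.0, out=sq_dists)
    np.fill_diagonal(sq_dists, 0.0)

    # Gaussian kernel with the median heuristic for the bandwidth.
    gamma = 1.0 / np.median(sq_dists[sq_dists > 0])
    kernel = np.exp(-gamma * sq_dists)

    def mmd2(k):
        # Biased (V-statistic) estimate: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy).
        return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()

    observed = mmd2(kernel)

    # Null distribution: shuffle which rows count as reference vs. target.
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        if mmd2(kernel[np.ix_(perm, perm)]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    train_emb = rng.normal(0.0, 1.0, size=(80, 16))  # stand-in for training-data embeddings
    prod_emb = rng.normal(0.5, 1.0, size=(80, 16))   # stand-in for production embeddings
    stat, p = mmd_permutation_test(train_emb, prod_emb)
    print(f"MMD^2 = {stat:.4f}, p-value = {p:.4f}")
```

A small p-value suggests that the production sample is unlikely to come from the same distribution as the training data. The permutation approach avoids distributional assumptions about the statistic, at the cost of recomputing it once per shuffle.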
Taxonomy
Topics: Data Stream Mining Techniques · Air Quality Monitoring and Forecasting · Machine Learning and Data Classification
