MechDetect: Detecting Data-Dependent Errors
Philipp Jung, Nicholas Chandler, Sebastian Jäger, Felix Biessmann

TL;DR
MechDetect is a machine-learning-based algorithm that, given a tabular data set and a corresponding error mask, estimates whether the errors depend on the data itself. Knowing how errors were generated helps in tracing and fixing them.
Contribution
MechDetect extends established methods for detecting the mechanisms underlying missing values to general data errors, provided that an error mask is available.
Findings
Effective in identifying data-dependent errors
Applicable to any error type for which an error mask is available
Demonstrated success on benchmark datasets
Abstract
Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on…
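The core idea in the abstract, checking whether an error mask can be predicted from the data, can be illustrated with a small sketch. This is not the MechDetect algorithm itself: instead of trained machine learning models it swaps in the simplest possible predictor (the raw values of one column) and a permutation test on the AUC. The function names `auc` and `dependence_p_value`, the toy data, and the parameter defaults are all assumptions made for illustration.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance that an erroneous row scores above a clean row.

    Ties count as 0.5. A value near 0.5 means the scores carry no
    information about where the errors are.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("mask must contain both erroneous and clean rows")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dependence_p_value(column, mask, n_permutations=200, seed=0):
    """Permutation test: is the mask more predictable from `column` than chance?

    Hypothetical helper, not the authors' procedure. Shuffling the mask
    breaks any link between data and errors, which gives a null
    distribution for the AUC's distance from 0.5.
    """
    rng = random.Random(seed)
    observed = abs(auc(column, mask) - 0.5)
    shuffled = list(mask)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(auc(column, shuffled) - 0.5) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy data (assumed): one numeric column and two error masks.
values = [i / 100 for i in range(100)]
dependent_mask = [v >= 0.7 for v in values]               # errors only at large values
mask_rng = random.Random(1)
random_mask = [mask_rng.random() < 0.3 for _ in values]   # errors ignore the data

p_dependent = dependence_p_value(values, dependent_mask)
p_random = dependence_p_value(values, random_mask)
print("data-dependent mask p-value:", p_dependent)
print("random mask p-value:", p_random)
```

A small p-value suggests that the errors depend on the column's values, while a large one is consistent with errors that occur independently of the data. A fuller version would replace the single-column score with cross-validated predictions from a classifier trained on all columns.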
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Software System Performance and Reliability
