BoostClean: Automated Error Detection and Repair for Machine Learning
Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Eugene Wu

TL;DR
BoostClean is an automated system that detects and repairs data errors in machine learning datasets, improving model accuracy by intelligently combining multiple error detection and repair methods using statistical boosting.
Contribution
It introduces BoostClean, a novel ensemble-based approach that automatically selects effective error detection and repair strategies, including a new Word2Vec-based detector, to enhance data quality for ML models.
Findings
BoostClean improves prediction accuracy by up to 9%.
It achieves a 22.2x speedup with optimizations.
It detects errors effectively across diverse datasets.
Abstract
Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and as new data stream in at prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies is domain value violations, which occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often-available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated general detection…
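The core idea in the abstract, scoring each detection and repair combination by the accuracy it yields on cleanly labeled held-out data, can be illustrated with a small sketch. This is not the authors' implementation: it greedily picks a single best pair rather than building a boosted ensemble, and the toy data, the `-999.0` placeholder value, the threshold classifier, and every function name here are assumptions made for illustration only.

```python
from statistics import mean

# Toy training rows: (feature, label). The value -999.0 is a hypothetical
# out-of-domain placeholder standing in for a domain value violation.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 0), (-999.0, 1),
         (8.0, 1), (9.0, 1), (-999.0, 1), (10.0, 1)]
# Held-out rows with clean features and trusted labels.
TEST = [(1.5, 0), (2.5, 0), (8.5, 1), (9.5, 1), (3.5, 0), (7.5, 1)]


def fit_threshold(rows):
    """Fit a trivial classifier: split at the midpoint of the two class means."""
    m0 = mean(x for x, y in rows if y == 0)
    m1 = mean(x for x, y in rows if y == 1)
    return (m0 + m1) / 2.0, m1 >= m0


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the given label."""
    threshold, one_is_high = model
    hits = sum(((x > threshold) == one_is_high) == (y == 1) for x, y in rows)
    return hits / len(rows)


# Hypothetical detector library: each detector flags a value as erroneous.
DETECTORS = {
    "none": lambda x: False,
    "negative": lambda x: x < 0,
}


def repair_mean(rows, is_bad):
    """Replace flagged values with the mean of the unflagged values."""
    fill = mean(x for x, _ in rows if not is_bad(x))
    return [(fill if is_bad(x) else x, y) for x, y in rows]


def repair_drop(rows, is_bad):
    """Discard rows whose value is flagged."""
    return [(x, y) for x, y in rows if not is_bad(x)]


# Hypothetical repair library.
REPAIRS = {"mean_impute": repair_mean, "drop_row": repair_drop}


def select_best(train, test):
    """Score every (detector, repair) pair by accuracy on the clean test rows."""
    results = {}
    for d_name, detect in DETECTORS.items():
        for r_name, repair in REPAIRS.items():
            model = fit_threshold(repair(train, detect))
            results[(d_name, r_name)] = accuracy(model, test)
    best = max(results, key=results.get)
    return best, results


if __name__ == "__main__":
    best, results = select_best(TRAIN, TEST)
    for pair, score in results.items():
        print(pair, score)
    print("selected:", best)
```

In this toy setup, the pairs that leave the placeholder values in place train a badly skewed classifier, while the pairs that flag negative values recover a usable one. A boosted version would instead add pairs to an ensemble one round at a time, reweighting the examples the current ensemble still misclassifies.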
Taxonomy
Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Anomaly Detection Techniques and Applications
