ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models
Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, Ken Goldberg

TL;DR
ActiveClean is an interactive data cleaning method that incrementally updates convex loss models, prioritizing data likely to impact results, leading to more accurate models with less cleaning effort.
Contribution
It introduces ActiveClean, a progressive cleaning approach that guarantees model accuracy on partially cleaned data and efficiently prioritizes data cleaning based on model structure.
Findings
Improves model accuracy by up to 2.5x for the same cleaning effort
Outperforms uniform sampling and active learning on real datasets
Supports convex loss models like linear regression and SVMs
Abstract
Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training processes due to violation of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-training and can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five…
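The incremental update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a squared loss (linear regression), uniform sampling of the batch to clean rather than the model-driven prioritization the abstract mentions, and a hypothetical cleaning oracle that simply returns the true labels. The names `mean_squared_loss_grad`, `incremental_update`, and `n_uncleaned` are illustrative. Each round takes one gradient step that mixes the gradient over already-cleaned records with the gradient over a freshly cleaned batch, weighted by the share of the dataset each one represents.

```python
import numpy as np


def mean_squared_loss_grad(theta, X, y):
    """Mean gradient of 0.5 * (x @ theta - y)**2 over the rows of X."""
    if len(X) == 0:
        return np.zeros_like(theta)
    return X.T @ (X @ theta - y) / len(X)


def incremental_update(theta, X_clean, y_clean, X_batch, y_batch,
                       n_uncleaned, step=0.5):
    """One gradient step on partially cleaned data (illustrative sketch).

    n_uncleaned counts the records not yet cleaned *before* this batch was
    drawn; the batch is assumed to be a uniform sample of those records.
    """
    n_clean = len(X_clean)
    n_total = n_clean + n_uncleaned
    g_clean = mean_squared_loss_grad(theta, X_clean, y_clean)
    g_batch = mean_squared_loss_grad(theta, X_batch, y_batch)
    # Weight each gradient by the fraction of the dataset it stands in for.
    g = (n_clean / n_total) * g_clean + (n_uncleaned / n_total) * g_batch
    return theta - step * g


# Synthetic data with a systematic error in the labels (assumed setup).
rng = np.random.default_rng(0)
n, d, batch_size = 200, 2, 20
true_theta = np.array([2.0, -1.0])
X = rng.normal(size=(n, d))
y_true = X @ true_theta
y_dirty = y_true + 5.0 * (X[:, 0] > 0)  # corrupted whenever feature 0 > 0

# Start from the model fitted on the dirty data.
theta = np.linalg.lstsq(X, y_dirty, rcond=None)[0]
initial_error = np.linalg.norm(theta - true_theta)

uncleaned = np.arange(n)
cleaned = np.array([], dtype=int)
while len(uncleaned) > 0:
    size = min(batch_size, len(uncleaned))
    batch = rng.choice(uncleaned, size=size, replace=False)
    # "Cleaning" here just looks up the true label (hypothetical oracle).
    theta = incremental_update(theta, X[cleaned], y_true[cleaned],
                               X[batch], y_true[batch],
                               n_uncleaned=len(uncleaned))
    cleaned = np.concatenate([cleaned, batch])
    uncleaned = np.setdiff1d(uncleaned, batch)

final_error = np.linalg.norm(theta - true_theta)
print(f"error before cleaning: {initial_error:.3f}, after: {final_error:.3f}")
```

The model is never re-trained from scratch, and no gradient is ever computed on a mix of dirty and clean labels, which is the kind of mixing that can bias the training process.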
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Explainable Artificial Intelligence (XAI)
Methods: Linear Regression
