AlphaClean: Automatic Generation of Data Cleaning Pipelines
Sanjay Krishnan, Eugene Wu

TL;DR
AlphaClean is an automated framework that efficiently generates high-quality data cleaning pipelines by optimizing parameter tuning and search strategies, outperforming existing methods in quality and robustness.
Contribution
It introduces a novel generate-then-search framework for data cleaning pipeline tuning, incorporating incremental evaluation and dynamic pruning for improved performance.
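The generate-then-search idea can be illustrated with a small sketch. This is not the paper's algorithm: it swaps in a plain greedy search, runs in a single thread, and omits incremental evaluation and pruning. The operators, the `replace_value` helper, the toy quality function, and the sample rows are all hypothetical names and data chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Transform = Tuple[str, Callable[[Row], Row]]  # (label, row-level rewrite)

def replace_value(column: str, old: str, new: str) -> Transform:
    """Build a candidate transformation that rewrites one value in one column."""
    def apply(row: Row) -> Row:
        return {**row, column: new} if row[column] == old else row
    return (f"{column}: {old!r} -> {new!r}", apply)

# Generate step: each (hypothetical) operator contributes candidates to a pool.
def city_operator() -> List[Transform]:
    return [
        replace_value("city", "NYC", "New York"),
        replace_value("city", "New Yrok", "New York"),
        replace_value("city", "New York", "NYC"),  # a deliberately poor candidate
    ]

def state_operator() -> List[Transform]:
    return [replace_value("state", "N.Y.", "NY")]

def quality(rows: List[Row]) -> float:
    """Toy quality measure: fraction of cells holding the canonical value."""
    good = sum(r["city"] == "New York" for r in rows)
    good += sum(r["state"] == "NY" for r in rows)
    return good / (2 * len(rows))

# Search step: greedily append the candidate that improves quality the most.
def greedy_search(rows: List[Row], pool: List[Transform]):
    pipeline: List[str] = []
    best = quality(rows)
    while True:
        scored = [(quality([fn(r) for r in rows]), label, fn) for label, fn in pool]
        top_score, top_label, top_fn = max(scored, key=lambda s: s[0])
        if top_score <= best:  # no candidate improves quality: stop
            return pipeline, rows, best
        rows = [top_fn(r) for r in rows]
        pipeline.append(top_label)
        best = top_score

rows = [
    {"city": "NYC", "state": "NY"},
    {"city": "New Yrok", "state": "N.Y."},
    {"city": "New York", "state": "NY"},
]
pool = city_operator() + state_operator()
pipeline, cleaned, score = greedy_search(rows, pool)
print(pipeline, score)
```

On this toy input the search sequences three of the four candidates and never selects the poor one, because applying it would lower the quality score.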
Findings
AlphaClean finds solutions of up to 9x higher quality than existing parameter tuning methods.
It is more robust to data cleaning method variability.
It can integrate systems like HoloClean as operators.
Abstract
The analyst effort in data cleaning is gradually shifting away from the design of hand-written scripts to building and tuning complex pipelines of automated data cleaning libraries. Hyper-parameter tuning for data cleaning is very different from hyper-parameter tuning for machine learning, since the pipeline components and objective functions have structure that tuning algorithms can exploit. This paper proposes a framework, called AlphaClean, that rethinks parameter tuning for data cleaning pipelines. AlphaClean provides users with a rich library to define data quality measures with weighted sums of SQL aggregate queries. AlphaClean applies a generate-then-search framework where each pipelined cleaning operator contributes candidate transformations to a shared pool. Asynchronously, in separate threads, a search algorithm sequences them into cleaning pipelines that maximize the…
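To make the phrase "weighted sums of SQL aggregate queries" concrete, here is a minimal sketch using Python's standard `sqlite3` module. It does not use AlphaClean's library. The `listings` table, the three rules, and their weights are assumptions made up for illustration. Each query counts violations, so this version is a cost where lower is better; negating it would give a score to maximize.

```python
import sqlite3

# Hypothetical dirty table, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (city TEXT, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [("New York", 120.0), ("NYC", 95.0), ("New York", None), ("Boston", -10.0)],
)

# (weight, SQL aggregate) pairs; each query returns a violation count.
MEASURES = [
    (1.0, "SELECT COUNT(*) FROM listings WHERE price IS NULL"),  # missing price
    (2.0, "SELECT COUNT(*) FROM listings WHERE price < 0"),      # negative price
    (0.5, "SELECT COUNT(*) FROM listings WHERE city = 'NYC'"),   # non-canonical name
]

def quality_cost(conn: sqlite3.Connection, measures) -> float:
    """Weighted sum of the aggregate results (lower means cleaner)."""
    return sum(weight * conn.execute(sql).fetchone()[0] for weight, sql in measures)

before = quality_cost(conn, MEASURES)
# Apply one candidate repair and re-evaluate the same measure.
conn.execute("UPDATE listings SET city = 'New York' WHERE city = 'NYC'")
after = quality_cost(conn, MEASURES)
print(before, after)
```

Because the measure is just SQL over the current table state, the same function can score the data before and after any candidate repair, which is what a search over cleaning pipelines needs.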
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Advanced Database Systems and Queries
