HoloClean: Holistic Data Repairs with Probabilistic Inference
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré

TL;DR
HoloClean is a probabilistic framework that unifies qualitative and quantitative data repairing methods, enabling scalable, accurate data repairs for large datasets by leveraging probabilistic inference.
Contribution
It introduces a novel probabilistic inference-based approach that unifies existing data repairing techniques and scales efficiently to large datasets.
Findings
Achieves an average precision of ~90% in data repairs
Maintains an average recall above ~76% across datasets
Improves F1 score by over 2x compared to state-of-the-art methods
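The findings above are stated as precision, recall, and F1 over repaired cells. As a minimal sketch of how such figures could be computed, assuming the usual definitions (precision is correct repairs over repairs performed, recall is correct repairs over total errors), the hypothetical helper below scores a repaired table against ground truth. The function name `repair_metrics`, the cell-keyed dictionaries, and the toy data are illustrative assumptions, not part of HoloClean.

```python
def repair_metrics(dirty, repaired, truth):
    """Score cell-level repairs (hypothetical helper, not HoloClean's API).

    Each argument maps a cell identifier to that cell's value.
    """
    # Cells whose input value disagrees with the ground truth.
    errors = {c for c in truth if dirty[c] != truth[c]}
    # Cells whose value the repair step changed.
    changed = {c for c in truth if repaired[c] != dirty[c]}
    # Changed cells that now match the ground truth.
    correct = {c for c in changed if repaired[c] == truth[c]}

    precision = len(correct) / len(changed) if changed else 0.0
    recall = len(correct) / len(errors) if errors else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data (made up for illustration): two erroneous cells, one repaired.
truth = {1: "Chicago", 2: "IL", 3: "60608", 4: "Chicago"}
dirty = {1: "Cicago", 2: "IL", 3: "60609", 4: "Chicago"}
repaired = {1: "Chicago", 2: "IL", 3: "60609", 4: "Chicago"}

precision, recall, f1 = repair_metrics(dirty, repaired, truth)
print(precision, recall, round(f1, 3))
```

In this toy case the single repair is correct but one error is left untouched, so precision is high while recall is lower. F1 combines the two as their harmonic mean.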
Abstract
We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean scales to instances with millions of tuples and finds data repairs with an average precision of ~90% and an average recall of above ~76% across a diverse array of datasets exhibiting different types of…
Taxonomy
Topics: Data Quality and Management · Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI)
