HoloClean: Holistic Data Repairs with Probabilistic Inference

Theodoros Rekatsinas; Xu Chu; Ihab F. Ilyas; Christopher Ré

arXiv:1702.00820·cs.DB·February 6, 2017·57 cites

TL;DR

HoloClean is a probabilistic framework that unifies qualitative and quantitative data repairing methods, enabling scalable, accurate data repairs for large datasets by leveraging probabilistic inference.

Contribution

HoloClean introduces a probabilistic-inference approach that unifies existing data repairing techniques and scales to datasets with millions of tuples.

Findings

01

Achieves an average precision of ~90% in data repairs

02

Maintains an average recall above ~76% across datasets

03

Improves F1 score by more than 2x compared to state-of-the-art methods
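To make the three metrics above concrete, here is a minimal sketch of how repair precision, recall, and F1 are commonly computed in data-cleaning evaluations: precision is the fraction of performed repairs that are correct, and recall is the fraction of true errors that were correctly repaired. The function name `repair_metrics` and the toy cells are hypothetical and invented for illustration; they are not data or code from the paper.

```python
def repair_metrics(dirty, repaired, truth):
    """Return (precision, recall, f1) for a set of cell repairs.

    All three arguments map a cell id, here (tuple_id, attribute), to a value.
    """
    # Cells that were actually wrong in the dirty input.
    errors = {c for c in truth if dirty[c] != truth[c]}
    # Cells the repair step changed.
    changed = {c for c in truth if repaired[c] != dirty[c]}
    # Changed cells that now match the ground truth.
    correct = {c for c in changed if repaired[c] == truth[c]}

    precision = len(correct) / len(changed) if changed else 0.0
    recall = len(correct) / len(errors) if errors else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy data: three erroneous cells in the dirty input.
truth = {
    ("t1", "city"): "Chicago", ("t1", "zip"): "60608",
    ("t2", "city"): "Chicago", ("t2", "zip"): "60608",
    ("t3", "city"): "Boston",
}
dirty = dict(truth)
dirty[("t1", "city")] = "Cicago"
dirty[("t2", "zip")] = "60609"
dirty[("t3", "city")] = "Bostn"

# Hypothetical repair output: two correct fixes, one missed error,
# and one clean cell changed to a wrong value.
repaired = dict(dirty)
repaired[("t1", "city")] = "Chicago"
repaired[("t3", "city")] = "Boston"
repaired[("t1", "zip")] = "60609"

precision, recall, f1 = repair_metrics(dirty, repaired, truth)
```

In this toy case the repair step changes three cells and gets two right, and it fixes two of the three true errors, so precision and recall are both two thirds.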

Abstract

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean scales to instances with millions of tuples and finds data repairs with an average precision of ~90% and an average recall above ~76% across a diverse array of datasets exhibiting different types of…
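The idea of combining a qualitative signal (an integrity constraint) with a quantitative signal (statistics of the input data) in one probabilistic score can be illustrated with a toy sketch. This is not HoloClean's actual model or API: the functional dependency `zip -> city`, the weights `W_FD` and `W_FREQ`, the function `repair_distribution`, and the sample rows are all assumptions made up for this example, and the weights are hand-picked rather than produced by any inference procedure.

```python
import math
from collections import Counter

# Hypothetical input: row 2 has a suspected typo in "city".
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "02108", "city": "Boston"},
]

W_FD = 2.0    # assumed weight: penalty per violation of zip -> city
W_FREQ = 1.0  # assumed weight: reward for values frequent in the data


def repair_distribution(rows, idx, attr="city", key="zip"):
    """Return a probability for each candidate value of rows[idx][attr]."""
    target = rows[idx]
    others = [r for i, r in enumerate(rows) if i != idx]
    candidates = sorted({r[attr] for r in rows})
    freq = Counter(r[attr] for r in others)
    same_key = [r for r in others if r[key] == target[key]]

    scores = {}
    for value in candidates:
        # Qualitative signal: rows sharing the key that would disagree.
        violations = sum(1 for r in same_key if r[attr] != value)
        # Quantitative signal: relative frequency among the other rows.
        rel_freq = freq[value] / len(others)
        scores[value] = -W_FD * violations + W_FREQ * rel_freq

    # Softmax turns the scores into a distribution over candidates.
    z = sum(math.exp(s) for s in scores.values())
    return {v: math.exp(s) / z for v, s in scores.items()}


dist = repair_distribution(rows, idx=2)
best = max(dist, key=dist.get)
```

With these assumed weights, `best` is the value that both satisfies the dependency and is frequent in the data, and `dist` gives a confidence for every candidate instead of a single hard choice.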

Taxonomy

Topics: Data Quality and Management · Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI)