BayesWipe: A Scalable Probabilistic Framework for Cleaning BigData

Sushovan De; Yuheng Hu; Meduri Venkata Vamsikrishna; Yi Chen; Subbarao Kambhampati

arXiv:1506.08908 · cs.DB · July 1, 2015 · 1 citation

Open Access

TL;DR

BayesWipe introduces a scalable probabilistic framework that automatically corrects attribute errors in large structured databases using Bayesian models, eliminating the need for domain expertise or clean data samples.

Contribution

It presents a Bayesian approach to attribute correction in big data that learns its error model directly from the noisy data and supports consistent query answering when write access to the database is unavailable.

Findings

01. Effective correction of attribute errors demonstrated on synthetic and real datasets

02. Avoids reliance on domain experts or clean samples for data cleaning

03. Supports consistent query answering in the presence of data errors

Abstract

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to…
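The correction idea described in the abstract can be illustrated with a small Python sketch. It scores each candidate clean value by a prior (how plausible the value is) multiplied by an error likelihood (how likely the observed value is a corrupted form of it), and returns the highest-scoring candidate. Everything specific here is an assumption made for illustration and is not taken from the paper: the prior is a plain frequency count over a single noisy column, the error model is an exponential penalty on Levenshtein distance, and the names `edit_distance`, `correct_value`, and `penalty` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct_value(observed: str, column: Sequence[str], penalty: float = 1.0) -> str:
    """Return the candidate maximizing log P(candidate) + log P(observed | candidate).

    Assumptions (illustrative only, not the paper's models):
      - prior: relative frequency of each value in the noisy column itself
      - error model: P(observed | candidate) proportional to exp(-penalty * distance)
    """
    counts = Counter(column)
    total = len(column)

    def log_score(candidate: str) -> float:
        log_prior = math.log(counts[candidate] / total)
        log_error = -penalty * edit_distance(observed, candidate)
        return log_prior + log_error

    return max(counts, key=log_score)


# A noisy column: one misspelling among many clean values.
makes = ["Honda"] * 40 + ["Toyota"] * 25 + ["Hnoda"]
print(correct_value("Hnoda", makes))
print(correct_value("Toyota", makes))
```

Both the prior and the error scores are computed from the noisy column alone, which mirrors the abstract's point that no clean master data is required. The `penalty` value controls how strongly the sketch resists changing an observed value; with a large penalty, rare misspellings are left as they are.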

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Data Mining Algorithms and Applications