A Formal Framework For Probabilistic Unclean Databases
Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Re, Theodoros Rekatsinas

TL;DR
This paper introduces a probabilistic framework for unclean databases that combines statistical models of data generation and noise, enabling effective data cleaning, query answering, and model learning.
Contribution
It proposes the Probabilistic Unclean Database (PUD) framework, which integrates probabilistic models of clean data and of noise, extends traditional repair concepts, and demonstrates learnability from a single dirty instance.
Findings
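Defines a PUD as a triple: the intention (a probabilistic database modeling how clean data is generated), the realization (a probabilistic transformation modeling how noise is introduced), and the observation (the dirty observed instance).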
Shows PUD generalizes traditional data repair concepts.
Proves tractability results and shows that the models can be learned from data, including from a single dirty instance.
Abstract
Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by these empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database instance. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and a dirty observed database instance that we call…
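The noisy channel idea in the abstract can be illustrated with a small sketch. The Python below is a toy, not the paper's construction: it assumes a single-column database whose tuples are independent, and the names `PRIOR`, `CHANNEL`, `joint`, `most_probable_clean`, and `posterior`, along with all probabilities, are hypothetical. `PRIOR` plays the role of the intention, `CHANNEL` the role of the realization, and the `observed` tuple the role of the dirty instance. Cleaning is then a search for the clean instance that maximizes prior times noise likelihood.

```python
from itertools import product

# Intention (assumed): prior probability of each clean value, per tuple.
PRIOR = {"Paris": 0.6, "Rome": 0.4}

# Realization (assumed): P(observed value | clean value).
CHANNEL = {
    "Paris": {"Paris": 0.90, "Pari": 0.08, "Rome": 0.02},
    "Rome": {"Rome": 0.90, "Pari": 0.01, "Paris": 0.09},
}


def joint(clean, observed):
    """P(clean) * P(observed | clean) under tuple independence."""
    p = 1.0
    for c, o in zip(clean, observed):
        # Observed values the channel cannot produce get probability 0.
        p *= PRIOR[c] * CHANNEL[c].get(o, 0.0)
    return p


def most_probable_clean(observed):
    """Brute-force search for the most probable clean instance."""
    candidates = product(PRIOR, repeat=len(observed))
    return max(candidates, key=lambda clean: joint(clean, observed))


def posterior(clean, observed):
    """P(clean | observed), normalizing over all candidate clean instances."""
    total = sum(
        joint(c, observed) for c in product(PRIOR, repeat=len(observed))
    )
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return joint(clean, observed) / total


observed = ("Pari", "Rome")
best = most_probable_clean(observed)
print(best, posterior(best, observed))
```

The enumeration is exponential in the number of tuples, so this only works for tiny inputs. It is meant to show the roles of the three components, not an efficient cleaning algorithm.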
