Pattern-Driven Data Cleaning
El Kindi Rezig, Mourad Ouzzani, Walid G. Aref, Ahmed K. Elmagarmid, Ahmed R. Mahmood

TL;DR
This paper introduces a new data cleaning approach that leverages functional dependency patterns to improve repair accuracy, interpretability, and scalability, outperforming existing methods in experiments.
Contribution
The paper formalizes pattern-preserving repairs, proposes an interpretable repair formalism, and develops a linear-time algorithm for efficient data repair.
Findings
Outperforms state-of-the-art algorithms in repair quality.
Achieves linear scalability in repair time.
Effectively preserves frequent data patterns.
Abstract
Data is inherently dirty and there has been a sustained effort to come up with different approaches to clean it. A large class of data repair algorithms rely on data-quality rules and integrity constraints to detect and repair the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short) that specify dependencies among attributes in a relation. In this paper, we address three major challenges in data repairing: (1) Accuracy: Most existing techniques strive to produce repairs that minimize changes to the data. However, this process may produce incorrect combinations of attribute values (or patterns). In this work, we formalize the interaction of FD-induced patterns and select repairs that result in preserving frequent patterns found in the original data. This has the potential to yield a better repair quality both in terms of precision and recall.…
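To make the idea of FD-induced patterns concrete, here is a minimal Python sketch. It is not the paper's algorithm: it is a simple majority-vote heuristic for a single FD, and the names `fd_patterns` and `repair_by_frequent_pattern`, as well as the sample `zip -> city` data, are hypothetical choices made for illustration. It counts how often each left-hand-side/right-hand-side value combination (pattern) occurs and repairs violating tuples toward the most frequent pattern.

```python
from collections import Counter, defaultdict


def fd_patterns(rows, lhs, rhs):
    """Count occurrences of each (LHS values -> RHS value) pattern for the FD lhs -> rhs."""
    counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        counts[key][row[rhs]] += 1
    return counts


def repair_by_frequent_pattern(rows, lhs, rhs):
    """Return repaired copies of rows; the input rows are left unchanged.

    Assumption (not from the paper): for each LHS value, the most frequent
    RHS value is taken as the correct one.
    """
    counts = fd_patterns(rows, lhs, rhs)
    repaired = []
    for row in rows:
        key = tuple(row[a] for a in lhs)
        best_value, _ = counts[key].most_common(1)[0]
        repaired.append({**row, rhs: best_value})
    return repaired


# Hypothetical sample data: the third tuple violates zip -> city.
rows = [
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "West Lafayette"},
    {"zip": "47906", "city": "Lafayette"},
    {"zip": "10001", "city": "New York"},
]
repaired = repair_by_frequent_pattern(rows, ["zip"], "city")
```

Both passes use hash lookups, so this sketch runs in time linear in the number of rows for one FD. The paper's contribution concerns how patterns from multiple FDs interact, which this single-FD heuristic does not attempt to model.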
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Privacy-Preserving Technologies in Data
