Discovery of Paradigm Dependencies

Jizhou Sun; Jianzhong Li; Hong Gao

arXiv:1710.02817·cs.DB·October 10, 2017

TL;DR

This paper introduces Paradigm Dependencies (PDs), a new type of data dependency rule that captures partial string information for improved data quality management, along with a clustering and alignment framework to discover them.

Contribution

It proposes Paradigm Dependencies, a novel dependency rule type that considers parts of string values, and develops a clustering and alignment method to discover these dependencies efficiently.

Findings

01. PDs improve data quality handling for string attributes.

02. The proposed greedy algorithm effectively discovers PDs.

03. Experimental results validate the method's effectiveness and efficiency.

Abstract

Missing and incorrect values often cause serious consequences. A commonly employed class of tools for dealing with these data quality problems is dependency rules, such as Functional Dependencies (FDs), Conditional Functional Dependencies (CFDs), and Edition Rules (ERs). The more expressive a dependency is, the better the quality of the data that can be obtained. To the best of our knowledge, all previous dependencies treat each attribute value as a non-splittable whole. In many applications, however, part of a value may contain meaningful information, which indicates that more powerful dependency rules for handling data quality problems are possible. In this paper, we consider discovering this type of dependency, in which the left-hand side is part of a regular-expression-like paradigm, named Paradigm Dependencies (PDs). PDs tell that if a string matches the paradigm, element…
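The idea of a dependency whose left-hand side is only part of a string can be illustrated with a small sketch. This is not the paper's formalism or its discovery algorithm: the abstract is truncated here, so the exact semantics of a PD are not shown. The sketch assumes one plausible reading, namely that a substring extracted by a regex-like pattern should determine another attribute. The phone/city records, the pattern, and the helper name `find_violations` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical records of the form (phone, city); invented for illustration only.
records = [
    ("010-5551234", "Beijing"),
    ("010-5559876", "Beijing"),
    ("021-5550000", "Shanghai"),
    ("021-5551111", "Beijing"),   # disagrees with the other 021 record
    ("not-a-phone", "Unknown"),   # does not match the pattern, so it is skipped
]

# Assumed regex-like pattern; the named group marks the part of the
# string that plays the role of the left-hand side.
PATTERN = re.compile(r"(?P<area>\d{3})-\d{7}")


def find_violations(rows, pattern, group):
    """Group rows by the extracted substring and return the groups whose
    right-hand-side values disagree."""
    seen = defaultdict(set)
    for value, rhs in rows:
        match = pattern.fullmatch(value)
        if match is None:
            continue  # the rule only constrains strings that match the pattern
        seen[match.group(group)].add(rhs)
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


violations = find_violations(records, PATTERN, "area")
print(violations)  # {'021': ['Beijing', 'Shanghai']}
```

A whole-value functional dependency from phone to city would not flag these rows, because every phone number is distinct. Only the extracted prefix exposes the conflict, which is the kind of extra expressive power the abstract attributes to using part of a value.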

Taxonomy

Topics: Data Quality and Management · Data Mining Algorithms and Applications · Advanced Database Systems and Queries