Mining CFD Rules on Big Data

Hongzhi Wang; Mingda Li; Jiawei Zhao; Jianzhong Li; Hong Gao

arXiv:1808.01621 · cs.DB · August 7, 2018

Open Access

TL;DR

This paper presents a scalable approach for discovering conditional functional dependencies (CFDs) on big data by using sampling, fault-tolerance, and conflict resolution techniques, enabling effective rule discovery on billion-tuple datasets.

Contribution

It introduces a comprehensive framework combining sampling, fault-tolerance, and parameter tuning for CFD discovery tailored to large, low-quality datasets.

Findings

01 Effective CFD rules discovered on billion-tuple data

02 Method reduces computational time significantly

03 Framework handles low-quality and voluminous data

Abstract

Existing algorithms for discovering conditional functional dependencies (CFDs) require a well-prepared training data set, which makes them hard to apply to large datasets, since such datasets are typically of low quality. To handle the volume of big data, we develop sampling algorithms that obtain a small, representative training set. To handle its low quality, we design a fault-tolerant rule discovery algorithm and a conflict resolution algorithm. We also propose a parameter selection strategy for the CFD discovery algorithm to ensure its effectiveness. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within reasonable time.
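
To make the abstract's ideas concrete, here is a minimal sketch of sampling followed by fault-tolerant discovery of constant CFDs. It is not the paper's algorithm: uniform random sampling stands in for the paper's sampling step, and a simple support/confidence threshold stands in for its fault tolerance. The function names, the thresholds `min_support` and `min_confidence`, and the toy phone-number table are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def sample_rows(rows, k, seed=0):
    """Uniform sample without replacement (hypothetical stand-in for the paper's sampling)."""
    rng = random.Random(seed)
    return list(rows) if k >= len(rows) else rng.sample(rows, k)


def mine_constant_cfds(rows, lhs, rhs, min_support=2, min_confidence=0.8):
    """Find constant patterns (lhs values -> rhs value) that hold on most matching rows.

    A pattern is kept when at least `min_support` rows match its lhs values and
    the most frequent rhs value covers at least `min_confidence` of them, so a
    few dirty tuples do not block the rule.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][row[rhs]] += 1

    rules = {}
    for key, counts in groups.items():
        support = sum(counts.values())
        value, hits = counts.most_common(1)[0]
        confidence = hits / support
        if support >= min_support and confidence >= min_confidence:
            rules[key] = (value, support, confidence)
    return rules


# Toy table (assumed data): country code and area code should determine the city.
rows = (
    [{"cc": "44", "ac": "131", "city": "EDI"}] * 4
    + [{"cc": "44", "ac": "131", "city": "LDN"}]      # one dirty tuple
    + [{"cc": "01", "ac": "908", "city": "MH"}] * 3
    + [{"cc": "01", "ac": "212", "city": "NYC"}]      # too rare to trust
)

rules = mine_constant_cfds(rows, lhs=("cc", "ac"), rhs="city")
training = sample_rows(rows, k=4)  # on real data, mine the sample instead of all rows
```

On the toy table, the pattern (cc=44, ac=131) is kept with city=EDI despite the one conflicting tuple, while (cc=01, ac=212) is dropped for lack of support. Lowering `min_confidence` tolerates more noise at the cost of admitting weaker rules, which is the kind of trade-off a parameter selection strategy has to settle.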

Taxonomy

Topics: Data Quality and Management · Data Mining Algorithms and Applications · Rough Sets and Fuzzy Logic