Data Quality Evaluation using Probability Models

Allen ONeill

arXiv:2009.06672 · cs.LG · September 16, 2020 · 1 citation

TL;DR

This paper explores using machine-learning probability models, specifically decision trees, to evaluate data quality by distinguishing good from bad data, demonstrating promising accuracy with simple labeled examples.

Contribution

It introduces a domain-agnostic approach employing decision trees for data quality assessment based on labeled examples, highlighting its potential and limitations.

Findings

01

For the datasets examined, a decision tree accurately predicts data quality from simple good/bad pre-labelled examples.

02

The approach is simple and domain-independent.

03

In general, simple good/bad labels may not be sufficient for useful production data quality assessment.

Abstract

This paper discusses an approach that uses machine-learning probability models to evaluate the difference between good and bad data quality in a dataset. A decision tree algorithm is used to predict data quality without any domain knowledge of the datasets under examination. For the data examined, predicting the quality of data from simple good/bad pre-labelled learning examples is shown to be accurate; in general, however, it may not be sufficient for useful production data quality assessment.
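To make the idea in the abstract concrete, the following is a minimal sketch of a decision tree trained on good/bad pre-labelled records and then asked for class probabilities on unseen records. It is not the paper's implementation: the tree is a small CART-style learner using Gini impurity, written in plain Python, and the two numeric features (fraction of missing fields, count of out-of-range values), the sample rows, and all function names are hypothetical choices made only for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; leaves store class probabilities."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        total = len(labels)
        proba = {k: v / total for k, v in Counter(labels).items()}
        return {"leaf": True, "proba": proba}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "leaf": False,
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], depth + 1, max_depth),
    }


def predict_proba(tree, row):
    """Walk the tree and return the leaf's label-to-probability mapping."""
    while not tree["leaf"]:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["proba"]


# Hypothetical training data: (fraction of missing fields, out-of-range count).
rows = [
    (0.00, 0), (0.05, 0), (0.10, 1), (0.02, 0),
    (0.40, 3), (0.60, 5), (0.35, 2), (0.80, 4),
]
labels = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]

tree = build_tree(rows, labels)
print(predict_proba(tree, (0.03, 0)))  # probabilities for a clean-looking record
print(predict_proba(tree, (0.50, 4)))  # probabilities for a messy-looking record
```

On such a cleanly separable toy set a single split is enough, which mirrors the abstract's caveat: high accuracy on simple good/bad examples does not by itself show that the approach is sufficient for production data quality assessment.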

Taxonomy

Topics: Data Quality and Management · Big Data and Business Intelligence · Data Mining Algorithms and Applications