Data Quality Evaluation using Probability Models
Allen ONeill

TL;DR
This paper explores using machine-learning probability models, specifically decision trees, to distinguish good from bad data. On the data examined, a tree trained on simple good/bad labeled examples predicts quality accurately, though this may not be sufficient for production use.
Contribution
It introduces a domain-agnostic approach employing decision trees for data quality assessment based on labeled examples, highlighting its potential and limitations.
Findings
On the data examined, decision trees accurately predict data quality from good/bad labeled examples.
The approach is simple and requires no domain knowledge of the dataset.
Simple good/bad labels may not be sufficient for useful production data quality assessment.
Abstract
This paper discusses an approach that uses machine-learning probability models to evaluate the difference between good and bad data quality in a dataset. A decision tree algorithm is used to predict data quality without any domain knowledge of the datasets under examination. For the data examined, predicting data quality from simple good/bad pre-labelled learning examples is shown to be accurate. In general, however, this may not be sufficient for useful production data quality assessment.
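To make the idea concrete, here is a minimal sketch of a decision tree trained on good/bad labeled values. It is not the paper's implementation. The features (emptiness, length, digit fraction), the toy date-like values, and all function names are assumptions chosen for illustration, and the tree is a small hand-written Gini-based learner rather than any specific library.

```python
from collections import Counter

def featurize(value):
    """Domain-agnostic features of one raw field value (assumed, not from the paper)."""
    text = "" if value is None else str(value)
    n = len(text)
    digits = sum(ch.isdigit() for ch in text)
    return [
        1.0 if n == 0 else 0.0,     # is the field empty or missing?
        float(n),                   # length in characters
        digits / n if n else 0.0,   # fraction of characters that are digits
    ]

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; each leaf stores the fraction of 'good' examples."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return {"p_good": Counter(labels)["good"] / len(labels)}
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def prob_good(tree, row):
    """Walk the tree and return the estimated probability that a row is 'good'."""
    while "p_good" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["p_good"]

# Toy, made-up training data: date-like strings labeled good or bad.
examples = [
    ("2021-03-14", "good"), ("2019-11-02", "good"),
    ("2020-07-30", "good"), ("2022-01-09", "good"),
    ("", "bad"), (None, "bad"), ("n/a", "bad"),
    ("unknown", "bad"), ("14 March", "bad"),
]
rows = [featurize(value) for value, _ in examples]
labels = [label for _, label in examples]
tree = build_tree(rows, labels)

print(prob_good(tree, featurize("2023-05-06")))  # well-formed value
print(prob_good(tree, featurize("")))            # empty value
print(prob_good(tree, featurize("9999-99-99")))  # malformed but similar in shape
```

On this toy data the tree separates the classes with a single split on length, so an impossible date such as "9999-99-99" receives the same score as a valid one. That is one way to see the abstract's caveat: good/bad labels over shallow features can look accurate on the examples at hand while missing errors that need domain knowledge to detect.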
Taxonomy
Topics: Data Quality and Management · Big Data and Business Intelligence · Data Mining Algorithms and Applications
