Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring
Sezal Chug, Priya Kaushal, Ponnurangam Kumaraguru, Tavpritesh Sethi

TL;DR
This paper presents an automated, domain-agnostic data quality scoring platform that assesses datasets using a comprehensive metric derived from multiple quality indicators, aiding data scientists in evaluating data reliability.
Contribution
The study introduces a novel automated platform that generates a data quality score, label, and report using PCA-based metrics across diverse datasets, filling a gap in practical data quality assessment.
Findings
Developed a metric with nine quality ingredients
Validated the metric using mutation testing
Demonstrated the platform's effectiveness on real datasets
Abstract
Data is expanding at an unprecedented rate, and with this growth comes responsibility for the quality of that data. Data quality refers to the relevance of the information present and supports operations such as decision making and planning within an organization. Data quality is mostly measured on an ad-hoc basis, so few of the concepts developed so far have found practical application. The current empirical study was undertaken to formulate a concrete automated data quality platform that assesses the quality of an incoming dataset and generates a quality label, score, and comprehensive report. We utilize various datasets from healthdata.gov, opendata.nhs, and the Demographics and Health Surveys (DHS) Program to observe variations in the quality score and formulate a label using Principal Component Analysis (PCA). The results of the current empirical study revealed a metric that…
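As a rough illustration of how several quality indicators could be combined into one score and label with PCA, the sketch below projects standardized indicator values onto their first principal component and rescales the result to 0-100. It is not the authors' implementation: the abstract does not list the nine ingredients or say how PCA is applied, so the three indicator columns (completeness, uniqueness, consistency), the demo values, the label thresholds, and all function names are hypothetical assumptions. Power iteration stands in for a library PCA routine to keep the example dependency-free.

```python
import math

def standardize(rows):
    """Z-score each indicator column (population std; constant columns map to 0)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = []
    for j in range(m):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # avoid division by zero
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def first_component(z, iters=200):
    """Approximate the first principal component by power iteration.

    Assumes the all-ones start vector is not orthogonal to the component.
    """
    n, m = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / n for b in range(m)] for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Orient the component so that better indicators give a higher score.
    if sum(v) < 0:
        v = [-x for x in v]
    return v

def quality_scores(rows):
    """Project each dataset onto the first component and rescale to 0-100."""
    z = standardize(rows)
    pc1 = first_component(z)
    raw = [sum(a * b for a, b in zip(r, pc1)) for r in z]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [50.0] * len(raw)
    return [100.0 * (x - lo) / (hi - lo) for x in raw]

def quality_label(score):
    """Hypothetical thresholds; the paper's label boundaries are not given here."""
    if score >= 67:
        return "high"
    if score >= 34:
        return "medium"
    return "low"

# Made-up values: one row per dataset, columns are
# (completeness, uniqueness, consistency), each in [0, 1].
demo = [
    [0.99, 0.98, 0.97],
    [0.80, 0.85, 0.75],
    [0.40, 0.55, 0.35],
    [0.95, 0.90, 0.92],
]
scores = quality_scores(demo)
labels = [quality_label(s) for s in scores]
```

Because the score is min-max scaled across the datasets supplied, it is relative to that batch; a platform scoring one incoming dataset at a time would need fixed reference statistics instead, which this sketch does not attempt.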
Taxonomy
Topics: Data Quality and Management · Big Data and Business Intelligence · Big Data Technologies and Applications
