Towards Explainable Automated Data Quality Enhancement without Domain Knowledge

Djibril Sarr

arXiv:2409.10139·cs.DB·September 17, 2024

Open Access

TL;DR

This paper presents a hybrid, explainable framework for automatic data quality assessment and correction applicable to various datasets, focusing on transparency and balancing accuracy with efficiency.

Contribution

It introduces a novel, explainable approach combining statistical and machine learning methods for automated data quality enhancement without domain knowledge.

Findings

01. Missing values, duplicates, and typos are detected effectively.

02. The demonstrated approach balances accuracy and explainability.

03. Detecting outliers and logical errors remains a challenge.

Abstract

In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability,…
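To make the three defect types concrete, here is a minimal Python sketch of how absence, redundancy, and one simple form of incoherence (typos) might be flagged, with a human-readable reason attached to each finding. It is an illustration only, not the paper's method: the function names, the `MISSING_TOKENS` set, the 0.85 similarity threshold, and the frequency-based typo heuristic are all assumptions introduced here.

```python
from collections import Counter
from difflib import SequenceMatcher

# Assumed placeholders for "no value"; a real dataset may need a different set.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?"}


def find_missing(rows):
    """Absence: flag cells that are None or look like a missing-value placeholder."""
    issues = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            if value is None or str(value).strip().lower() in MISSING_TOKENS:
                issues.append({
                    "row": i, "column": col, "defect": "absence",
                    "reason": f"value {value!r} is empty or a missing-value placeholder",
                })
    return issues


def find_duplicates(rows):
    """Redundancy: flag rows identical to an earlier row after light normalisation."""
    first_seen = {}
    issues = []
    for i, row in enumerate(rows):
        key = tuple(sorted((c, str(v).strip().lower()) for c, v in row.items()))
        if key in first_seen:
            issues.append({
                "row": i, "column": None, "defect": "redundancy",
                "reason": f"identical to row {first_seen[key]} after normalisation",
            })
        else:
            first_seen[key] = i
    return issues


def find_typos(rows, column, threshold=0.85):
    """Incoherence (typos): flag a value that closely resembles a more frequent one."""
    counts = Counter(str(r[column]) for r in rows if r.get(column) is not None)
    issues = []
    for i, row in enumerate(rows):
        if row.get(column) is None:
            continue
        value = str(row[column])
        for other, n in counts.items():
            # Only compare against strictly more frequent values in the same column.
            if other == value or n <= counts[value]:
                continue
            ratio = SequenceMatcher(None, value.lower(), other.lower()).ratio()
            if ratio >= threshold:
                issues.append({
                    "row": i, "column": column, "defect": "incoherence",
                    "reason": f"{value!r} is {ratio:.2f} similar to more frequent value {other!r}",
                })
                break
    return issues


if __name__ == "__main__":
    # Hypothetical toy data, invented for this example.
    rows = [
        {"city": "Paris", "age": 31},
        {"city": "Paris", "age": 31},
        {"city": "Pariss", "age": 40},
        {"city": "Paris", "age": None},
        {"city": "Lyon", "age": "N/A"},
    ]
    for issue in find_missing(rows) + find_duplicates(rows) + find_typos(rows, "city"):
        print(issue)
```

Each finding carries a plain-language `reason`, which is one simple way to keep detections traceable; rule-based checks like these are easy to explain but will miss subtler problems such as outliers and logical errors.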

Taxonomy

Topics: Data Quality and Management · Data Mining Algorithms and Applications · Big Data Technologies and Applications

Methods: Sparse Evolutionary Training