Towards Explainable Automated Data Quality Enhancement without Domain Knowledge
Djibril Sarr

TL;DR
This paper presents a hybrid, explainable framework for automatic data quality assessment and correction applicable to various datasets, focusing on transparency and balancing accuracy with efficiency.
Contribution
It introduces a novel, explainable approach combining statistical and machine learning methods for automated data quality enhancement without domain knowledge.
Findings
Effective detection of missing values, duplicates, and typos.
The approach balances accuracy with explainability.
Challenges remain in detecting outliers and logical errors.
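This summary does not describe the paper's actual detectors, so the following is only an illustrative sketch of what domain-agnostic checks for the three defect kinds named above (missing values, duplicates, typos) might look like. The function names, the `MISSING_TOKENS` placeholder set, and the `min_ratio` and `min_support` thresholds are all assumptions made for this example. They are not taken from the paper.

```python
from collections import Counter
from difflib import SequenceMatcher

# Assumed set of strings treated as "missing"; not taken from the paper.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?"}


def find_missing(rows):
    """Return (row_index, column) pairs whose value is None or a placeholder."""
    hits = []
    for i, row in enumerate(rows):
        for col, val in row.items():
            if val is None or str(val).strip().lower() in MISSING_TOKENS:
                hits.append((i, col))
    return hits


def find_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    dups = []
    for i, row in enumerate(rows):
        key = tuple(sorted((k, str(v)) for k, v in row.items()))
        if key in seen:
            dups.append(i)
        else:
            seen.add(key)
    return dups


def find_typos(values, min_ratio=0.8, min_support=2):
    """Map each rare value to a frequent value it closely resembles.

    A value is "frequent" if it occurs at least min_support times
    (hypothetical threshold); similarity is difflib's ratio.
    """
    counts = Counter(v for v in values if v)
    frequent = [v for v, c in counts.items() if c >= min_support]
    suggestions = {}
    for value, count in counts.items():
        if count >= min_support or not frequent:
            continue
        best = max(frequent, key=lambda f: SequenceMatcher(None, value, f).ratio())
        if SequenceMatcher(None, value, best).ratio() >= min_ratio:
            suggestions[value] = best
    return suggestions


rows = [
    {"city": "Paris", "age": 31},
    {"city": "Paris", "age": 31},
    {"city": "Pariss", "age": None},
    {"city": "Lyon", "age": 45},
    {"city": "Lyon", "age": "N/A"},
]
print(find_missing(rows))                      # [(2, 'age'), (4, 'age')]
print(find_duplicates(rows))                   # [1]
print(find_typos([r["city"] for r in rows]))   # {'Pariss': 'Paris'}
```

Each check returns the evidence behind its decision (a cell position, a repeated row index, or the frequent value a suspected typo resembles), which is one simple way to keep such rules explainable. Outliers and logical errors, which the findings flag as open challenges, are deliberately left out of this sketch.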
Abstract
In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability,…
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Big Data Technologies and Applications
Methods: Sparse Evolutionary Training
