Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets
Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta, Sandeep Hans, Pranay Lohia, Aniya Aggarwal, Diptikalyan Saha

TL;DR
This paper introduces the Data Quality Toolkit, a system designed to automatically assess, explain, and remediate data issues specifically for machine learning datasets, improving data readiness and pipeline efficiency.
Contribution
The paper presents a novel toolkit that targets machine learning-specific data quality issues, integrating detection, explanation, and remediation in an automated manner.
Findings
Toolkit reduces data preparation time
Automates detection of ML-specific data issues
Available via IBM API Hub for easy access
Abstract
The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks. However, these techniques cannot detect data issues that arise in the context of machine learning tasks, such as noisy labels or overlapping classes. We revisit data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce the Data Quality Toolkit for machine learning as a library of some key quality metrics and relevant remediation techniques to analyze and enhance the readiness of structured training datasets for machine learning…
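To make the idea of an ML-specific check such as noisy-label detection concrete, here is a minimal sketch. It is not the toolkit's API or the authors' method: the function name `flag_suspect_labels`, the nearest-neighbour voting heuristic, and the toy data are all assumptions chosen for illustration. The heuristic flags a row when its label disagrees with the majority label of its k nearest neighbours.

```python
from collections import Counter
import math


def flag_suspect_labels(points, labels, k=3):
    """Return indices whose label differs from the majority label
    of their k nearest neighbours (hypothetical helper, illustration only)."""
    suspects = []
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point, nearest first.
        neighbours = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        # Majority vote over the neighbours' labels.
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            suspects.append(i)
    return suspects


# Toy data (made up): two well-separated clusters, with index 3 mislabeled.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15)]
labels = ["a", "a", "a", "b", "b", "b", "b", "b"]

print(flag_suspect_labels(points, labels, k=3))  # prints [3]
```

A flagged index is only a candidate for review. In this sketch, remediation (relabeling or dropping the row) is left to the user, and an odd k avoids tied votes when there are two classes.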
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Data Quality and Management · Machine Learning and Data Classification
