A Survey on Data Quality Dimensions and Tools for Machine Learning
Yuhan Zhou, Fengjiao Tu, Kewei Sha, Junhua Ding, Haihua Chen

TL;DR
This survey reviews recent data quality tools for machine learning, discusses their strengths and limitations, and explores emerging trends, such as using large language models to improve data quality in data-centric AI.
Contribution
It provides a comprehensive comparison of 17 data quality tools, introduces a framework of DQ dimensions and metrics, and proposes a roadmap for future open-source DQ tool development.
Findings
Comparison of 17 DQ tools and their features
Identification of challenges and emerging trends in DQ for ML
Potential of LLMs and generative AI in DQ evaluation
Abstract
Machine learning (ML) technologies play a substantial role in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools from the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on a discussion of the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement…
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Big Data Technologies and Applications
