Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning
Andreas Loizou, Dimitrios Tsoumakos

TL;DR
The paper introduces Chunked Data Shapley (C-DaSh), a scalable method for assessing dataset quality that approximates Data Shapley values at the level of data chunks, making quality evaluation practical on large machine learning datasets.
Contribution
C-DaSh divides a dataset into chunks and estimates the contribution of each chunk, significantly reducing computation time compared with existing Data Shapley methods.
Findings
Achieves 80x to 2300x speedup over existing Shapley methods.
Effectively detects low-quality data regions in large datasets.
Maintains high accuracy in data quality assessment.
Abstract
As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley, which quantifies the value of individual data points within a dataset. State-of-the-art methods for scaling the NP-hard Shapley computation still face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present Chunked Data Shapley (C-DaSh), a Data Shapley approach for identifying a dataset's high-quality data tuples. C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high quality…
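The chunk-level idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not detail the "optimized subset selection", so plain Monte Carlo permutation sampling over chunks is swapped in here, and the utility is the validation accuracy of a logistic regression trained with a single SGD pass. The function names (one_pass_sgd_accuracy, chunk_shapley), the contiguous chunking, and all parameter defaults are assumptions made for illustration only.

```python
import numpy as np

def one_pass_sgd_accuracy(X_tr, y_tr, X_val, y_val, lr=0.1, seed=0):
    """Utility function (assumed): validation accuracy of a logistic
    regression trained with exactly one SGD pass over the training rows."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_tr.shape[1])
    b = 0.0
    for i in rng.permutation(len(X_tr)):
        z = np.clip(X_tr[i] @ w + b, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        g = p - y_tr[i]                            # gradient of the log loss
        w -= lr * g * X_tr[i]
        b -= lr * g
    pred = (X_val @ w + b) > 0
    return float(np.mean(pred == y_val))

def chunk_shapley(X, y, X_val, y_val, n_chunks=8, n_perms=30, seed=0):
    """Monte Carlo permutation estimate of one Shapley value per chunk.
    Chunks are contiguous row ranges (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(np.arange(len(X)), n_chunks)
    values = np.zeros(n_chunks)
    empty = np.empty(0, dtype=int)
    base = one_pass_sgd_accuracy(X[empty], y[empty], X_val, y_val)
    for _ in range(n_perms):
        idx, prev = empty, base
        for c in rng.permutation(n_chunks):
            # Sort so the utility depends only on the set of rows,
            # not on the order in which chunks were added.
            idx = np.sort(np.concatenate([idx, chunks[c]]))
            score = one_pass_sgd_accuracy(X[idx], y[idx], X_val, y_val)
            values[c] += score - prev              # marginal contribution
            prev = score
    return values / n_perms, chunks
```

Calling chunk_shapley(X, y, X_val, y_val) with NumPy arrays and 0/1 labels returns one value per chunk together with the row indices of each chunk. Chunks with low or negative values are candidates for low-quality regions under this toy utility. Because each permutation's marginal contributions telescope, the estimates sum to the utility of the full dataset minus the utility of the empty set.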
