Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning

Andreas Loizou; Dimitrios Tsoumakos

arXiv:2508.16255·cs.LG·August 25, 2025

TL;DR

The paper introduces Chunked Data Shapley (C-DaSh), a scalable, efficient method for assessing dataset quality by approximating Data Shapley values, enabling practical evaluation on large datasets in machine learning.

Contribution

C-DaSh divides a dataset into chunks and estimates the contribution of each chunk, significantly reducing computation time compared to existing Data Shapley methods.
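
The chunk-level idea can be illustrated with a minimal sketch, assuming plain Monte Carlo permutation sampling and a caller-supplied utility function. This is not the paper's algorithm: its optimized subset selection and single-iteration stochastic gradient descent are not reproduced here, and the names `chunk_indices`, `chunk_shapley`, and `utility` are hypothetical.

```python
import random
from typing import Callable, List


def chunk_indices(n: int, num_chunks: int) -> List[List[int]]:
    """Split indices 0..n-1 into contiguous chunks of near-equal size."""
    base, extra = divmod(n, num_chunks)
    chunks, start = [], 0
    for c in range(num_chunks):
        size = base + (1 if c < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def chunk_shapley(
    chunks: List[List[int]],
    utility: Callable[[List[int]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate one Shapley value per chunk by averaging marginal gains
    over random orderings of the chunks (assumed estimator, not C-DaSh)."""
    rng = random.Random(seed)
    values = [0.0] * len(chunks)
    for _ in range(num_permutations):
        order = list(range(len(chunks)))
        rng.shuffle(order)
        selected: List[int] = []
        prev = utility(selected)  # utility of the empty training set
        for c in order:
            selected = selected + chunks[c]
            cur = utility(selected)
            values[c] += cur - prev  # marginal gain from adding chunk c
            prev = cur
    return [v / num_permutations for v in values]


if __name__ == "__main__":
    # Toy utility: each clean point adds 1, each corrupted point subtracts 1.
    quality = [1.0] * 7 + [-1.0] * 3
    chunks = chunk_indices(len(quality), 3)
    print(chunk_shapley(chunks, lambda idx: sum(quality[i] for i in idx)))
```

In practice `utility` would train a model on the selected points and return a validation score. Valuing chunks instead of individual points shrinks the number of players in the game from the dataset size to the number of chunks, which is where a chunked scheme saves work.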

Findings

01. Achieves an 80x to 2300x speedup over existing Shapley methods.

02. Effectively detects low-quality data regions in large datasets.

03. Maintains high accuracy in data quality assessment.

Abstract

As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach to evaluating data quality is Data Shapley, which quantifies the value of individual data points within a dataset. Even state-of-the-art methods for scaling the NP-hard Shapley computation face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present Chunked Data Shapley (C-DaSh), a Data Shapley approach for identifying a dataset's high-quality data tuples. C-DaSh divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high quality…
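
For reference, the standard game-theoretic Shapley value behind Data Shapley can be written as below. The notation is an assumption rather than the paper's own: $D$ is the training set with $n$ points, $V(S)$ is the utility (for example, validation performance) of a model trained on subset $S$, and $\phi_i$ is the value assigned to point $i$.

```latex
\phi_i \;=\; \sum_{S \subseteq D \setminus \{i\}}
  \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}
  \,\bigl[\, V\bigl(S \cup \{i\}\bigr) - V(S) \,\bigr]
```

The sum ranges over all $2^{n-1}$ subsets that exclude point $i$, and each term requires evaluating $V$, which is why exact computation becomes infeasible as $n$ grows.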
