A Distributional Framework for Data Valuation
Amirata Ghorbani, Michael P. Kim, James Zou

TL;DR
This paper introduces distributional Shapley, a data valuation framework that defines the value of a point relative to an underlying data distribution rather than a fixed dataset. It offers statistical stability, faster computation, and valuations for points outside the dataset.
Contribution
The paper proposes distributional Shapley, which extends data valuation from a fixed dataset to the underlying data distribution. It establishes theoretical properties of the framework, gives a new estimation algorithm, and presents practical applications.
Findings
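Distributional Shapley values are stable under perturbations to the data points themselves and to the underlying data distribution.
The new estimation algorithm is two orders of magnitude faster than existing algorithms for computing (non-distributional) data Shapley values.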
Application to data markets demonstrates practical utility.
Abstract
Shapley value is a classic notion from game theory, historically used to quantify the contributions of individuals within groups, and more recently applied to assign values to data points when training machine learning models. Despite its foundational role, a key limitation of the data Shapley framework is that it only provides valuations for points within a fixed data set. It does not account for statistical aspects of the data and does not give a way to reason about points outside the data set. To address these limitations, we propose a novel framework -- distributional Shapley -- where the value of a point is defined in the context of an underlying data distribution. We prove that distributional Shapley has several desirable statistical properties; for example, the values are stable under perturbations to the data points themselves and to the underlying data distribution. We leverage…
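The idea of valuing a point "in the context of an underlying data distribution" can be illustrated with a small Monte Carlo sketch. It assumes the value of a point z is its expected marginal contribution to a random set of i.i.d. draws from the distribution, with the set size (including z) chosen uniformly from 1 to m. The names `distributional_shapley`, `sample_point`, and `utility` are hypothetical, and the toy utility (negative squared error of a mean estimate, with an arbitrary baseline for the empty set) is an assumption chosen for illustration. This is not the paper's estimation algorithm.

```python
import random
from statistics import fmean


def distributional_shapley(z, sample_point, utility, m, n_iter=2000, seed=0):
    """Monte Carlo estimate of the value of point z (hypothetical helper).

    Averages the marginal contribution U(S + [z]) - U(S) over random sets S
    of k - 1 i.i.d. draws, with k uniform on 1..m.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iter):
        k = rng.randint(1, m)  # set size including z, uniform on 1..m
        subset = [sample_point(rng) for _ in range(k - 1)]  # k - 1 draws from D
        total += utility(subset + [z]) - utility(subset)  # marginal contribution
    return total / n_iter


def sample_point(rng):
    """Toy distribution D (assumption): standard normal scalars."""
    return rng.gauss(0.0, 1.0)


def utility(points):
    """Toy utility (assumption): negative squared error of the sample mean
    as an estimate of the true mean 0; the empty set gets a baseline of -1."""
    if not points:
        return -1.0
    return -(fmean(points) ** 2)


# A point near the centre of D versus a far outlier.
value_typical = distributional_shapley(0.0, sample_point, utility, m=20)
value_outlier = distributional_shapley(5.0, sample_point, utility, m=20)
print(value_typical, value_outlier)
```

Because the sets are drawn from the distribution instead of being fixed, the same function can score any candidate point, including one that never appeared in a collected dataset. In this toy setup the outlier should receive the lower value, since adding it pulls the mean estimate away from the truth.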
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Auction Theory and Applications · Sports Analytics and Performance
