TL;DR
This paper introduces efficient algorithms for computing the Shapley value for data valuation in K-nearest neighbor models, significantly reducing computational complexity and enabling practical valuation at large scales.
Contribution
It presents the first exact $O(N \,\log N)$ algorithm for Shapley value computation in unweighted KNN and develops a sublinear LSH-based approximation method, extending valuation to various scenarios.
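As a rough illustration of how an exact $O(N \log N)$ computation can look for a single test point, the sketch below sorts the training points by distance and then fills in values from the farthest point to the nearest in one pass. The recursion follows the form usually given for the unweighted KNN classification case and should be checked against the paper; it is an assumption here, not something stated on this page. The function name `knn_shapley_single`, the Euclidean metric, the toy data, and the utility (the fraction of the K nearest neighbors whose label matches the test label) are all illustrative choices.

```python
import numpy as np

def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    """Sketch: Shapley value of each training point for ONE test point,
    assuming an unweighted KNN classifier whose utility is the fraction of
    the k nearest neighbors that share the test label (an assumption)."""
    n = len(y_train)
    # Sort training points by Euclidean distance to the test point.
    # This sort is the O(N log N) step; everything after it is linear.
    dist = np.linalg.norm(x_train - x_test, axis=1)
    order = np.argsort(dist, kind="stable")
    match = (y_train[order] == y_test).astype(float)

    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n  # value of the farthest point
    # Walk from the second-farthest point toward the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of this point in the sorted order
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    values = np.empty(n)
    values[order] = s  # map values back to the original training order
    return values

# Toy data (hypothetical): three 1-D points, test point at 0 with label 1.
x_train = np.array([[2.0], [0.1], [1.0]])
y_train = np.array([1, 1, 0])
values = knn_shapley_single(x_train, y_train, np.array([0.0]), 1, k=1)
print(values)  # approximately [0.333, 0.833, -0.167]
```

The values sum to the utility of the full training set (here 1, since the nearest neighbor has the correct label), which is a quick sanity check. For a utility defined as an average over several test points, one would presumably average these per-test-point values.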
Findings
Exact algorithm runs up to 1000x faster than baseline.
LSH-based approximation achieves sublinear complexity.
Algorithms scale to datasets with up to 10 million points.
Abstract
Given a data set $D$ containing millions of data points and a data consumer who is willing to pay for $\$X$ to train a machine learning (ML) model over $D$, how should we distribute this $\$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all $N$ data points, it requires $O(2^N)$ model evaluations for exact computation and $O(N \log N)$ for $(\epsilon, \delta)$-approximation. In this paper, we focus on one popular family of ML models relying on $K$-nearest neighbors ($K$NN). The most surprising result is that for unweighted $K$NN classifiers…
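To make the cost of the general case concrete, here is a minimal brute-force sketch of the standard Shapley value definition, which averages each point's marginal contribution over all subsets of the other points. The function name `exact_shapley` and the toy 1-nearest-neighbor utility are hypothetical and chosen only for illustration; the number of utility evaluations grows exponentially in the number of points, so this is usable only for tiny inputs.

```python
from itertools import combinations
from math import comb

def exact_shapley(n, utility):
    """Brute-force Shapley values for players 0..n-1.
    `utility` maps a frozenset of player indices to a number.
    Enumerates every subset of the other players: exponential in n."""
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of each subset of this size in the Shapley average.
            weight = 1.0 / (n * comb(n - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values

# Toy 1-nearest-neighbor utility (hypothetical): points are indexed by
# distance rank to a test point, and match[j] is 1 if point j has the
# test label. The utility of a subset is the match of its nearest member.
match = [1, 0, 1]

def one_nn_utility(s):
    return match[min(s)] if s else 0

print(exact_shapley(3, one_nn_utility))  # approximately [0.833, -0.167, 0.333]
```

Even at three points the inner loops evaluate the utility on every subset of the remaining points, which is the exponential blow-up that specialized algorithms aim to avoid.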
