Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms

Ruoxi Jia; David Dao; Boxin Wang; Frances Ann Hubis; Nezihe Merve Gurel; Bo Li; Ce Zhang; Costas J. Spanos; Dawn Song

arXiv:1908.08619 · cs.LG · March 31, 2020


TL;DR

This paper introduces efficient algorithms for computing the Shapley value for data valuation in K-nearest neighbor models, significantly reducing computational complexity and enabling practical valuation at large scales.

Contribution

It presents the first exact $O(N \,\log N)$ algorithm for Shapley value computation in unweighted KNN and develops a sublinear LSH-based approximation method, extending valuation to various scenarios.
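
As a rough illustration of the exact algorithm mentioned above, the sketch below values every training point with respect to a single test point under an unweighted $K$NN classifier. It assumes the utility of a subset is the number of its $K$ nearest members whose label matches the test label, divided by $K$, and it uses the sort-then-recurse form in which this result is usually stated. The function name `knn_shapley_single`, the Euclidean metric, and the restriction $1 \le K \le N$ are choices made for this example and do not come from this page.

```python
import math


def knn_shapley_single(x_train, y_train, x_test, y_test, k):
    """Shapley value of each training point for one test point (unweighted KNN).

    Hypothetical helper for illustration; assumes 1 <= k <= len(x_train).
    """
    n = len(x_train)
    if not 1 <= k <= n:
        raise ValueError("this sketch assumes 1 <= k <= len(x_train)")

    # Sorting by distance to the test point is the O(N log N) step.
    order = sorted(range(n), key=lambda i: math.dist(x_train[i], x_test))
    match = [1.0 if y_train[i] == y_test else 0.0 for i in order]

    # s[pos] is the value of the point at sorted position pos (0 = nearest).
    s = [0.0] * n
    s[n - 1] = match[n - 1] / n  # farthest point
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of this point
        s[pos] = s[pos + 1] + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank

    # Map the values back to the original ordering of the training set.
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = s[pos]
    return values


x_train = [(0.0, 1.0), (2.0, 0.0), (0.5, 0.5), (3.0, 3.0)]
y_train = [1, 0, 1, 0]
print(knn_shapley_single(x_train, y_train, (0.0, 0.0), 1, k=2))
```

After the sort, the recursion is a single linear pass, so sorting dominates the cost. For several test points, the per-test-point values can be averaged, since the Shapley value is linear in the utility.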

Findings

01. Exact algorithm runs up to 1000x faster than baseline.

02. LSH-based approximation achieves sublinear complexity.

03. Algorithms scale to datasets with up to 10 million points.

Abstract

Given a data set $D$ containing millions of data points and a data consumer who is willing to pay $\$X$ to train a machine learning (ML) model over $D$, how should we distribute this $\$X$ to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all $N$ data points, it requires $O(2^{N})$ model evaluations for exact computation and $O(N \log N)$ for $(\epsilon, \delta)$-approximation. In this paper, we focus on one popular family of ML models relying on $K$-nearest neighbors ($K$NN). The most surprising result is that for unweighted $K$NN classifiers…
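
To make the $O(2^{N})$ cost for general utility functions concrete, here is a minimal brute-force sketch of the Shapley value using the standard subset form $s_i = \frac{1}{N} \sum_{S \subseteq D \setminus \{i\}} \binom{N-1}{|S|}^{-1} \left[\nu(S \cup \{i\}) - \nu(S)\right]$. The function name `exact_shapley` and the toy additive utility are assumptions made for this example; in data valuation, `utility` would stand for training and evaluating a model on the given subset.

```python
from itertools import combinations
from math import comb


def exact_shapley(n, utility):
    """Exact Shapley values for players 0..n-1 (hypothetical helper).

    Calls `utility` once on each of the 2**n subsets, so it is only
    practical for very small n.
    """
    # Evaluate and cache the utility of every subset: 2**n evaluations.
    cache = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            key = frozenset(subset)
            cache[key] = utility(key)

    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size that excludes player i.
            weight = 1.0 / (n * comb(n - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (cache[s | {i}] - cache[s])
        values.append(total)
    return values


# Toy additive utility: each point contributes a fixed amount on its own.
worth = [1.0, 2.0, 4.0]
print(exact_shapley(3, lambda s: sum(worth[i] for i in s)))
```

For an additive utility like this one, each point's Shapley value is just its own contribution, which makes the toy case easy to check by hand. The number of utility evaluations doubles with every added point.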
