TL;DR
DUPRE is a Gaussian process-based framework that predicts the utility of data subsets instead of retraining a model for each one, making data valuation in machine learning faster while keeping utility estimates accurate.
Contribution
The paper presents DUPRE, a novel approach that uses Gaussian process regression with a sliced Wasserstein distance kernel to predict data utility, significantly reducing evaluation costs.
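The idea of a GP over data subsets with a sliced Wasserstein kernel can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the kernel form (a Gaussian of the squared sliced Wasserstein distance), the fixed random projections, the quantile grid, the hyperparameters, and `toy_utility` (a stand-in for a real utility such as validation accuracy after retraining) are all hypothetical choices made for this example.

```python
import numpy as np

def sw_embedding(points, directions, levels):
    """Embed a point set as its projected quantiles.

    With fixed directions and quantile levels, the squared Euclidean distance
    between two embeddings is a Monte Carlo estimate of the squared
    sliced Wasserstein-2 distance between the two empirical distributions.
    """
    proj = points @ directions.T                # (n_points, n_dirs)
    q = np.quantile(proj, levels, axis=0)       # (n_levels, n_dirs)
    return q.ravel() / np.sqrt(q.size)          # scaling turns the sum into a mean

def sw_kernel(embeds_a, embeds_b, length_scale=1.0):
    """Gaussian kernel on the (approximate) sliced Wasserstein distance."""
    d2 = ((embeds_a[:, None, :] - embeds_b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_predict(train_embeds, train_utils, test_embeds, length_scale=1.0, noise=1e-3):
    """Standard GP regression: posterior mean and variance of subset utilities."""
    K = sw_kernel(train_embeds, train_embeds, length_scale)
    K = K + noise * np.eye(len(train_embeds))
    K_star = sw_kernel(test_embeds, train_embeds, length_scale)
    mean_u = train_utils.mean()                 # constant prior mean (assumption)
    alpha = np.linalg.solve(K, train_utils - mean_u)
    mean = mean_u + K_star @ alpha
    var = 1.0 - np.einsum("ij,ij->i", K_star, np.linalg.solve(K, K_star.T).T)
    return mean, np.maximum(var, 0.0)

def toy_utility(subset):
    """Hypothetical utility; a real one would retrain and validate a model."""
    return -np.linalg.norm(subset.mean(axis=0))

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 2))
directions = rng.normal(size=(16, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
levels = np.linspace(0.05, 0.95, 10)

# Sample random subsets, evaluate the utility on 30 of them, predict the other 10.
subsets = [data[rng.choice(60, size=rng.integers(5, 30), replace=False)]
           for _ in range(40)]
embeds = np.stack([sw_embedding(s, directions, levels) for s in subsets])
utils = np.array([toy_utility(s) for s in subsets])
pred_mean, pred_var = gp_predict(embeds[:30], utils[:30], embeds[30:])
```

Because the projections and quantile levels are fixed, the kernel here is an ordinary Gaussian kernel on the embedding vectors, which keeps the Gram matrix positive semidefinite in this sketch. The predictive variance gives a rough signal of which subsets might still be worth evaluating by actual retraining.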
Findings
DUPRE achieves low prediction error in utility estimation.
It significantly speeds up data valuation processes.
The method is effective across various datasets and models.
Abstract
Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset. Our key…
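To make the cost structure concrete, here is a minimal sketch of permutation-sampling Shapley estimation, a common Monte Carlo estimator for Data Shapley. It is an illustration, not the paper's algorithm: the function name `monte_carlo_shapley`, its parameters, and the additive toy utility are assumptions made for this example. Every call to `utility` corresponds to one subset evaluation, which in practice means retraining a model; that call is the one a utility predictor could replace.

```python
import random

def monte_carlo_shapley(n_points, utility, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over random permutations of the data points."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        subset = []
        prev = utility(frozenset())            # utility of the empty set
        for i in order:
            subset.append(i)
            cur = utility(frozenset(subset))   # one (possibly costly) evaluation
            values[i] += cur - prev            # marginal contribution of point i
            prev = cur
    return [v / n_permutations for v in values]

# Hypothetical additive utility: each point contributes a fixed weight,
# so the exact Shapley value of point i is weights[i].
weights = [0.1, 0.3, 0.6]

def additive_utility(subset):
    return sum(weights[i] for i in subset)

shapley = monte_carlo_shapley(len(weights), additive_utility)
```

With `n_points` data points, each permutation triggers `n_points + 1` utility evaluations, so lowering the cost of each evaluation compounds across the whole estimate.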
