DUPRE: Data Utility Prediction for Efficient Data Valuation

Kieu Thao Nguyen Pham; Rachael Hwee Ling Sim; Quoc Phong Nguyen; See Kiong Ng; Bryan Kian Hsiang Low

arXiv:2502.16152·cs.LG·July 9, 2025

TL;DR

DUPRE introduces a Gaussian process-based framework that predicts data utility for efficient data valuation in machine learning, reducing the need for costly model retraining and enabling faster, more accurate data valuation.

Contribution

The paper presents DUPRE, a novel approach that uses Gaussian process regression with a sliced Wasserstein distance kernel to predict data utility, significantly reducing evaluation costs.
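A minimal sketch of the idea, not the paper's implementation: each data subset is mapped to a vector of quantiles of its random 1-D projections, so the mean squared difference between two such vectors approximates a squared sliced Wasserstein distance, and a Gaussian kernel on that distance drives a standard GP posterior. The names `sw_embedding`, `sw_kernel`, `gp_predict`, and `toy_utility`, the quantile grid, the unit prior variance, and the toy utility itself are all assumptions made for illustration.

```python
import numpy as np

def sw_embedding(X, thetas, levels):
    """Map a data subset (n x d) to a fixed-length vector of projected quantiles."""
    proj = X @ thetas.T                               # (n, n_proj) 1-D projections
    return np.quantile(proj, levels, axis=0).ravel()  # (n_levels * n_proj,)

def sw_kernel(E1, E2, lengthscale):
    """Gaussian kernel on the approximate squared sliced Wasserstein distance."""
    d2 = ((E1[:, None, :] - E2[None, :, :]) ** 2).mean(axis=-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def gp_predict(E_train, y_train, E_test, lengthscale=1.0, noise=1e-3):
    """GP posterior mean and variance with a constant (empirical) prior mean."""
    K = sw_kernel(E_train, E_train, lengthscale) + noise * np.eye(len(y_train))
    K_s = sw_kernel(E_test, E_train, lengthscale)
    mean_y = y_train.mean()
    mu = mean_y + K_s @ np.linalg.solve(K, y_train - mean_y)
    v = np.linalg.solve(K, K_s.T)
    var = 1.0 - np.sum(K_s * v.T, axis=1)             # prior variance is 1
    return mu, np.maximum(var, 0.0)

rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 2))                      # toy data pool
thetas = rng.normal(size=(16, 2))                     # random projection directions
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
levels = np.linspace(0.05, 0.95, 10)                  # shared quantile grid

def toy_utility(X):
    """Hypothetical stand-in for 'retrain the model and measure validation accuracy'."""
    return -np.linalg.norm(X.mean(axis=0) - pool.mean(axis=0))

subsets = [pool[rng.choice(200, size=rng.integers(5, 50), replace=False)]
           for _ in range(30)]
E = np.stack([sw_embedding(S, thetas, levels) for S in subsets])
y = np.array([toy_utility(S) for S in subsets])

# Fit on 20 evaluated subsets, predict the utility of the remaining 10.
mu, var = gp_predict(E[:20], y[:20], E[20:])
K = sw_kernel(E, E, 1.0)
```

Using a shared quantile grid lets subsets of different sizes be compared, and because the resulting distance is Euclidean in the embedding space, the Gaussian kernel stays positive semi-definite.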

Findings

01 DUPRE achieves low prediction error in utility estimation.

02 It significantly speeds up data valuation processes.

03 The method is effective across various datasets and models.

Abstract

Data valuation is increasingly used in machine learning (ML) to decide the fair compensation for data owners and identify valuable or harmful data for improving ML models. Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility (e.g., validation accuracy) and retraining the ML model for multiple data subsets. While most existing works on efficient estimation of the Shapley values have focused on reducing the number of subsets to evaluate, our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset. Our key…
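To make the cost structure concrete, here is a small sketch of exact Shapley values over three data points, where every marginal contribution requires one call to a utility function. The function name `shapley_values`, the toy utility, and its weights are hypothetical; in practice the `utility` callable would either retrain a model on the subset or, following the approach described above, return a predicted utility.

```python
from itertools import permutations

def shapley_values(n, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for perm in perms:
        chosen = frozenset()
        prev = utility(chosen)
        for i in perm:
            chosen = chosen | {i}
            cur = utility(chosen)      # one utility evaluation per step
            phi[i] += cur - prev       # marginal contribution of point i
            prev = cur
    return [p / len(perms) for p in phi]

calls = 0

def toy_utility(subset):
    """Hypothetical utility: additive weights plus a bonus when points 0 and 1 co-occur."""
    global calls
    calls += 1
    weights = [0.1, 0.2, 0.3]
    bonus = 0.2 if {0, 1} <= subset else 0.0
    return sum(weights[i] for i in subset) + bonus

phi = shapley_values(3, toy_utility)
full = toy_utility(frozenset({0, 1, 2}))
empty = toy_utility(frozenset())
```

Even this three-point game triggers dozens of utility calls, which is why replacing each call with a cheap prediction complements methods that reduce the number of subsets evaluated.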

Code & Models

Repositories

kakaeriol/uncertainty_shapley (PyTorch, official)
