Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value
Le Ma, Shirao Yang, Zihao Wang, Yinggui Wang, Lei Wang, Tao Wei, Kejun Zhang

TL;DR
This paper introduces Unlearning Shapley, a scalable data valuation method using machine unlearning and Shapley values, enabling efficient and practical contribution measurement of data in large models without full retraining.
Contribution
It proposes a novel framework that combines machine unlearning with Monte Carlo sampling to estimate data values efficiently, supporting partial data valuation for large models.
Findings
Matches state-of-the-art accuracy in data valuation
Reduces computational costs by orders of magnitude
Strong correlation between estimated and true data impact
Abstract
The proliferation of large models has intensified the need for efficient data valuation methods to quantify the contribution of individual data providers. Traditional approaches, such as game-theory-based Shapley value and influence-function-based techniques, face prohibitive computational costs or require access to full data and model training details, which makes partial data valuation difficult to achieve. To address this, we propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently. By unlearning target data from a pretrained model and measuring performance shifts on a reachable test set, our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data. Crucially, Unlearning Shapley supports both full and partial data valuation, making it scalable for large models (e.g., LLMs)…
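The loop the abstract describes (unlearn a coalition of providers, measure the performance shift on a test set, average marginal shifts over sampled orderings) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy "model" is just the mean of the training points, so unlearning is exact, and the names `utility_after_unlearning` and `unlearning_shapley`, the squared-error utility, and the choice to define a coalition's worth as the performance drop caused by unlearning it are all assumptions made for illustration.

```python
import random

def utility_after_unlearning(data, removed, test_target):
    """Hypothetical utility: negative squared error on one test target.

    The toy model predicts the mean of all retained points, so 'unlearning'
    a set of providers simply drops their points (an exact stand-in for a
    real unlearning algorithm, which this sketch does not implement).
    """
    kept = [x for p, pts in data.items() if p not in removed for x in pts]
    if not kept:
        return -(test_target ** 2)  # assumption: an empty model predicts 0
    pred = sum(kept) / len(kept)
    return -((pred - test_target) ** 2)

def unlearning_shapley(data, test_target, num_permutations=2000, seed=0):
    """Monte Carlo (permutation-sampling) Shapley estimate.

    Assumed game: the worth of a coalition S is the performance drop
    observed after unlearning S from the model trained on all data.
    """
    rng = random.Random(seed)
    providers = list(data)
    base = utility_after_unlearning(data, set(), test_target)
    values = {p: 0.0 for p in providers}
    for _ in range(num_permutations):
        rng.shuffle(providers)
        removed = set()
        prev_drop = 0.0
        for p in providers:
            removed.add(p)
            drop = base - utility_after_unlearning(data, removed, test_target)
            values[p] += drop - prev_drop  # marginal contribution of p
            prev_drop = drop
    return {p: v / num_permutations for p, v in values.items()}

if __name__ == "__main__":
    # Three hypothetical providers; "c" holds an outlier far from the target.
    toy_data = {"a": [1.0, 1.2], "b": [0.9], "c": [5.0]}
    print(unlearning_shapley(toy_data, test_target=1.0))
```

In this toy setup the provider holding the outlier receives a negative value, because unlearning it improves test performance in every ordering. For a real model, the exact-removal step would be replaced by an approximate unlearning procedure, which is where the claimed savings over retraining would come from.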
Taxonomy
Topics: Data Quality and Management · Mobile Crowdsensing and Crowdsourcing · Privacy-Preserving Technologies in Data
