Data Shapley in One Training Run
Jiachen T. Wang, Prateek Mittal, Dawn Song, Ruoxi Jia

TL;DR
This paper introduces In-Run Data Shapley, a scalable method for attributing data contribution to a specific model obtained from a single training run, enabling efficient analysis of large-scale models and pretraining data.
Contribution
It proposes a novel, efficient approach to data attribution that works within a single training run, overcoming the computational limitations of existing Data Shapley methods.
Findings
Enables data attribution for large models during pretraining.
In its most efficient implementation, adds negligible runtime over standard training.
Provides insights into pretraining data's contribution and copyright implications.
Abstract
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model…
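To make the contrast with retraining-based methods concrete, here is a minimal sketch of per-iteration attribution on a toy problem. It is not the paper's implementation. It assumes a scalar linear-regression model, a utility defined as the change in validation loss after one gradient step on a subset of the batch (gradients summed, not averaged), and a sign convention where negative values mean the point lowered validation loss. All function names are hypothetical. Under a first-order Taylor approximation this utility is additive over the batch, so each point's Shapley value reduces to a scaled product of its gradient and the validation gradient. The sketch compares that closed form against exact Shapley values computed by enumerating every subset.

```python
from itertools import combinations
from math import factorial

# Toy model (assumption): scalar linear regression y ≈ w * x with squared loss.
def loss(w, point):
    x, y = point
    return 0.5 * (w * x - y) ** 2

def grad(w, point):
    x, y = point
    return (w * x - y) * x

def utility(subset, w, batch, val_point, lr):
    """Change in validation loss after one gradient step on `subset` of the batch."""
    step = sum(grad(w, batch[i]) for i in subset)
    return loss(w - lr * step, val_point) - loss(w, val_point)

def exact_shapley(w, batch, val_point, lr):
    """Exact Shapley values for one iteration; enumerates all subsets (exponential cost)."""
    n = len(batch)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = utility(subset + (i,), w, batch, val_point, lr)
                without_i = utility(subset, w, batch, val_point, lr)
                total += weight * (with_i - without_i)
        values.append(total)
    return values

def first_order_shapley(w, batch, val_point, lr):
    """First-order Taylor closed form: -lr * grad(validation) * grad(training point)."""
    g_val = grad(w, val_point)
    return [-lr * g_val * grad(w, z) for z in batch]

w0 = 0.0
batch = [(1.0, 2.0), (2.0, 4.1), (1.5, -3.0), (0.5, 1.2)]  # third point is mislabeled
val_point = (1.0, 2.0)
lr = 0.01

exact = exact_shapley(w0, batch, val_point, lr)
approx = first_order_shapley(w0, batch, val_point, lr)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"point {i}: exact={e:+.5f}  first-order={a:+.5f}")
```

The closed form needs only per-example gradients at the current weights, while the exact computation evaluates the utility on every subset of the batch. In this toy setting the mislabeled point receives a positive score (it raises validation loss) and the clean points receive negative scores.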
Peer Reviews
Decision: ICLR 2025 Oral
The proposed method is novel, efficient, and promising. The presentation of the paper is very clear and succinct; technical contributions are clearly stated and rigorously developed. The authors carefully position their work and acknowledge several limitations, which they say will be addressed in future work. Empirical evaluation is thorough, and the paper raises important observations about the societal implications of large-model training.
I do not see any major weaknesses of the paper. Minor (typo): line 500 - sentence "The results of this experiment..." appears twice in a row
1. The motivations, problem statement, objectives, and contributions are provided clearly.
2. The mathematical induction and proof seem solid.
3. The results and implications of the new concept/techniques are described clearly.
Compared with the introduction, background, methods, and mathematical induction, which are presented clearly, the results and analyses are a bit weak. It would be better to provide concrete examples with deeper analysis.
* The introduction of In-Run Data Shapley is a significant advancement over existing retraining-based methods, offering a scalable solution to data attribution in large models.
* The case studies, particularly in data curation and copyright implications, highlight the real-world relevance of the proposed method.
* The closed-form derivations for the Shapley values using first- and second-order Taylor approximations are mathematically sound and well-explained.
* The method's requirement for validation data to be available before training might limit its applicability in scenarios where such data is not readily available.
* Even though the authors address memory concerns using gradient accumulation, the practicality of this approach in extremely large models with limited resources is not fully explored.
* The first- and second-order Taylor approximations are empirically validated, but there is limited discussion on how these approximations might affect
Taxonomy
Topics: Educational Technology Systems · Teaching and Learning Programming
