Scaling Laws for the Value of Individual Data Points in Machine Learning
Ian Covert, Wenlong Ji, Tatsunori Hashimoto, James Zou

TL;DR
This paper investigates how the value of an individual data point scales with dataset size, shows that the scaling exponent varies substantially across data points, and provides estimators that support data valuation and subset selection.
Contribution
It introduces a novel scaling law for individual data point value, supported by theory and empirical validation, and develops estimators for practical application.
Findings
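A data point's contribution to model performance shrinks predictably, in a log-linear manner, as the dataset grows.
The scaling exponent varies significantly across data points: some points are more valuable in small datasets, while others are relatively more useful in large ones.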
Proposed estimators effectively learn individual scaling behaviors.
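To make the log-linear finding concrete, the sketch below fits a per-point scaling exponent by ordinary least squares in log-log space, then finds the dataset size at which two points swap rank. It is an illustration only, not the paper's estimators: the power-law form |value| ≈ c / k^alpha, the function name `fit_scaling_law`, and all numbers are assumptions made up for this example.

```python
import numpy as np

def fit_scaling_law(k, value):
    """Fit |value| ~ c / k**alpha by least squares on log|value| vs. log k.

    Hypothetical helper for illustration; the paper's own estimators
    are not reproduced here.
    """
    k = np.asarray(k, dtype=float)
    value = np.asarray(value, dtype=float)
    # A straight line in log-log space has slope -alpha and intercept log c.
    slope, intercept = np.polyfit(np.log(k), np.log(np.abs(value)), deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free contributions for two hypothetical data points
# (made-up numbers, not results from the paper).
k = np.array([100, 200, 400, 800, 1600])
point_a = 5.0 / k**1.2   # large contribution early, decays quickly
point_b = 0.5 / k**0.8   # smaller contribution early, decays slowly

c_a, alpha_a = fit_scaling_law(k, point_a)
c_b, alpha_b = fit_scaling_law(k, point_b)

# Setting c_a / k**alpha_a equal to c_b / k**alpha_b and solving for k
# gives the dataset size at which the two fitted curves cross.
crossover = (c_a / c_b) ** (1.0 / (alpha_a - alpha_b))

print(f"point A: c={c_a:.3f}, alpha={alpha_a:.3f}")
print(f"point B: c={c_b:.3f}, alpha={alpha_b:.3f}")
print(f"curves cross near k = {crossover:.0f}")
```

In this toy setup, point A is the more valuable one below the crossover size and point B above it, which mirrors the variability in scaling exponents described in the findings. Real contribution estimates are noisy and may change sign, so a plain log-log regression like this one would need more care in practice.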
Abstract
Recent works have shown that machine learning models improve at a predictable rate with the total amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help design a model's training dataset, but they typically take an aggregate view of the data by only considering the dataset's size. We introduce a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point's contribution to a model's performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets while others are relatively more useful as part of large datasets. We provide learning theory to support our scaling…
Taxonomy
Topics: Simulation Techniques and Applications · Stochastic processes and financial applications · Statistical Methods and Inference
