Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression
Tingkai Yan, Haodong Wen, Binghui Li, Kairong Luo, Wenguang Chen, Kaifeng Lyu

TL;DR
This paper provides a theoretical analysis of how repeating datasets across multiple epochs affects scaling laws in linear regression, revealing that larger datasets can be reused more times before benefits diminish.
Contribution
It introduces the effective reuse rate $E(K, N)$ to quantify data reuse benefits and characterizes its behavior for different epoch counts and data distributions, extending understanding of data scaling laws.
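One way to write the definition formally is sketched below. The notation is an assumption made here, not quoted from the paper: $L(K, N)$ stands for the expected test loss after $K$ epochs of training on $N$ samples.

$$
E(K, N) \;:=\; \frac{1}{N}\,\min\bigl\{\, N' \ge 0 \;:\; L(1, N') \le L(K, N) \,\bigr\}
$$

Under this reading, if the one-pass loss $L(1, \cdot)$ is strictly decreasing, then $E(1, N) = 1$, and $E(K, N) = c$ means that $K$ epochs over $N$ samples are worth one pass over $cN$ fresh samples.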
Findings
For small $K$, $E(K, N) \approx K$, indicating linear gains.
As $K$ increases, $E(K, N)$ plateaus at a value that grows with $N$.
Larger datasets can be reused more times before marginal benefits vanish.
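The findings above are stated in terms of $E(K, N)$, which can in principle be estimated from measured losses. The following is a minimal sketch, not the authors' procedure: it assumes you have a one-pass loss curve measured at several dataset sizes, that those losses are positive and strictly decreasing, and that log-log interpolation between measured points is acceptable. The function name `effective_reuse_rate` and the synthetic power-law curve are hypothetical, chosen only for illustration.

```python
import numpy as np


def effective_reuse_rate(one_pass_sizes, one_pass_losses, multi_epoch_loss, n):
    """Estimate E(K, N) = N' / N, where N' is the one-pass dataset size whose
    loss matches `multi_epoch_loss` (the loss of K-epoch training on n samples).

    Hypothetical helper: assumes positive, strictly decreasing one-pass losses.
    """
    sizes = np.asarray(one_pass_sizes, dtype=float)
    losses = np.asarray(one_pass_losses, dtype=float)

    # Sort by dataset size so the loss curve runs from small to large N'.
    order = np.argsort(sizes)
    sizes, losses = sizes[order], losses[order]

    if np.any(np.diff(losses) >= 0):
        raise ValueError("one-pass losses must strictly decrease with dataset size")
    if not (losses[-1] <= multi_epoch_loss <= losses[0]):
        raise ValueError("multi-epoch loss lies outside the measured one-pass curve")

    # np.interp requires increasing x-coordinates, so reverse both arrays and
    # interpolate log(N') as a function of log(loss).
    log_n_equiv = np.interp(
        np.log(multi_epoch_loss), np.log(losses[::-1]), np.log(sizes[::-1])
    )
    return float(np.exp(log_n_equiv) / n)


# Synthetic illustration (not data from the paper): one-pass loss ~ N'^(-1/2).
sizes = np.array([100, 200, 400, 800, 1600], dtype=float)
losses = sizes ** -0.5

# Suppose K-epoch training on n = 100 samples reached the loss that one-pass
# training would reach with 300 samples; the estimate is then approximately 3.
e_est = effective_reuse_rate(sizes, losses, multi_epoch_loss=300.0 ** -0.5, n=100)
print(e_est)
```

Interpolating in log-log space is a design choice suited to power-law-like loss curves; for other curve shapes a different interpolation scheme may be more appropriate.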
Abstract
While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the effective reuse rate of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under…
Peer Reviews
Decision: ICLR 2026 Poster
This is a very nice paper. The core question in the paper is important and well-framed, and finding an interesting but tractable theoretical analysis is a valuable contribution. Solving the linear regression problem in both the strongly convex and Zipf distribution settings is valuable and illustrates the dependence on the data distribution exponent. The proof sketch gave nice intuition about the approach and which techniques were used to bound which terms. The LLM experiments give useful validat
All of the LLM experiments use a constant learning rate schedule with AdamW, rather than some form of learning rate decay (e.g. cosine) as is required for competitive performance in practice. This is a reasonable limitation of a primarily theoretical paper as using a time-horizon-dependent learning rate schedule would require training separate models for every different number of epochs, requiring substantially more compute. (Similarly, they use the same peak learning rate for all the training
1. The paper presents rigorous theoretical analysis with precise characterizations of the effective reuse rate E(K,N) under both strongly convex and Zipf-distributed settings. 2. The central insight that larger datasets can be repeated more times is clearly articulated and challenges existing assumptions in the field. 3. The theoretical predictions are thoroughly validated through two complementary approaches: controlled simulations on synthetic data and large-scale LLM pretraining experiments (
1. The overall conclusions are very similar to the previous work "Improved scaling laws in linear regression via data reuse". 2. The paper still lacks sufficient practical evidence from LLMs. It is well established that LLM performance differs significantly between large and small models. A more meaningful experiment would be to scale across different model sizes and examine how the effective reuse rate varies with model capacity.
1. The paper provides a principled and rigorous theoretical framework to analyze the widely used but poorly understood practice of multi-epoch training. The introduction of the "effective reuse rate" E(K, N) is a clear and valuable conceptual contribution. The key finding that the benefit of data reuse scales with the dataset size (N) is a significant insight. It provides a concrete guideline for practitioners: one can and should repeat larger datasets more times before expecting diminishing ret
1. The primary limitation is the gap between the theoretical setting (linear regression with SGD) and the practical setting of interest (Transformer-based LLMs trained with AdamW). Linear models cannot capture the complex, non-linear feature learning that occurs in deep networks. While the qualitative findings transfer, the exact quantitative predictions (e.g., the saturation point scaling as Θ(log N)) may not hold for Transformers. This is a standard and often necessary simplification in theore
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Stochastic Gradient Optimization Techniques
