Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

Tingkai Yan; Haodong Wen; Binghui Li; Kairong Luo; Wenguang Chen; Kaifeng Lyu

arXiv:2511.13421·cs.LG·March 16, 2026


Open Access · 3 Reviews

TL;DR

This paper provides a theoretical analysis of how repeating datasets across multiple epochs affects scaling laws in linear regression, revealing that larger datasets can be reused more times before benefits diminish.

Contribution

The paper introduces the effective reuse rate $E(K, N)$ to quantify the benefit of data reuse and characterizes its behavior across epoch counts and data distributions, extending the understanding of data scaling laws.
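
One way to write this quantity down, following the verbal definition in the abstract: let $\mathcal{L}(K, N)$ denote the expected test loss after $K$ epochs on $N$ samples. The symbol $\mathcal{L}$ and the use of a minimum are notation assumed here for illustration, not taken from the paper. As a sketch:

$$
E(K, N) = \frac{1}{N} \, \min \left\{ N' : \mathcal{L}(1, N') \le \mathcal{L}(K, N) \right\}
$$

Under this reading, $E(1, N) = 1$ whenever the one-pass loss strictly decreases with dataset size, and $E(K, N) = K$ would mean that each repeated pass is worth as much as a pass over fresh data.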

Findings

01

For small $K$, $E(K, N) \approx K$, indicating linear gains.

02

As $K$ increases, $E(K, N)$ plateaus at a value growing with $N$.

03

Larger datasets can be reused more times before marginal benefits vanish.

Abstract

While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the effective reuse rate of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under…
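
The definition above can be probed numerically with a small simulation. The following is a minimal sketch, not the authors' experimental setup: it assumes isotropic Gaussian features, a constant learning rate, and a coarse grid of one-pass dataset sizes. The names `sgd_excess_risk`, `mean_risk`, and `effective_reuse_rate`, along with every default parameter value, are hypothetical choices made for illustration.

```python
import random


def sgd_excess_risk(n, k, dim=4, lr=0.02, noise=0.1, seed=0):
    """Excess risk of SGD after k shuffled passes over n samples (toy setup)."""
    rng = random.Random(seed)
    w_star = [dim ** -0.5] * dim  # hypothetical ground-truth weights, unit norm
    # Assumption: isotropic Gaussian features with additive Gaussian label noise.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    ys = [sum(a * b for a, b in zip(x, w_star)) + noise * rng.gauss(0.0, 1.0)
          for x in xs]
    w = [0.0] * dim
    order = list(range(n))
    for _ in range(k):
        rng.shuffle(order)  # reshuffle the same n samples every epoch
        for i in order:
            err = sum(a * b for a, b in zip(xs[i], w)) - ys[i]
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]  # SGD step
    # With identity feature covariance, excess risk is 0.5 * ||w - w_star||^2.
    return 0.5 * sum((wj - sj) ** 2 for wj, sj in zip(w, w_star))


def mean_risk(n, k, seeds=20):
    """Average the excess risk over several independent draws of the dataset."""
    return sum(sgd_excess_risk(n, k, seed=s) for s in range(seeds)) / seeds


def effective_reuse_rate(target_loss, one_pass_loss, n, grid):
    """Smallest N' in grid whose one-pass loss reaches target_loss, over n.

    Returns None when no grid point reaches the target.
    """
    for n_prime in sorted(grid):
        if one_pass_loss(n_prime) <= target_loss:
            return n_prime / n
    return None


if __name__ == "__main__":
    n = 64
    grid = [n * m for m in range(1, 9)]  # candidate one-pass sizes N'
    one_pass = {m: mean_risk(m, 1) for m in grid}
    for k in (1, 2, 4):
        rate = effective_reuse_rate(mean_risk(n, k), one_pass.get, n, grid)
        print(f"K={k}: estimated E(K, N) on this grid = {rate}")
```

The estimate is only as fine as the grid and is noisy for a small number of seeds, so it illustrates the definition rather than reproducing the paper's scaling results.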

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

This is a very nice paper. The core question in the paper is important and well-framed, and finding an interesting but tractable theoretical analysis is a valuable contribution. Solving the linear regression problem in both the strongly convex and Zipf distribution settings is valuable and illustrates the dependence on the data distribution exponent. The proof sketch gave nice intuition about the approach and which techniques were used to bound which terms. The LLM experiments give useful validat

Weaknesses

All of the LLM experiments use a constant learning rate schedule with AdamW, rather than some form of learning rate decay (e.g. cosine) as is required for competitive performance in practice. This is a reasonable limitation of a primarily theoretical paper as using a time-horizon-dependent learning rate schedule would require training separate models for every different number of epochs, requiring substantially more compute. (Similarly, they use the same peak learning rate for all the training

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The paper presents rigorous theoretical analysis with precise characterizations of the effective reuse rate E(K,N) under both strongly convex and Zipf-distributed settings.

2. The central insight that larger datasets can be repeated more times is clearly articulated and challenges existing assumptions in the field.

3. The theoretical predictions are thoroughly validated through two complementary approaches: controlled simulations on synthetic data and large-scale LLM pretraining experiments (

Weaknesses

1. The overall conclusions are very similar to the previous work "Improved scaling laws in linear regression via data reuse".

2. The paper still lacks sufficient practical evidence from LLMs. It is well established that LLM performance differs significantly between large and small models. A more meaningful experiment would be to scale across different model sizes and examine how the effective reuse rate varies with model capacity.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The paper provides a principled and rigorous theoretical framework to analyze the widely used but poorly understood practice of multi-epoch training. The introduction of the "effective reuse rate" E(K, N) is a clear and valuable conceptual contribution. The key finding that the benefit of data reuse scales with the dataset size (N) is a significant insight. It provides a concrete guideline for practitioners: one can and should repeat larger datasets more times before expecting diminishing ret

Weaknesses

1. The primary limitation is the gap between the theoretical setting (linear regression with SGD) and the practical setting of interest (Transformer-based LLMs trained with AdamW). Linear models cannot capture the complex, non-linear feature learning that occurs in deep networks. While the qualitative findings transfer, the exact quantitative predictions (e.g., the saturation point scaling as Θ(log N)) may not hold for Transformers. This is a standard and often necessary simplification in theore

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Stochastic Gradient Optimization Techniques