How Does Critical Batch Size Scale in Pre-training?
Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, and Sham Kakade

TL;DR
This paper investigates how the critical batch size in large-scale language model pre-training scales with model and data size, revealing it primarily depends on data size and providing theoretical insights.
Contribution
It introduces a measure of critical batch size, systematically studies its scaling behavior, and offers theoretical justification for its dependence on data rather than model size.
Findings
Critical batch size scales mainly with data size.
Scaling laws are fitted for model and data sizes.
Theoretical analysis supports empirical findings.
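To illustrate what fitting a scaling law for CBS against data size can look like, here is a minimal sketch that fits a power law of the form CBS = a * D^b by least squares in log-log space. The function name `fit_power_law`, the power-law form, and the data points are assumptions made for illustration. The numbers are synthetic and are not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space gives log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic example (hypothetical values): CBS generated as 3 * D**0.5.
tokens = [1e9, 2e9, 4e9, 8e9]
cbs = [3.0 * d ** 0.5 for d in tokens]
a, b = fit_power_law(tokens, cbs)
print(f"fitted prefactor a = {a:.3f}, exponent b = {b:.3f}")
```

Fitting in log space keeps the regression linear and weights each scale equally. A fit of this kind over both model size and data size is one way to separate the two effects, though the paper's actual fitting procedure may differ.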
Abstract
Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify…
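The notion of a threshold beyond which more data parallelism gives diminishing returns can be sketched as follows. Under perfect linear scaling, doubling the batch size halves the number of steps needed to reach a target loss. The sketch reports the smallest measured batch size whose step count exceeds that ideal by more than a chosen overhead. The function name `critical_batch_size`, the 20% overhead threshold, the choice of the smallest batch size as the baseline, and the measurements are all assumptions for illustration, not the paper's exact measure or data.

```python
def critical_batch_size(batch_sizes, steps_to_target, overhead=0.2):
    """Return the smallest measured batch size whose steps-to-target exceed
    the linear-scaling ideal by more than `overhead`, or None if none does.

    The ideal is extrapolated from the smallest batch size:
    ideal_steps(B) = steps(B0) * B0 / B.
    """
    pairs = sorted(zip(batch_sizes, steps_to_target))
    base_batch, base_steps = pairs[0]
    for batch, steps in pairs:
        ideal_steps = base_steps * base_batch / batch
        if steps > (1.0 + overhead) * ideal_steps:
            return batch
    return None


# Hypothetical measurements: steps needed to reach a fixed validation loss.
batch_sizes = [64, 128, 256, 512, 1024]
steps = [8000, 4000, 2100, 1300, 1000]
print(critical_batch_size(batch_sizes, steps))
```

With these made-up numbers, scaling is near-linear up to a batch size of 256 and the overhead first exceeds 20% at 512. A finer batch-size grid or interpolation between measured points would give a less coarse estimate.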
Peer Reviews
Decision·ICLR 2025 Poster
- The paper provides an interesting finding that the critical batch size scales mostly with data set size, and is largely invariant to model size. This is a relevant and, to my knowledge, novel insight.
- The paper considers models ranging from 85 million to 1.2 billion parameters and thus covers a reasonably large domain of models.
- I really liked the highlighted practical takeaway blocks throughout the paper, which made it easy to understand, well-structured, and accessible.
Some of the takeaways seem to me a bit too bold or not backed by enough evidence for the given claim.
- For example, in Section 2.2, they compare the efficiency of learning rate schedules across batch sizes by comparing the number of steps to achieve a given target validation loss. They conclude that "EWA consistently improves model training efficiency. [...] while outperforming Cosine for large batch sizes [...] and even with appropriate learning rate decay, [Cosine] underperforms our constant
- The paper is well-written with clear organization. This work would be a valuable contribution to the ICLR community. I especially appreciated the “key takeaways summary” after each section.
- The experimental design is rigorous; for example, decoupling various hyperparameters makes the claims more convincing.
- Detailed experimental procedures are provided in Appendix D.
- The formalization of CBS (beyond [1] “An empirical model of large-batch training”) would be helpful in the literature. The
- The experimental scope is limited to models up to 1.2B parameters trained on C4, which may not fully capture scaling behaviors at larger scales (e.g., models with over 50B parameters). On a similar note, key ablation studies are primarily conducted on smaller models (with C4). However, given the careful experimental design and clear theoretical analysis, I do not believe that these impact the validity of the findings.
- It would be helpful to have a dedicated section discussing the limitations
1. The paper is well-written and largely easy to follow.
2. The related literature covers the important papers in the topic well.
3. Formalizations and hypotheses are clearly outlined and help understand results better.
4. Though personally I would like to reconsider its exact design and placement, the _Takeaway_ block was helpful while reading the paper first time.
5. The model scales reported in experiments are adequate in applying the insights to large-scale pre-training.
6. Formalizing the n
1. Some lower scale experiment with repetitions over different seeds to show the robustness of the findings (laws, exponents) and insights (data dependence and model scale invariance).
2. The work is mostly a benchmarking study with main contribution relying on the hypothesis constructed and how the experiment for it is setup, which therefore leaves more room for explaining some of the design choices, especially with model scale, hyperparameters (see, Questions below for examples).
3. Section 3.
Taxonomy
Topics: Human Resource Development and Performance Evaluation
