Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training

William Merrill; Shane Arora; Dirk Groeneveld; Hannaneh Hajishirzi

arXiv:2505.23971 · cs.LG · November 7, 2025

TL;DR

This paper introduces an empirical method to measure the critical batch size during language model training, showing how it evolves and can inform batch size warmup strategies to improve training efficiency and scalability.

Contribution

The authors propose a simple empirical approach to measure the critical batch size dynamically during training, demonstrating its evolution and practical application in large-scale language model training.

Findings

01

Critical batch size starts near zero at initialization

02

It increases rapidly and then plateaus during training

03

Batch size warmup based on CBS improves training efficiency

Abstract

The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases…
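For context, the gradient noise scale that the abstract refers to can be illustrated with a small sketch. This is not the empirical method introduced in this paper. It is a plug-in estimate of the "simple" noise scale of McCandlish et al. (2018), tr(Σ) / |G|², where G is the mean gradient and Σ is the per-example gradient covariance. The function name `simple_noise_scale` and the toy gradients are assumptions made for illustration, and the plug-in estimate is biased when few examples are available.

```python
from typing import Sequence


def simple_noise_scale(per_example_grads: Sequence[Sequence[float]]) -> float:
    """Plug-in estimate of tr(Sigma) / |G|^2 from per-example gradients.

    Hypothetical helper for illustration only; with few examples the
    estimate of |G|^2 is biased upward by sampling noise.
    """
    n = len(per_example_grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    dim = len(per_example_grads[0])

    # Mean gradient G, averaged over examples.
    mean_grad = [sum(g[j] for g in per_example_grads) / n for j in range(dim)]

    # Trace of the sample covariance: per-coordinate variances summed,
    # using the (n - 1) normalisation.
    trace_cov = sum(
        (g[j] - mean_grad[j]) ** 2 for g in per_example_grads for j in range(dim)
    ) / (n - 1)

    # Squared norm of the mean gradient.
    sq_norm = sum(m * m for m in mean_grad)
    if sq_norm == 0.0:
        raise ValueError("mean gradient is zero; noise scale is undefined")
    return trace_cov / sq_norm


# Toy per-example gradients (made-up numbers: four examples, two parameters).
grads = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
noise_scale = simple_noise_scale(grads)
```

A larger value means noisier gradients relative to their mean, which under the assumptions of McCandlish et al. suggests that a larger batch can be used efficiently. The abstract's point is that those assumptions are strong, which motivates measuring the CBS directly instead.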

Videos

Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Linear Layer · Cosine Annealing · Multi-Head Attention · Byte Pair Encoding · Attention Is All You Need · Dropout · Residual Connection