Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
William Merrill, Shane Arora, Dirk Groeneveld, Hannaneh Hajishirzi

TL;DR
This paper introduces an empirical method to measure the critical batch size during language model training, showing how it evolves and can inform batch size warmup strategies to improve training efficiency and scalability.
Contribution
The authors propose a simple empirical approach to measure the critical batch size dynamically during training, demonstrating its evolution and practical application in large-scale language model training.
Findings
Critical batch size (CBS) starts near zero at initialization
It increases rapidly and then plateaus during training
Batch size warmup based on CBS improves training efficiency
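To make the last finding concrete, here is a minimal sketch of what a batch size warmup schedule could look like: the batch size is stepped up as training progresses, so that it roughly follows a rising CBS. The function name `batch_size_at`, the schedule format, and every number in `HYPOTHETICAL_SCHEDULE` are assumptions made for illustration; they are not taken from the paper.

```python
from bisect import bisect_right

# Hypothetical schedule: (tokens seen at which the stage starts, batch size).
# The values are illustrative only and are NOT from the paper.
HYPOTHETICAL_SCHEDULE = [
    (0, 256),
    (1_000_000_000, 512),
    (4_000_000_000, 1024),
]


def batch_size_at(tokens_seen, schedule):
    """Return the batch size of the latest stage that has already started.

    `schedule` must be sorted by its start-token thresholds, and the first
    stage should start at 0 so every non-negative token count is covered.
    """
    starts = [start for start, _ in schedule]
    idx = bisect_right(starts, tokens_seen) - 1
    if idx < 0:
        raise ValueError("tokens_seen precedes the first stage of the schedule")
    return schedule[idx][1]


if __name__ == "__main__":
    for tokens in (0, 500_000_000, 1_000_000_000, 10_000_000_000):
        print(tokens, batch_size_at(tokens, HYPOTHETICAL_SCHEDULE))
```

In a real run the thresholds would presumably come from CBS measurements taken during training rather than from fixed constants.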
Abstract
The right batch size is important when training language models at scale: a large batch size is necessary for fast training, but a batch size that is too large will harm token efficiency. To navigate this tradeoff, McCandlish et al. (2018) suggest that a critical batch size (CBS), below which training will not substantially degrade loss, can be estimated based on the gradient noise scale during training. While their method has been adopted in practice, e.g., when training GPT-3, strong assumptions are required to justify gradient noise as a proxy for the CBS, which makes it unclear whether their approach should be trusted in practice, limiting its applicability. In this paper, we introduce a simple, empirical approach to directly measure the CBS and show how the CBS evolves over training. Applying our approach to the OLMo models, we find that CBS is near 0 at initialization, increases…
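The abstract's definition of the CBS (the batch size below which loss is not substantially degraded) suggests one simple way to read it off from measurements. The sketch below assumes losses have already been measured for several batch sizes at the same token budget, and returns the largest batch size whose loss stays within a tolerance of the loss at the smallest batch size tested. The function name `estimate_cbs`, the `tolerance` parameter, and the example numbers are hypothetical; this illustrates the definition and is not the authors' measurement procedure.

```python
def estimate_cbs(loss_by_batch_size, tolerance):
    """Pick the largest batch size that does not noticeably hurt loss.

    `loss_by_batch_size` maps batch size -> loss measured at a fixed token
    budget. The smallest batch size is treated as the baseline. Batch sizes
    are scanned in increasing order, and the scan stops at the first one
    whose loss exceeds the baseline by more than `tolerance`.
    """
    sizes = sorted(loss_by_batch_size)
    baseline = loss_by_batch_size[sizes[0]]
    cbs = sizes[0]
    for size in sizes[1:]:
        if loss_by_batch_size[size] - baseline > tolerance:
            break
        cbs = size
    return cbs


if __name__ == "__main__":
    # Made-up losses for illustration only.
    measured = {64: 3.00, 128: 3.00, 256: 3.01, 512: 3.05, 1024: 3.20}
    print(estimate_cbs(measured, tolerance=0.02))
```

Stopping at the first degraded batch size, rather than taking the maximum over all acceptable ones, keeps a single noisy measurement at a very large batch size from inflating the estimate.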
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Linear Layer · Cosine Annealing · Multi-Head Attention · Byte Pair Encoding · Attention Is All You Need · Dropout · Residual Connection
