How to Set the Batch Size for Large-Scale Pre-training?

Yunhua Zhou; Junhao Huang; Shuhao Xing; Yechen Zhang; Runyu Peng; Qiping Guo; Xipeng Qiu

arXiv:2601.05034·cs.AI·January 12, 2026

TL;DR

This paper revises the theoretical understanding of batch size in large-scale pre-training under the WSD scheduler, proposing a dynamic batch size strategy that improves training efficiency and model quality.

Contribution

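The paper derives a revised E(S) relationship for the WSD scheduler, uses it to identify B_min (the minimum batch size needed to reach a target loss) and B_opt (the batch size that minimizes total tokens consumed), and proposes a dynamic batch size scheduler for large-scale pre-training based on these two quantities.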

Findings

01. Revised formula accurately models pre-training dynamics.

02. Dynamic scheduler improves training efficiency.

03. Enhanced final model quality.

Abstract

The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored to the WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch…
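For background, the original Critical Batch Size framework that the abstract refers to can be sketched in a few lines. This is the classic relation (S/S_min - 1)(E/E_min - 1) = 1 from OpenAI's large-batch training work, not the revised WSD relationship derived in this paper, which the truncated abstract above does not state. The function names and the example values of E_min and S_min are illustrative assumptions.

```python
def critical_batch_size(e_min: float, s_min: float) -> float:
    """Classic critical batch size B_crit = E_min / S_min."""
    return e_min / s_min


def classic_tradeoff(batch_size: float, e_min: float, s_min: float) -> tuple[float, float]:
    """Return (steps S, data consumed E) for a fixed batch size.

    Assumes the classic relation (S/S_min - 1) * (E/E_min - 1) = 1,
    which is the pre-WSD framework, not this paper's revised formula.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    b_crit = critical_batch_size(e_min, s_min)
    steps = s_min * (1.0 + b_crit / batch_size)   # fewer steps as B grows
    data = e_min * (1.0 + batch_size / b_crit)    # more data as B grows
    return steps, data


if __name__ == "__main__":
    # Hypothetical constants chosen only for illustration.
    E_MIN, S_MIN = 1e6, 1e3
    for b in (250.0, 1000.0, 4000.0):
        s, e = classic_tradeoff(b, E_MIN, S_MIN)
        print(f"B={b:g}  S={s:g}  E={e:g}")
```

In this classic form, data consumption E increases monotonically with batch size, so the relation on its own yields neither a minimum batch size threshold nor a finite data-optimal batch size. The abstract reports that both B_min and B_opt emerge from the revised relationship under WSD.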

Taxonomy

Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Data Stream Mining Techniques