Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws
Jinbo Wang, Binghui Li, Zhanpeng Zhou, Mingze Wang, Yuxuan Sun, Jiaqi Zhang, Xunliang Cai, Lei Wu

TL;DR
This paper introduces a theoretical framework based on functional scaling laws to optimize batch size schedules in large-scale deep learning, revealing task-dependent strategies and a late-switching phenomenon that improves efficiency.
Contribution
It provides a principled analysis of batch size scheduling using the functional scaling law (FSL) framework, characterizes how the optimal schedule depends on task difficulty, and uncovers the late-switching mechanism, which is validated by extensive experiments.
Findings
Late switching improves training efficiency across models.
Large batches can be deferred to late training without performance loss.
The fast catch-up effect explains the late-switching advantage.
Abstract
Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining.…
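The late-switching schedule described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `two_stage_schedule` and the parameter `switch_fraction` are hypothetical, and the switch point is treated as a free input rather than derived from the FSL analysis. The sketch only shows the bookkeeping of a two-stage schedule under a fixed data budget.

```python
def two_stage_schedule(data_budget, small_batch, large_batch, switch_fraction):
    """Return per-step batch sizes for a small-then-large schedule.

    data_budget     -- total number of training samples available (fixed)
    small_batch     -- batch size used before the switch
    large_batch     -- batch size used after the switch
    switch_fraction -- hypothetical knob: share of the data budget spent at
                       small_batch before switching (not taken from the paper)
    """
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")

    # Samples assigned to the small-batch stage, rounded down to whole steps.
    small_samples = round(data_budget * switch_fraction)
    small_steps = small_samples // small_batch

    # Whatever is left goes to the large-batch stage; samples that do not
    # fill a complete large batch are dropped.
    remaining = data_budget - small_steps * small_batch
    large_steps = remaining // large_batch

    return [small_batch] * small_steps + [large_batch] * large_steps


if __name__ == "__main__":
    budget = 1_000_000
    for frac in (0.2, 0.9):
        sched = two_stage_schedule(budget, 64, 1024, switch_fraction=frac)
        print(f"switch_fraction={frac}: {len(sched)} steps, "
              f"{sum(sched)} samples consumed")
```

For the same data budget, a later switch spends more sequential steps at the small batch size. Whether that is worthwhile depends on task difficulty, which is the question the paper's analysis addresses.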
Peer Reviews
Decision: ICLR 2026 Poster
1. Clear empirical phenomena distilled. The paper cleanly isolates and names two behaviors (“sudden drop,” “final merge”) and shows them across architectures/scales, which aids practitioner understanding. 2. Theory that matches practice. The FSL-based analysis explains both phenomena and yields a concrete later-switch rule and a power-law prediction for the optimal switch point, predictions borne out in experiments (including a strong log–log fit). 3. Breadth of evidence. Results span dense (LLa
1. Theory and most experiments assume constant LR, while real LLM training typically uses warmup + cosine/linear decay. 2. Large-scale runs use a private dataset, which limits external reproducibility.
1. The paper studies batch size scheduling from the theoretical perspective of a power-law regression model and provides insights related to the ‘sudden drop’ and ‘final merge’ of the loss values. 2. It proposes a scaling law for the optimal switching time from small to large batch sizes in a training run and empirically verifies that practical settings obey this scaling law. 3. It also proposes an optimal batch size scheduling algorithm for the power-law model.
1. The paper only studies a constant learning rate schedule, which deviates from practice. 2. Although the paper derives an optimal batch size schedule in power-law form, it provides no way of actually developing a practical optimal scheduling algorithm. 3. I don't think Lemma 3 holds for arbitrary $\theta$; it holds only at local minimizers, since the expected gradient has to be zero for the result to hold.
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
