Fast Catch-Up, Late Switching: Optimal Batch Size Scheduling via Functional Scaling Laws

Jinbo Wang; Binghui Li; Zhanpeng Zhou; Mingze Wang; Yuxuan Sun; Jiaqi Zhang; Xunliang Cai; Lei Wu

arXiv:2602.14208 · cs.LG · February 24, 2026


Open Access 3 Reviews

TL;DR

This paper applies the functional scaling law (FSL) framework to batch size scheduling in large-scale deep learning. It shows that the optimal schedule depends on task difficulty and that, for hard tasks, switching to large batches only late in training is optimal.

Contribution

It gives a principled FSL-based analysis of batch size scheduling, characterizes the optimal schedule under a fixed data budget as a function of task difficulty, and identifies the fast catch-up effect as the mechanism behind late switching, which also appears in LLM pretraining experiments.

Findings

01

Late switching improves training efficiency across models.

02

Large batches can be deferred to late training without performance loss.

03

The fast catch-up effect explains the late-switching advantage.

Abstract

Batch size scheduling (BSS) plays a critical role in large-scale deep learning training, influencing both optimization dynamics and computational efficiency. Yet, its theoretical foundations remain poorly understood. In this work, we show that the functional scaling law (FSL) framework introduced in Li et al. (2025a) provides a principled lens for analyzing BSS. Specifically, we characterize the optimal BSS under a fixed data budget and show that its structure depends sharply on task difficulty. For easy tasks, optimal schedules keep increasing batch size throughout. In contrast, for hard tasks, the optimal schedule maintains small batch sizes for most of training and switches to large batches only in a late stage. To explain the emergence of late switching, we uncover a dynamical mechanism -- the fast catch-up effect -- which also manifests in large language model (LLM) pretraining.…
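The hard-task schedule described in the abstract (small batches for most of training, large batches only in a late stage, all under a fixed data budget) can be illustrated with a minimal sketch. The function name `late_switch_schedule`, its parameters, and the numbers in the usage line are hypothetical and chosen for illustration; they are not taken from the paper, and the sketch does not compute the optimal switch point.

```python
def late_switch_schedule(total_samples, small_batch, large_batch, switch_fraction):
    """Return per-step batch sizes: small batches first, large batches after the switch.

    switch_fraction is the share of the data budget spent at the small batch size
    (a hypothetical knob for this sketch, not a quantity defined in the paper).
    """
    if not 0.0 <= switch_fraction <= 1.0:
        raise ValueError("switch_fraction must lie in [0, 1]")
    # Samples consumed before the switch, rounded down to whole small-batch steps.
    small_steps = int(total_samples * switch_fraction) // small_batch
    remaining = total_samples - small_steps * small_batch
    # Whole large-batch steps that fit in the remaining budget; any leftover is dropped.
    large_steps = remaining // large_batch
    return [small_batch] * small_steps + [large_batch] * large_steps


# Illustrative numbers only: 1200 samples, switch after 75% of the budget.
schedule = late_switch_schedule(1200, 10, 100, 0.75)
print(len(schedule), sum(schedule))
```

With these made-up numbers the schedule has 90 small-batch steps followed by 3 large-batch steps, and the same 1200 samples are consumed in 93 optimizer steps instead of the 120 a constant batch size of 10 would need.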

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

See below.

Weaknesses

See below.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Clear empirical phenomena distilled. The paper cleanly isolates and names two behaviors (“sudden drop,” “final merge”) and shows them across architectures/scales, which aids practitioner understanding.
2. Theory that matches practice. The FSL-based analysis explains both phenomena and yields a concrete later-switch rule and a power-law prediction for the optimal switch point, predictions borne out in experiments (including a strong log–log fit).
3. Breadth of evidence. Results span dense (LLa

Weaknesses

1. Theory and most experiments assume constant LR, while real LLM training typically uses warmup + cosine/linear decay.
2. Large-scale runs use a private dataset, which limits external reproducibility.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper studies batch size scheduling from a theoretical perspective of a power-law regression model and provides insights related to ‘sudden drop’ and ‘final merge’ of the loss values.
2. It proposes a scaling law for the optimal switching time from small to large batch in a training run and also empirically verifies that practical settings obey a scaling law.
3. It also proposes an optimal batch size scheduling algorithm for the power-law model.
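The reviews mention a power-law scaling of the optimal switch point, checked with a log-log fit. A generic way to run such a check is ordinary least squares on the logarithms, sketched below. The excerpts here do not say which quantity the switch point scales with, so the variable names `budgets` and `switch_points` and the synthetic data are assumptions for illustration, not the paper's measurements or its fitting procedure.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**alpha by least squares on (log x, log y); return (c, alpha)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the power-law exponent.
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    alpha = sxy / sxx
    # The intercept gives log(c).
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha


# Synthetic data generated from an exact power law (illustrative only).
budgets = [1e2, 1e4, 1e6]
switch_points = [2.0 * b ** 0.5 for b in budgets]
c, alpha = fit_power_law(budgets, switch_points)
print(c, alpha)
```

On real measurements the points will scatter around the line, and the quality of the fit in log-log space is what indicates whether a power law is a reasonable description.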

Weaknesses

1. The paper only studies a constant learning rate schedule, which deviates from practice.
2. Although the paper proposes an optimal batch size scheduling algorithm as a power law, it provides no way of actually developing a practical optimal scheduling algorithm.
3. I don't think Lemma 3 holds for any arbitrary $\theta$, but only for local minimizers as the expected gradient has to be zero for this to hold.


Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques