Adaptive Batch Size and Learning Rate Scheduler for Stochastic Gradient Descent Based on Minimization of Stochastic First-order Oracle Complexity

Hikaru Umeda; Hideaki Iiduka

arXiv:2508.05302·cs.LG·August 8, 2025


TL;DR

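This paper proposes an adaptive scheduling strategy for mini-batch SGD that adjusts the batch size and learning rate during training according to the observed decay of the full gradient norm. The strategy is guided by theory on the critical batch size, and in the reported experiments it converged faster than existing schedulers.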

Contribution

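The paper introduces an adaptive joint scheduler for the batch size and learning rate of SGD. The scheduler builds on the critical batch size, the batch size that minimizes stochastic first-order oracle (SFO) complexity, to improve convergence speed.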

Findings

01 In the reported experiments, the adaptive joint scheduler converged faster than existing schedulers.

02 Adjusting the batch size and learning rate according to the observed decay in the full gradient norm improves training efficiency.

03 The strategy is grounded in theoretical results on the critical batch size that minimizes SFO complexity.

Abstract

The convergence behavior of mini-batch stochastic gradient descent (SGD) is highly sensitive to the batch size and learning rate settings. Recent theoretical studies have identified the existence of a critical batch size that minimizes stochastic first-order oracle (SFO) complexity, defined as the expected number of gradient evaluations required to reach a stationary point of the empirical loss function in a deep neural network. An adaptive scheduling strategy that leverages theoretical findings on the critical batch size is introduced to accelerate SGD. The batch size and learning rate are adjusted on the basis of the observed decay in the full gradient norm during training. Experiments using an adaptive joint scheduler based on this strategy demonstrated improved convergence speed compared with that of existing schedulers.
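To make the idea in the abstract concrete, here is a minimal sketch of a joint scheduler driven by the full gradient norm, written for a toy least-squares problem. It is not the authors' algorithm. The abstract does not state the update rule, so everything specific here is an assumption made for illustration: the function names `full_gradient` and `adaptive_sgd`, the growth factors, the threshold-halving rule, the learning rate cap, and the choice to increase both the batch size and the learning rate when the gradient norm falls below the threshold. The sketch also counts mini-batch gradient evaluations as a stand-in for SFO complexity, assuming each sampled example costs one oracle call.

```python
import numpy as np


def full_gradient(w, X, y):
    """Gradient of the loss (1/2n) * ||Xw - y||^2 over the whole dataset."""
    return X.T @ (X @ w - y) / len(y)


def adaptive_sgd(X, y, batch_size=8, lr=0.05, threshold=1.0, eps=1e-3,
                 batch_growth=2.0, lr_growth=1.2, max_lr=0.5,
                 threshold_decay=0.5, check_every=20, max_steps=2000, seed=0):
    """Mini-batch SGD with a hypothetical joint batch-size/learning-rate rule.

    Every `check_every` steps the full gradient norm is measured. If it is
    at most `eps`, training stops. If it is below `threshold`, the batch size
    and learning rate are both increased and the threshold is shrunk.
    All of these rules are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    sfo_calls = 0  # number of per-example gradient evaluations so far
    history = [(0, batch_size, lr)]  # (step, batch size, learning rate)

    for step in range(1, max_steps + 1):
        # One mini-batch SGD step.
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad
        sfo_calls += len(idx)

        if step % check_every == 0:
            gnorm = np.linalg.norm(full_gradient(w, X, y))
            if gnorm <= eps:
                break  # approximate stationary point reached
            if gnorm < threshold:
                batch_size = min(int(batch_size * batch_growth), n)
                lr = min(lr * lr_growth, max_lr)
                threshold *= threshold_decay
                history.append((step, batch_size, lr))

    return w, sfo_calls, history


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    X = data_rng.normal(size=(512, 5))
    y = X @ data_rng.normal(size=5)  # noise-free targets
    w, sfo_calls, history = adaptive_sgd(X, y)
    print("gradient evaluations used:", sfo_calls)
    print("schedule (step, batch size, lr):", history)
```

Computing the full gradient at every check is cheap on this toy problem but costly for a deep network, so the check interval matters in practice. The full-gradient evaluations are not included in `sfo_calls` here, which is a simplification of this sketch.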
