Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan

arXiv:2406.02157 · stat.ML · September 6, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper investigates how the batch size affects the training time of two-layer neural networks trained with one-pass SGD, identifies the optimal batch size as a function of the target's information exponent, and introduces a training protocol, Correlation loss SGD, that overcomes the resulting limitation. The results are supported by theoretical and experimental validation.

Contribution

It characterizes the optimal batch size for minimizing training time based on information exponents and proposes Correlation loss SGD to improve time complexity beyond traditional limits.
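
For readers unfamiliar with the term: for a single-index target $y = g(\langle w^\star, x \rangle)$ with standard Gaussian inputs, the information exponent is usually defined as the index of the first non-zero Hermite coefficient of $g$ beyond the constant term. That definition is not spelled out on this page, so the sketch below treats it as an assumption. It handles polynomial link functions only, and all function names are hypothetical.

```python
def gaussian_moment(m):
    """E[z^m] for z ~ N(0, 1): zero for odd m, (m - 1)!! for even m."""
    if m % 2 == 1:
        return 0
    result = 1
    for k in range(m - 1, 0, -2):
        result *= k
    return result


def hermite_polys(max_degree):
    """Probabilists' Hermite polynomials He_0..He_max_degree.

    Each polynomial is a list of integer coefficients, lowest degree first.
    Uses the recurrence He_{k+1}(z) = z * He_k(z) - k * He_{k-1}(z).
    """
    polys = [[1], [0, 1]]
    for k in range(1, max_degree):
        prev, cur = polys[k - 1], polys[k]
        nxt = [0] + cur                  # z * He_k
        for i, c in enumerate(prev):
            nxt[i] -= k * c              # minus k * He_{k-1}
        polys.append(nxt)
    return polys[: max_degree + 1]


def hermite_coefficient(link, he_k):
    """E[link(z) * He_k(z)] for a polynomial link, computed exactly."""
    return sum(
        a * b * gaussian_moment(i + j)
        for i, a in enumerate(link)
        for j, b in enumerate(he_k)
    )


def information_exponent(link):
    """Smallest k >= 1 with a non-zero Hermite coefficient, or None."""
    degree = len(link) - 1
    for k, he_k in enumerate(hermite_polys(degree)):
        if k >= 1 and hermite_coefficient(link, he_k) != 0:
            return k
    return None


# Link functions as coefficient lists, lowest degree first.
linear = [0, 1]            # g(z) = z
square = [0, 0, 1]         # g(z) = z^2
he3 = [0, -3, 0, 1]        # g(z) = z^3 - 3z
exponents = [information_exponent(g) for g in (linear, square, he3)]
```

Under this definition a linear link has exponent 1, a quadratic link has exponent 2, and the third Hermite polynomial has exponent 3; higher exponents correspond to harder targets in the sense used on this page.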

Findings

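01. The optimal batch size scales as $d^{\ell/2}$, where $d$ is the input dimension and $\ell$ is the information exponent of the target.

02. Batch sizes beyond this threshold, $n_{b} \gg d^{\ell/2}$, are detrimental to further improvements in training time.

03. Correlation loss SGD reduces auto-correlation and improves the time complexity beyond this limit.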
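
To make the "auto-correlation" wording concrete, here is a minimal sketch for a single tanh neuron $f(x) = \tanh(\langle w, x \rangle)$. It assumes that the correlation loss is $-y f(x)$, that is, the squared loss $\frac{1}{2}(y - f(x))^2$ with the $\frac{1}{2} f(x)^2$ term dropped. This reading, the choice of activation, and the function names are assumptions for illustration, not taken from the paper's code.

```python
import math


def neuron_and_grad(w, x):
    """Return f(x) = tanh(<w, x>) and its gradient with respect to w."""
    pre = sum(wi * xi for wi, xi in zip(w, x))
    out = math.tanh(pre)
    grad = [(1.0 - out * out) * xi for xi in x]
    return out, grad


def squared_loss_grad(w, x, y):
    """Gradient of 0.5 * (y - f(x))^2 with respect to w."""
    f, df = neuron_and_grad(w, x)
    return [(f - y) * g for g in df]


def correlation_loss_grad(w, x, y):
    """Gradient of -y * f(x) with respect to w (assumed correlation loss)."""
    _, df = neuron_and_grad(w, x)
    return [-y * g for g in df]


def auto_correlation_grad(w, x):
    """Gradient of 0.5 * f(x)^2: the term the correlation loss drops."""
    f, df = neuron_and_grad(w, x)
    return [f * g for g in df]


w, x, y = [0.3, -0.2], [1.0, 2.0], 0.5
full = squared_loss_grad(w, x, y)
corr = correlation_loss_grad(w, x, y)
auto = auto_correlation_grad(w, x)
# The squared-loss gradient splits into the two pieces above.
gap = max(abs(s - (c + a)) for s, c, a in zip(full, corr, auto))
```

The decomposition shows that the two losses differ only by the term that depends on the network output alone, which is the term the finding above refers to.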

Abstract

We study the impact of the batch size $n_{b}$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_{b} \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, batch sizes $n_{b} \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss…
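
The trade-off stated in the abstract can be illustrated with a little arithmetic. In one-pass SGD every sample is used once, so the total sample count is $n = T \cdot n_{b}$; if $n$ is unchanged for $n_{b} \lesssim d^{\ell/2}$, the number of steps falls as $T = n / n_{b}$. The sketch below additionally assumes the classical single-sample budget $n \sim d^{\ell - 1}$ for $\ell \geq 3$ (constants and logarithmic factors ignored), which is background knowledge rather than something stated on this page. Function names are hypothetical.

```python
import math


def largest_useful_batch(d, ell):
    """Floor of d^(ell / 2), the batch-size threshold from the abstract."""
    return math.isqrt(d ** ell)


def assumed_sample_budget(d, ell):
    """Assumed total sample complexity d^(ell - 1), valid for ell >= 3."""
    if ell < 3:
        raise ValueError("this scaling is only assumed for ell >= 3")
    return d ** (ell - 1)


def sgd_steps(d, ell, n_b):
    """Number of one-pass SGD steps T = n / n_b at fixed sample budget."""
    if n_b < 1 or n_b > largest_useful_batch(d, ell):
        # Beyond the threshold the abstract only says larger batches are
        # detrimental; that regime is deliberately not modelled here.
        raise ValueError("n_b must lie in [1, d^(ell/2)]")
    n = assumed_sample_budget(d, ell)
    return -(-n // n_b)  # integer ceiling of n / n_b


d, ell = 100, 4
threshold = largest_useful_batch(d, ell)      # 10_000
steps_single = sgd_steps(d, ell, 1)           # 1_000_000
steps_batched = sgd_steps(d, ell, threshold)  # 100
```

With $d = 100$ and $\ell = 4$, moving from single-sample updates to the threshold batch size cuts the step count from $10^{6}$ to $10^{2} = d^{\ell/2 - 1}$ while the sample budget stays at $10^{6}$.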

Code & Models

Repositories

IdePHICS/batch-size-time-complexity-tradeoffs
Official

Taxonomy

Topics: Computability, Logic, AI Algorithms · Online Learning and Analytics

Methods: Stochastic Gradient Descent