AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods

Tim Tsz-Kit Lau; Han Liu; Mladen Kolar

arXiv:2402.11215 · cs.LG · May 29, 2024 · 3 cites

TL;DR

This paper introduces AdAdaGrad, an adaptive batch size scheme for adaptive gradient methods. It adjusts the batch size dynamically during training to improve training efficiency and generalization in large-scale deep learning.

Contribution

We propose AdAdaGrad and AdAdaGradNorm, novel adaptive batch size strategies integrated with AdaGrad, with proven convergence and demonstrated benefits in training efficiency and model generalization.

Findings

01. AdAdaGrad converges at a rate of O(1/K) for nonconvex functions, where K is the number of iterations.

02. Adaptive batch sizing improves training efficiency.

03. Experimental results show enhanced generalization performance.

Abstract

The choice of batch sizes in minibatch stochastic gradient optimizers is critical in large-scale model training for both optimization and generalization performance. Although large-batch training is arguably the dominant training paradigm for large-scale deep learning due to hardware advances, the generalization performance of the model deteriorates compared to small-batch training, leading to the so-called "generalization gap" phenomenon. To mitigate this, we investigate adaptive batch size strategies derived from adaptive sampling methods, originally developed only for stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which…
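The abstract is cut off before it describes the scheme, so the following is only an illustrative sketch of how an adaptive batch size rule might be combined with an AdaGradNorm-style update. It is not the authors' algorithm. It assumes a norm test from the adaptive sampling literature (grow the batch when the sample variance of the per-sample gradients is large relative to the squared norm of the minibatch gradient) and the standard scalar AdaGradNorm step. The helper names, the parameter `eta`, and the toy least-squares problem are all hypothetical.

```python
import numpy as np


def norm_test_batch_size(per_sample_grads, eta):
    """Batch size suggested by a norm test (assumed criterion, hypothetical helper).

    per_sample_grads: array of shape (b, d), one gradient per sample, b >= 2.
    Returns b if the test passes, otherwise a larger suggested batch size.
    """
    b = per_sample_grads.shape[0]
    g = per_sample_grads.mean(axis=0)
    g_norm_sq = float(g @ g)
    # Unbiased per-coordinate sample variance, summed over coordinates.
    var = float(per_sample_grads.var(axis=0, ddof=1).sum())
    # Test passes when var / b <= eta^2 * ||g||^2 (also skip if g is exactly 0).
    if g_norm_sq == 0.0 or var / b <= eta**2 * g_norm_sq:
        return b
    return int(np.ceil(var / (eta**2 * g_norm_sq)))


def adagrad_norm_step(x, g, acc, lr):
    """One scalar AdaGradNorm step: accumulate ||g||^2, scale the step by 1/sqrt(acc)."""
    acc = acc + float(g @ g)
    return x - lr * g / np.sqrt(acc), acc


def run(num_steps=200, batch_size=8, eta=0.5, lr=1.0, n=2048, d=5, seed=0):
    """Toy least-squares run; returns the iterate and the batch size after each step."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    x, acc, sizes = np.zeros(d), 1e-8, []
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = A[idx] @ x - y[idx]
        # Per-sample gradients of 0.5 * (a_i^T x - y_i)^2.
        per_sample = residual[:, None] * A[idx]
        x, acc = adagrad_norm_step(x, per_sample.mean(axis=0), acc, lr)
        # Never shrink the batch, and cap it at the dataset size.
        suggested = norm_test_batch_size(per_sample, eta)
        batch_size = min(n, max(batch_size, suggested))
        sizes.append(batch_size)
    return x, sizes


if __name__ == "__main__":
    x, sizes = run()
    print("first and last batch sizes:", sizes[0], sizes[-1])
```

In this sketch the batch size schedule is non-decreasing by construction (the `max` in `run`), which is a design choice of the example rather than something stated in the text above. A smaller `eta` makes the test stricter and tends to grow the batch sooner.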

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Advanced Bandit Algorithms Research

Methods: AdaGrad