Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu

TL;DR
This paper shows theoretically that Batch Normalization (BN) lets gradient descent succeed with less learning-rate tuning: with a fixed learning rate on scale-invariant parameters, gradient descent approaches a stationary point at rate T^{-1/2}, and stochastic gradient descent does so at rate T^{-1/4}.
Contribution
It offers the first theoretical support for BN's conjectured property of reducing the need for learning rate tuning in gradient-based optimization.
Findings
Gradient descent with a fixed learning rate on scale-invariant parameters approaches a stationary point at rate T^{-1/2} in T iterations.
Stochastic gradient descent converges at rate T^{-1/4}.
Both rates support the conjecture that BN reduces the need for learning-rate tuning.
Abstract
Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of T^{-1/2} in T iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate T^{-1/4} is also shown for stochastic gradient descent.
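To make "scale-invariant parameters" concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's construction or experiments: the toy loss `loss(w) = 1 - U·w/||w||`, the target direction `U`, the starting point, and the helper names are all assumptions chosen for this example. Because the loss depends only on the direction of `w`, its gradient is orthogonal to `w` and shrinks as 1/||w||, so each fixed-step update satisfies ||w_{t+1}||^2 = ||w_t||^2 + lr^2 ||grad||^2. The norm can only grow, and the step taken on the direction, roughly lr/||w||^2, shrinks by itself. This is one common intuition for why a fixed learning rate such as 0.3 can still work.

```python
import math

U = (1.0, 0.0, 0.0)  # hypothetical unit-norm target direction (assumption)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(w):
    return math.sqrt(dot(w, w))

def loss(w):
    """Toy scale-invariant loss: depends only on the direction w / ||w||."""
    return 1.0 - dot(U, w) / norm(w)

def grad(w):
    """Analytic gradient of `loss`: orthogonal to w, magnitude scales as 1/||w||."""
    n = norm(w)
    w_hat = [x / n for x in w]
    c = dot(U, w_hat)
    return [-(u - c * x) / n for u, x in zip(U, w_hat)]

def run_gd(scale, lr=0.3, steps=200):
    """Plain gradient descent with a fixed learning rate from a rescaled start."""
    w = [scale * x / 3.0 for x in (1.0, 2.0, 2.0)]  # initial norm equals `scale`
    norms = [norm(w)]
    for _ in range(steps):
        g = grad(w)
        w = [x - lr * gx for x, gx in zip(w, g)]
        norms.append(norm(w))
    return w, norms

if __name__ == "__main__":
    # Same learning rate, three very different initial scales.
    for scale in (0.1, 1.0, 10.0):
        w, norms = run_gd(scale)
        print(f"scale={scale:5.1f}  final loss={loss(w):.3e}  "
              f"|grad|={norm(grad(w)):.3e}  "
              f"lr/||w||^2={0.3 / norms[-1] ** 2:.3e}")
```

Running the script prints the final loss, gradient norm, and the quantity lr/||w||^2 for each starting scale. The sketch says nothing about the T^{-1/2} or T^{-1/4} rates themselves; it only shows the norm-growth mechanism on a three-dimensional toy problem.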
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Advanced Neural Network Applications
