Step by Step: Adaptive Gradient Descent for Training L-Lipschitz Neural Networks
Kyle Sung, Kholood Khalil, Noah Forman, Steven Samu, Anastasis Kratsios

TL;DR
This paper shows that eventually decaying the learning rate in gradient descent yields neural networks with a small Lipschitz constant without slowing convergence, and that standard GD may inherently produce regular models.
Contribution
It introduces a theoretical framework linking learning rate decay to Lipschitz regularity and generalization, supported by empirical validation.
Findings
Decaying learning rate ensures high Lipschitz regularity.
Training with decay maintains convergence rate.
Constant step-size GD yields regularity similar to that obtained with a decaying LR.
Abstract
We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of the non-convex empirical risk. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR with a sub-linear dependence on its number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results…
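The setup described in the abstract can be illustrated with a minimal sketch: full-batch GD on the MSE loss for a two-layer network, with a learning rate that stays constant and then decays. Everything specific here is an assumption made for illustration and is not taken from the paper: the tanh activation (chosen because it is 1-Lipschitz), the 1/(t - t_decay + 1) decay form, the names `lr_schedule`, `train`, and `lipschitz_upper_bound`, and all hyperparameters. The Lipschitz quantity computed is the standard upper bound ||a||_2 * ||W||_2, not the exact Lipschitz constant of the network.

```python
import numpy as np

def lr_schedule(t, lr0=0.02, t_decay=200):
    """Constant LR before t_decay, then lr0 / (t - t_decay + 1).
    The decay form is an illustrative assumption, not the paper's schedule."""
    if t < t_decay:
        return lr0
    return lr0 / (t - t_decay + 1)

def forward(params, X):
    """Two-layer network f(x) = a . tanh(W x + b); returns outputs and hidden layer."""
    W, b, a = params
    H = np.tanh(X @ W.T + b)          # shape (n, width)
    return H @ a, H

def mse_grads(params, X, y):
    """MSE loss and its gradients with respect to (W, b, a)."""
    W, b, a = params
    n = X.shape[0]
    pred, H = forward(params, X)
    r = pred - y                      # residuals, shape (n,)
    loss = np.mean(r ** 2)
    g_pred = 2.0 * r / n              # d loss / d pred
    g_a = H.T @ g_pred
    g_pre = np.outer(g_pred, a) * (1.0 - H ** 2)   # backprop through tanh
    g_W = g_pre.T @ X
    g_b = g_pre.sum(axis=0)
    return loss, (g_W, g_b, g_a)

def lipschitz_upper_bound(params):
    """Upper bound ||a||_2 * ||W||_2 on the network's Lipschitz constant
    (valid because tanh is 1-Lipschitz)."""
    W, _, a = params
    return np.linalg.norm(a) * np.linalg.norm(W, 2)

def train(X, y, width=16, steps=500, lr0=0.02, t_decay=200, seed=0):
    """Full-batch gradient descent with the schedule above."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    params = (rng.normal(size=(width, d)) / np.sqrt(d),
              np.zeros(width),
              rng.normal(size=width) / np.sqrt(width))
    losses = []
    for t in range(steps):
        loss, grads = mse_grads(params, X, y)
        losses.append(loss)
        eta = lr_schedule(t, lr0, t_decay)
        params = tuple(p - eta * g for p, g in zip(params, grads))
    return params, losses

# Usage on synthetic data (hypothetical target function).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(64, 2))
y = np.sin(X.sum(axis=1))
params, losses = train(X, y)
print("final MSE:", losses[-1])
print("Lipschitz upper bound:", lipschitz_upper_bound(params))
```

Tracking `lipschitz_upper_bound` across training runs with and without the decay phase is one way to probe the regularity claim empirically, keeping in mind that the bound can be loose.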
Taxonomy
Topics: Neural Networks and Applications
