Rethinking Neural Network Learning Rates: A Stackelberg Perspective
Sihan Zeng, Sujay Bhatt, Sumitra Ganesh

TL;DR
This paper offers a Stackelberg optimization perspective on neural network training, revealing how non-uniform learning rates can accelerate convergence and improve performance by leveraging problem structure and curvature differences.
Contribution
It introduces a Stackelberg reformulation of neural network training, providing convergence guarantees and explaining when and why layer-specific learning rates are beneficial.
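As a rough illustration of what such a reformulation might look like, the generic leader-follower (bilevel) form is written out below. The symbols are assumptions introduced here and are not taken from the paper: \theta stands for the body-layer weights, w for the final-layer weights, \ell for the training loss, and \alpha, \beta for the two step sizes.

```latex
% Generic Stackelberg (bilevel) form: the body weights \theta lead,
% the final-layer weights w follow by best-responding to \theta.
\min_{\theta} \; \ell\bigl(\theta, w^{*}(\theta)\bigr)
\quad \text{subject to} \quad
w^{*}(\theta) \in \arg\min_{w} \; \ell(\theta, w)

% Two-time-scale alternating gradient descent with step sizes \alpha \ll \beta:
w_{k+1} = w_k - \beta \, \nabla_{w} \ell(\theta_k, w_k), \qquad
\theta_{k+1} = \theta_k - \alpha \, \nabla_{\theta} \ell(\theta_k, w_{k+1})
```

Under this reading, the larger step size \beta lets the final layer track its best response while the body moves slowly. The paper's exact formulation, constraints, and assumptions may differ.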
Findings
Non-uniform learning rates can induce a stronger optimization structure.
The Stackelberg objective exhibits sharper local curvature early in training.
Experiments confirm improved training speed and performance.
Abstract
Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond…
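The alternating scheme described in the abstract can be sketched on a toy problem. The block below is a minimal illustration and not the authors' implementation: the one-parameter "body" `theta`, the one-parameter "head" `w`, the tanh model, the synthetic data, and the step sizes are all hypothetical choices made for this example.

```python
import math

def loss(theta, w, data):
    # Mean squared error of the toy model w * tanh(theta * x).
    return sum((w * math.tanh(theta * x) - y) ** 2 for x, y in data) / len(data)

def grads(theta, w, data):
    # Analytic gradients of the loss with respect to theta (body) and w (head).
    g_theta, g_w = 0.0, 0.0
    for x, y in data:
        h = math.tanh(theta * x)
        r = w * h - y
        g_w += 2.0 * r * h
        g_theta += 2.0 * r * w * (1.0 - h * h) * x
    n = len(data)
    return g_theta / n, g_w / n

def train(data, theta=0.5, w=0.0, lr_body=0.01, lr_head=0.1, steps=200):
    # Alternating gradient descent on two time scales (assumed step sizes):
    # the head takes a large step first, then the body takes a small step
    # using the freshly updated head.
    history = [loss(theta, w, data)]
    for _ in range(steps):
        _, g_w = grads(theta, w, data)
        w -= lr_head * g_w                # fast time scale: final layer
        g_theta, _ = grads(theta, w, data)
        theta -= lr_body * g_theta        # slow time scale: body
        history.append(loss(theta, w, data))
    return theta, w, history

# Synthetic targets generated by a "true" model with theta = 1.5, w = 2.0.
data = [(x / 4.0, 2.0 * math.tanh(1.5 * x / 4.0)) for x in range(-8, 9)]
theta, w, history = train(data)
```

Comparing `history[0]` with `history[-1]` shows how far the loss has dropped. Setting `lr_body` equal to `lr_head` gives a uniform-learning-rate variant to compare against on this toy problem.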
