On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling
Moritz Haas, Sebastian Bordt, Ulrike von Luxburg, Leena Chennuru Vankadara

TL;DR
This paper shows that large learning rates can remain effective for neural networks trained under standard width scaling because training enters a controlled divergence regime. This challenges the predictions of existing infinite-width theory and offers new insight into optimal training dynamics.
Contribution
The study gives a finer-grained analysis of the regime previously considered unstable, identifies a controlled divergence regime in which features still evolve at large learning rates, and demonstrates its relevance across various architectures and datasets.
Findings
Under cross-entropy (CE) loss, neural networks operate in a controlled divergence regime.
Width-scaling analysis helps predict the exponents of the maximal stable learning rate.
Layerwise learning rate scaling has limitations explained by the new analysis.
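To make "learning rate exponent" concrete: if the maximal stable learning rate is assumed to follow a power law in width, eta_max ≈ c * width^alpha, then alpha is the slope of a straight-line fit in log-log space. The following is a minimal sketch of that fit, not the authors' procedure. The function name `fit_lr_exponent` is hypothetical, and the widths and learning rates are synthetic placeholders, not results from the paper.

```python
import numpy as np

def fit_lr_exponent(widths, max_stable_lrs):
    """Fit eta_max ~ c * width**alpha by least squares in log-log space.

    Returns (alpha, c). Assumes a pure power law, which is only an
    approximation for real measurements.
    """
    log_n = np.log(np.asarray(widths, dtype=float))
    log_eta = np.log(np.asarray(max_stable_lrs, dtype=float))
    # polyfit with deg=1 returns [slope, intercept]
    alpha, log_c = np.polyfit(log_n, log_eta, deg=1)
    return alpha, np.exp(log_c)

# Synthetic data generated from an exact power law with exponent -0.5.
# These values are made up for illustration only.
widths = [256, 512, 1024, 2048, 4096]
synthetic_lrs = [0.1 * (n / 256) ** -0.5 for n in widths]

alpha, c = fit_lr_exponent(widths, synthetic_lrs)
print(f"fitted exponent alpha = {alpha:.3f}")
```

In practice the inputs would come from a learning rate sweep at each width, and the fitted slope would then be compared against the exponent a given theory predicts.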
Abstract
Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in standard parameterization (SP), meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay more slowly than theoretically predicted, and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore…
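The abstract defines standard parameterization (SP) as He initialization combined with a single global learning rate. A minimal sketch of that setup is below, assuming a plain fully connected network without biases and vanilla SGD. The helper names `he_init` and `sgd_step` are hypothetical and are not taken from the paper.

```python
import numpy as np

def he_init(layer_sizes, seed=0):
    """He initialization: each weight ~ N(0, 2 / fan_in)."""
    rng = np.random.default_rng(seed)
    weights = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = np.sqrt(2.0 / fan_in)
        weights.append(rng.normal(0.0, std, size=(fan_in, fan_out)))
    return weights

def sgd_step(weights, grads, lr):
    """One SGD update using the same (global) learning rate for every layer."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Example: a network with input 512, hidden width 256, and 10 outputs.
weights = he_init([512, 256, 10])
print([w.shape for w in weights])
```

The point of the sketch is the contrast with layerwise schemes: here `lr` is one scalar shared by all layers, whereas a layerwise parameterization would pass a separate learning rate per weight matrix.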
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Domain Adaptation and Few-Shot Learning
