On the Surprising Effectiveness of Large Learning Rates under Standard Width Scaling

Moritz Haas; Sebastian Bordt; Ulrike von Luxburg; Leena Chennuru Vankadara

arXiv:2505.22491·cs.LG·October 28, 2025

TL;DR

This paper shows that large learning rates can remain effective for networks trained in standard parameterization (SP) as width grows, because training enters a controlled divergence regime. This challenges the predictions of existing infinite-width theory and offers new insight into optimal training dynamics.

Contribution

The study introduces a refined analysis of unstable regimes, identifying a controlled divergence regime where features evolve at large learning rates, and demonstrates its relevance across various architectures and datasets.

Findings

01

Under cross-entropy (CE) loss, neural networks operate in a controlled divergence regime.

02

Width-scaling helps predict maximal stable learning rate exponents.

03

Layerwise learning rate scaling has limitations explained by the new analysis.

Abstract

Scaling limits, such as infinite-width limits, serve as promising theoretical tools to study large-scale models. However, it is widely believed that existing infinite-width theory does not faithfully explain the behavior of practical networks, especially those trained in standard parameterization (SP), meaning He initialization with a global learning rate. For instance, existing theory for SP predicts instability at large learning rates and vanishing feature learning at stable ones. In practice, however, optimal learning rates decay more slowly than theoretically predicted, and networks exhibit both stable training and non-trivial feature learning, even at very large widths. Here, we show that this discrepancy is not fully explained by finite-width phenomena. Instead, we find a resolution through a finer-grained analysis of the regime previously considered unstable and therefore…
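To make the abstract's definition of SP concrete, here is a minimal NumPy sketch of "He initialization with a global learning rate" for a bias-free ReLU MLP. The function names (`init_sp_mlp`, `forward`, `sgd_step`), the choice of plain SGD, and the use of variance 2 / fan_in for every layer including the output are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def init_sp_mlp(widths, seed=0):
    """He-initialize a bias-free MLP: each W has entries ~ N(0, 2 / fan_in).

    Assumption: the same He variance is used for every layer, output included.
    """
    rng = np.random.default_rng(seed)
    params = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        std = np.sqrt(2.0 / fan_in)
        params.append(rng.normal(0.0, std, size=(fan_in, fan_out)))
    return params

def forward(params, x):
    """ReLU on hidden layers, linear output layer."""
    h = x
    for W in params[:-1]:
        h = np.maximum(h @ W, 0.0)
    return h @ params[-1]

def sgd_step(params, grads, lr):
    """One SGD update with a single global learning rate.

    Every layer uses the same lr; there is no layerwise width-dependent scaling.
    """
    return [W - lr * g for W, g in zip(params, grads)]
```

In this sketch the only width dependence enters through the initialization variance. How the single `lr` should scale with width is the question the paper studies, and the sketch deliberately leaves it as a free argument.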

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Domain Adaptation and Few-Shot Learning