The Optimization Landscape of SGD Across the Feature Learning Strength
Alexander Atanasov, Alexandru Meterez, James B. Simon, Cengiz Pehlevan

TL;DR
This paper empirically investigates how scaling the feature learning strength parameter $\gamma$ in neural networks affects training dynamics, optimal learning rates, and performance, revealing regimes with distinct behaviors and potential for improved understanding of representation learning.
Contribution
The study provides a comprehensive empirical analysis of the effect of scaling $\gamma$ across models and datasets, and offers theoretical insights into optimal learning rate scaling in different regimes.
Findings
The optimal learning rate scales as $\eta^* \sim \gamma^2$ for small $\gamma$.
Networks with large $\gamma$ exhibit characteristic loss curves with long plateaus and staircase drops.
Large-$\gamma$ regimes often yield better online performance if hyperparameters are properly tuned.
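As a rough illustration of the first finding, the sketch below extrapolates a tuned learning rate from one small value of $\gamma$ to another, assuming the $\gamma^2$ power law holds exactly between them. The function name `extrapolate_lr` and its parameters are hypothetical and are not taken from the paper. The exponent applies only to the small-$\gamma$ regime stated above, so other regimes would need their own exponent.

```python
def extrapolate_lr(eta_ref, gamma_ref, gamma, exponent=2.0):
    """Extrapolate a tuned learning rate under an assumed power law.

    eta_ref   : learning rate found to work well at gamma_ref (hypothetical input)
    gamma_ref : feature learning strength at which eta_ref was tuned
    gamma     : new feature learning strength
    exponent  : assumed scaling exponent; 2.0 matches the small-gamma finding
    """
    if gamma_ref <= 0 or gamma <= 0:
        raise ValueError("gamma values must be positive")
    # eta* ~ gamma^exponent implies eta / eta_ref = (gamma / gamma_ref)^exponent
    return eta_ref * (gamma / gamma_ref) ** exponent


# Halving gamma under the assumed quadratic law quarters the learning rate.
print(extrapolate_lr(eta_ref=0.01, gamma_ref=0.1, gamma=0.05))
```

Such an extrapolation is at best a starting point for a learning rate sweep, since the proportionality constant and the range over which the power law holds depend on the model and dataset.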
Abstract
We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, when and…
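The parameterization described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the two-layer tanh architecture, the $1/\sqrt{\text{fan-in}}$ initialization, and the names `init_params`, `raw_output`, and `scaled_output` are all assumptions. Centering the model by subtracting its output at initialization is also an assumption here, motivated by one review's mention of "the model being centered".

```python
import math
import random


def init_params(d_in, width, seed=0):
    # Hypothetical two-layer MLP; the 1/sqrt(fan_in) Gaussian init is an assumption.
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 1) / math.sqrt(d_in) for _ in range(width)]
          for _ in range(d_in)]
    w2 = [rng.gauss(0, 1) / math.sqrt(width) for _ in range(width)]
    return W1, w2


def raw_output(params, x):
    # Unscaled scalar network output for a single input vector x.
    W1, w2 = params
    hidden = [math.tanh(sum(x[i] * W1[i][j] for i in range(len(x))))
              for j in range(len(w2))]
    return sum(h * w for h, w in zip(hidden, w2))


def scaled_output(params, params0, x, gamma):
    # Final layer down-scaled by gamma. Subtracting the output at the
    # initial parameters params0 centers the model (an assumption here).
    return (raw_output(params, x) - raw_output(params0, x)) / gamma


params0 = init_params(d_in=3, width=8)
x = [0.5, -1.0, 2.0]
print(scaled_output(params0, params0, x, gamma=4.0))  # 0.0 at initialization
```

With this scaling, a larger $\gamma$ means the parameters must move further from initialization to produce an output of a given size, which is one way to read the transition from lazy to rich dynamics described above.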
Peer Reviews
Decision: ICLR 2025 Poster
The paper provides an extensive empirical study of the scaling law relating $\eta$ (learning rate) and $\gamma$ (feature learning strength), and also provides an analytical explanation, using a deep linear network, that matches the empirical results. **Originality** 1. The paper investigates the scaling relationship between feature learning strength $\gamma$ and learning rate $\eta$ in the lazy, rich, and ultra-rich regimes. To my knowledge, I haven't seen work that extensively addresses this topic.
While the paper addresses a significant topic in the feature learning area and provides reasonable empirical and theoretical results, I think the clarity of the paper still has a lot of room for improvement to make it easier to read. Specifically, the authors could consider reorganizing the flow of the paper to focus on the key points: Since the key result of the paper is the scaling relationship between feature learning strength $\gamma$ and learning rate $\eta$
1. This paper is well-written and the logic is easy to follow. The experiments are nicely done and the theoretical results are well presented. 2. There are some interesting empirical observations. For example, they observe that in the feature learning regime, i.e., $\gamma \geq 1$, the generalization performance ends up being similar across different $\gamma$ if the learning rate is appropriately chosen. I also find the connection between silent alignment and edge of stability quite interest
1. One concern is the significance of this paper. Feature learning is an interesting topic, but I believe that this kind of parameterization, i.e., large $\gamma$ with the model being centered, is not commonly used in practice. Furthermore, the learning rate used in this paper is constant and hence far from practice. Therefore, I am not sure whether the insights observed in this paper can shed light on practical network training. 2. This paper may provide valuable insights for theoretical analysis. Ho
1. The experiments are conducted systematically, making the results highly convincing. The paper is also well-written and clear. 2. The observation that networks with a larger $\gamma$ can often achieve the same or even better performance than those with $\gamma=1$ after sufficient training time is intriguing. This suggests a potential new technique to improve performance in practice. It's also interesting to see that networks with large $\gamma$ exhibit very similar training dynamics after an
1. I am somewhat unclear on how these results translate to practical applications. While the empirical findings in this paper are robust, the setup does not precisely match real-world scenarios (e.g. the architectures are different). I wonder what insights could be derived if we consider a standard practical setup. 2. I believe the paper would benefit from a more detailed and mathematically formulated introduction to the networks’ structure and parametrization.
Taxonomy
Topics: Machine Learning and Data Classification
