TL;DR
This paper investigates how preconditioned gradient descent (PGD), such as Gauss-Newton, affects spectral bias and grokking in neural networks. It combines theory and experiments to show that PGD can ease the transition from the NTK (lazy) regime to the rich, feature-learning regime.
Contribution
The paper gives a theoretical and empirical analysis of how PGD mitigates spectral bias and shortens the delay associated with grokking, and it indicates that PGD can promote the transition from the NTK regime to the rich regime.
Findings
PGD reduces the effects of spectral bias in neural network training.
Experimental results indicate that PGD accelerates grokking and the transition to the rich regime.
Grokking is treated, following an existing hypothesis, as a transition from the lazy (NTK) regime to the rich learning regime.
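The first finding can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a model that is linear in its parameters, with sine features scaled by 1/k so that high frequencies have small kernel eigenvalues (a stand-in for the spectral decay of an NTK). The function `fit`, the step sizes, the damping value, and the chosen frequencies are all hypothetical choices made for illustration.

```python
import numpy as np

def fit(preconditioned, steps=50, n=64, freqs=(1, 8), damping=1e-8):
    """Fit a sum of sines and return the amplitude error per target frequency.

    Assumed toy setup (not from the paper): f(x) = sum_k w_k * sin(2*pi*k*x) / k,
    so the Jacobian J is fixed and J^T J / n has eigenvalues 1 / (2 k^2).
    """
    x = np.arange(n) / n
    k = np.arange(1, max(freqs) + 1)
    J = np.sin(2 * np.pi * np.outer(x, k)) / k      # shape (n, K)
    y = sum(np.sin(2 * np.pi * f * x) for f in freqs)
    w = np.zeros(len(k))
    for _ in range(steps):
        r = J @ w - y                                # residual
        g = J.T @ r                                  # gradient of 0.5 * ||r||^2
        if preconditioned:
            # Damped Gauss-Newton step: precondition with (J^T J + damping * I)^-1.
            w -= 0.5 * np.linalg.solve(J.T @ J + damping * np.eye(len(k)), g)
        else:
            # Plain gradient descent on the mean squared residual.
            w -= 1.0 * g / n
    learned_amplitude = w / k
    target_amplitude = np.isin(k, freqs).astype(float)
    err = np.abs(learned_amplitude - target_amplitude)
    return {f: float(err[f - 1]) for f in freqs}

gd_err = fit(preconditioned=False)
gn_err = fit(preconditioned=True)
print("GD error per frequency:", gd_err)
print("GN error per frequency:", gn_err)
```

In this setup plain gradient descent contracts the error in frequency k by a factor of 1 - 1/(2k^2) per step, so the k = 8 component is still largely unlearned after 50 steps while k = 1 has converged. The Gauss-Newton step contracts every frequency by roughly the same factor, which is the sense in which preconditioning removes the frequency-dependent learning rate in this toy model.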
Abstract
Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with…