Controlling Grokking with Nonlinearity and Data Symmetry
Ahmed Salah, David Yevick

TL;DR
This paper explores how adjusting nonlinearity and data symmetry in neural networks influences grokking behavior, enabling control over generalization and revealing patterns useful for factoring composite moduli.
Contribution
It introduces methods to control grokking through activation functions and network architecture, and links weight entropy and nonlinearity to generalization and data symmetry.
Findings
Increasing nonlinearity by adding layers makes the patterns in the PCA projections of the last-layer weights significantly more uniform.
When P is nonprime, these projection patterns can be used to factor P.
The entropy of the layer weights yields a metric for the network's generalization ability.
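The entropy-based finding can be illustrated with a minimal sketch. The summary does not give the paper's exact entropy definition, so the code below assumes a plain Shannon entropy over a histogram of weight values; the function names `weight_entropy` and `per_neuron_entropy`, the bin count, and the row-per-neuron layout are all hypothetical choices made for illustration.

```python
import numpy as np

def weight_entropy(weights, bins=32):
    """Shannon entropy (in nats) of a histogram of the weight values.

    Assumption: a histogram estimate stands in for whatever entropy
    definition the paper actually uses.
    """
    w = np.asarray(weights, dtype=float).ravel()
    counts, _ = np.histogram(w, bins=bins)
    p = counts[counts > 0] / counts.sum()  # drop empty bins, normalize
    return float(-(p * np.log(p)).sum())

def per_neuron_entropy(weight_matrix, bins=32):
    """Entropy of each row (assumed: one row per neuron), as a 'local' entropy."""
    W = np.asarray(weight_matrix, dtype=float)
    return np.array([weight_entropy(row, bins) for row in W])

# Usage on random stand-in weights (not trained weights from the paper).
rng = np.random.default_rng(0)
W_demo = rng.normal(size=(8, 256))
layer_entropy = weight_entropy(W_demo)
local_entropies = per_neuron_entropy(W_demo)
```

Under this assumed definition, weights concentrated in a few bins give low entropy, while weights spread evenly over the value range approach the maximum of log(bins).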
Abstract
This paper demonstrates that grokking behavior in modular arithmetic with a modulus P in a neural network can be controlled by modifying the profile of the activation function as well as the depth and width of the model. Plotting the even PCA projections of the weights of the last network layer against their odd projections further yields patterns that become significantly more uniform when the nonlinearity is increased by adding layers. These patterns can be employed to factor P when P is nonprime. Finally, a metric for the generalization ability of the network is inferred from the entropy of the layer weights, while the degree of nonlinearity is related to correlations between the local entropy of the weights of the neurons in the final layer.
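The projection step described in the abstract might be sketched as follows. This is only an illustration under stated assumptions: "odd" and "even" are read as the 1-based indices of the principal components (component 1 paired with 2, 3 with 4), the last-layer weight matrix is assumed to have one row per residue class 0..P-1, and random numbers stand in for trained weights. The helper names `pca_projections` and `odd_even_pairs` are hypothetical.

```python
import numpy as np

def pca_projections(W, n_components=4):
    """Project the rows of W onto its leading principal components via SVD."""
    W = np.asarray(W, dtype=float)
    centered = W - W.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T

def odd_even_pairs(proj):
    """Split projections into odd (1st, 3rd, ...) and even (2nd, 4th, ...) components."""
    odd = proj[:, 0::2]
    even = proj[:, 1::2]
    k = min(odd.shape[1], even.shape[1])
    return odd[:, :k], even[:, :k]

# Stand-in for trained last-layer weights (assumption: shape (P, hidden width)).
P, width = 12, 64
rng = np.random.default_rng(0)
W_last = rng.normal(size=(P, width))

proj = pca_projections(W_last, n_components=4)
odd, even = odd_even_pairs(proj)
# Each column pair (odd[:, i], even[:, i]) gives the x and y coordinates
# of one scatter plot with P points.
```

With random weights the scatter plots show no structure; the abstract's claim concerns trained networks, where the resulting patterns are reported to become more uniform as layers are added.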
Taxonomy
Topics: Neural Networks and Applications
Methods: Principal Components Analysis
