Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
Francois Caron, Fadhel Ayed, Paul Jung, Hoil Lee, Juho Lee, Hongseok Yang

TL;DR
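This paper studies wide, shallow neural networks in which each hidden node's output has its own positive scaling parameter. It proves that, for large networks and with high probability, gradient flow and gradient descent converge to a global minimum while still learning features, unlike under the NTK parameterisation. Experiments illustrate the theory and point to benefits for pruning and transfer learning.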
Contribution
It introduces asymmetrical node scaling for shallow neural networks, where the hidden nodes' scaling parameters are non-identical, and provides theoretical guarantees of global convergence and feature learning.
Findings
For sufficiently large networks, gradient flow and gradient descent converge to a global minimum with high probability.
Networks with asymmetrical scaling can learn features, unlike NTK models.
Experimental results support theoretical claims and highlight benefits for pruning and transfer learning.
Abstract
We consider gradient-based optimisation of wide, shallow neural networks, where the output of each hidden node is scaled by a positive parameter. The scaling parameters are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that for large such neural networks, with high probability, gradient flow and gradient descent converge to a global minimum and can learn features in some sense, unlike in the NTK parameterisation. We perform experiments illustrating our theoretical results and discuss the benefits of such scaling in terms of prunability and transfer learning.
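The node scaling described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ReLU activation, the fixed random output signs, the mixing parameter `gamma`, and the power-law weights controlled by `alpha` are all assumptions made for this sketch. Setting `gamma = 1` gives every node the same scale `1/m`, which corresponds to the symmetric NTK-style parameterisation; `gamma < 1` makes the scales non-identical.

```python
import math
import random


def node_scales(m, gamma, alpha=2.0):
    """Positive per-node scales that sum to 1.

    gamma = 1 gives the symmetric scale 1/m for every node (NTK-style).
    gamma < 1 mixes in unequal weights; the normalised power law j**(-alpha)
    is a hypothetical choice made for this sketch.
    """
    raw = [j ** (-alpha) for j in range(1, m + 1)]
    total = sum(raw)
    return [gamma / m + (1.0 - gamma) * r / total for r in raw]


def forward(x, weights, signs, scales):
    """f(x) = sum_j sqrt(scale_j) * sign_j * relu(<w_j, x> / sqrt(d))."""
    d = len(x)
    out = 0.0
    for w, a, lam in zip(weights, signs, scales):
        pre = sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d)
        out += math.sqrt(lam) * a * max(pre, 0.0)
    return out


rng = random.Random(0)
m, d = 1000, 5  # width and input dimension (illustrative values)
weights = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
signs = [rng.choice([-1.0, 1.0]) for _ in range(m)]
x = [1.0, 0.5, -0.3, 0.2, 0.0]

ntk_scales = node_scales(m, gamma=1.0)   # identical scales
asym_scales = node_scales(m, gamma=0.5)  # non-identical scales

print("symmetric output: ", forward(x, weights, signs, ntk_scales))
print("asymmetric output:", forward(x, weights, signs, asym_scales))
print("largest/smallest asymmetric scale:", max(asym_scales) / min(asym_scales))
```

With asymmetric scales, a few nodes carry a non-vanishing share of the output even as `m` grows, which is one intuitive reading of why such networks might be easier to prune; the sketch only shows the parameterisation and makes no claim about training behaviour.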
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques
Methods: Pruning · Neural Tangent Kernel
