Diagonal Rescaling For Neural Networks
Jean Lafond, Nicolas Vasilache, Léon Bottou

TL;DR
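This paper defines a second-order stochastic gradient algorithm for neural networks whose block-diagonal structure effectively normalizes the unit activations. Investigating why the algorithm lacks robustness yields two insights: a new way to scale the stepsizes, and the practical importance of handling fast changes in the curvature of the cost.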
Contribution
A second-order stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations, together with an analysis of why it lacks robustness that sheds light on stepsize scaling and on fast changes of curvature.
Findings
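Suggests a new way to scale the stepsizes that clarifies popular algorithms such as RMSProp
Stresses the practical importance of dealing with fast changes of the curvature of the cost
Relates old neural network tricks, such as fanin stepsize scaling, to the same stepsize-scaling view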
Abstract
We define a second-order neural network stochastic gradient training algorithm whose block-diagonal structure effectively amounts to normalizing the unit activations. Investigating why this algorithm lacks in robustness then reveals two interesting insights. The first insight suggests a new way to scale the stepsizes, clarifying popular algorithms such as RMSProp as well as old neural network tricks such as fanin stepsize scaling. The second insight stresses the practical importance of dealing with fast changes of the curvature of the cost.
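For readers unfamiliar with RMSProp, which the abstract names as an example of stepsize scaling, the following is a minimal sketch of the standard RMSProp update in plain Python. It is not the paper's algorithm; the function name `rmsprop_step`, the hyperparameter defaults, and the placement of `eps` outside the square root are illustrative assumptions.

```python
import math


def rmsprop_step(w, g, v, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update on plain lists of floats (illustrative sketch).

    w: current parameters
    g: stochastic gradient at w
    v: running average of squared gradients, one entry per parameter
    Returns (new_w, new_v). Names and defaults are assumptions, not the paper's.
    """
    # Exponential moving average of the squared gradient, per coordinate.
    new_v = [decay * vi + (1.0 - decay) * gi * gi for vi, gi in zip(v, g)]
    # Diagonal rescaling: each coordinate's stepsize is divided by the
    # root-mean-square of its recent gradients.
    new_w = [
        wi - lr * gi / (math.sqrt(vi) + eps)
        for wi, gi, vi in zip(w, g, new_v)
    ]
    return new_w, new_v


if __name__ == "__main__":
    # Two runs whose gradients differ by a factor of 10.
    w_small, _ = rmsprop_step([1.0], [2.0], [0.0], lr=0.1)
    w_large, _ = rmsprop_step([1.0], [20.0], [0.0], lr=0.1)
    print(w_small, w_large)
```

Starting from a zero running average, the first step has magnitude close to `lr / sqrt(1 - decay)` whatever the gradient scale, so both runs in the usage example move the parameter by nearly the same amount. This insensitivity to gradient scale is the sense in which RMSProp acts as a diagonal rescaling of the stepsizes.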
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Machine Learning and Algorithms
Methods: RMSProp
