Correlations Are Ruining Your Gradient Descent
Nasir Ahmad

TL;DR
This paper discusses how data correlations hinder gradient descent in neural networks and introduces decorrelation methods that improve training speed and accuracy, especially for approximate and distributed learning systems.
Contribution
It extends insights from natural gradient descent to show why decorrelation matters at every layer of a neural network, and proposes a novel decorrelation method that improves backpropagation and its approximations.
Findings
Decorrelating layer outputs accelerates training.
Decorrelation improves the accuracy and convergence of approximate backpropagation.
The approach has potential applications in neuromorphic hardware and neuroscience.
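The finding that decorrelation accelerates training can be illustrated on a toy problem. The sketch below is not the paper's method: it assumes a single linear layer trained with plain gradient descent on a squared-error loss, and it swaps in offline ZCA whitening as a stand-in for whatever decorrelation rule the paper proposes. The pairwise correlation of 0.9, the tolerance, and all function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 5

# Inputs with pairwise correlation 0.9 (illustrative choice, not from the paper).
target_cov = 0.1 * np.eye(n_features) + 0.9 * np.ones((n_features, n_features))
chol = np.linalg.cholesky(target_cov)
X_raw = rng.normal(size=(n_samples, n_features)) @ chol.T

# Targets come from a hypothetical ground-truth linear map.
w_true = rng.normal(size=n_features)
t = X_raw @ w_true


def second_moment(X):
    """Empirical second-moment matrix E[x x^T] of the rows of X."""
    return X.T @ X / len(X)


def zca_whiten(X, eps=1e-8):
    """Decorrelate and rescale X so its second-moment matrix is ~identity."""
    eigval, eigvec = np.linalg.eigh(second_moment(X))
    whitener = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return X @ whitener


def gd_steps_to_tol(X, t, tol=1e-6, max_steps=10_000):
    """Count gradient descent steps until the mean squared loss drops below tol."""
    # Step size 1 / lambda_max keeps gradient descent stable for this quadratic loss.
    lr = 1.0 / np.linalg.eigvalsh(second_moment(X)).max()
    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        err = X @ w - t
        if 0.5 * np.mean(err ** 2) < tol:
            return step
        w -= lr * X.T @ err / len(X)
    return max_steps


X_white = zca_whiten(X_raw)

cond_raw = np.linalg.cond(second_moment(X_raw))
cond_white = np.linalg.cond(second_moment(X_white))
steps_raw = gd_steps_to_tol(X_raw, t)
steps_white = gd_steps_to_tol(X_white, t)

print(f"condition number: raw={cond_raw:.1f}, whitened={cond_white:.4f}")
print(f"GD steps to tolerance: raw={steps_raw}, whitened={steps_white}")
```

For this loss the curvature with respect to the weights is the input second-moment matrix, so the condition number printed above is what governs how slowly gradient descent converges on the correlated inputs.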
Abstract
Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a common discussion. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model's parameters. To solve this requires a method for decorrelating inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a…
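The abstract's point about correlations at a linear transformation can be made concrete with a standard least-squares example. This is a textbook illustration under stated assumptions (a single linear layer $y = Wx$, targets $t$, squared-error loss), not the paper's own derivation; the symbols $\Sigma_x$ and $\tilde{x}$ are introduced here for illustration.

```latex
\begin{align}
  \mathcal{L}(W) &= \tfrac{1}{2}\,\mathbb{E}\!\left[\lVert Wx - t \rVert^2\right],
  \qquad
  \nabla_W \mathcal{L} = \mathbb{E}\!\left[(Wx - t)\,x^\top\right], \\
  \nabla_W \mathcal{L}(W + \Delta W) - \nabla_W \mathcal{L}(W)
  &= \Delta W \,\Sigma_x,
  \qquad
  \Sigma_x = \mathbb{E}\!\left[x x^\top\right], \\
  \Delta W_{\text{Newton}} &= -\,\nabla_W \mathcal{L}\;\Sigma_x^{-1}.
\end{align}
% With decorrelated, unit-variance inputs \tilde{x} = \Sigma_x^{-1/2} x we have
% \mathbb{E}[\tilde{x}\tilde{x}^\top] = I, so the plain gradient step already
% points in the curvature-corrected direction.
```

In this setting the curvature of the loss is set entirely by the input correlations $\Sigma_x$, which is why removing them at the input of a linear map makes the ordinary gradient direction coincide with the curvature-corrected one.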
Peer Reviews
Decision: Submitted to ICLR 2025
- The focus on the input correlation part of natural gradient descent is novel, as is the proposed decorrelation mechanism.
- The observation that this improves performance for BP alternatives is very significant, since these methods have the potential to be more efficient for training.
- The narrative exposition was easy to follow, and the concepts are explained very clearly.
- The (almost) empirical evaluation supports the conclusions of the paper.
- The connections to biology are interesting.
- The discussion of methods from computational neuroscience is very cursory, and would benefit from more details such as the forms of specific learning rules.
- The recurrence aspect of the decorrelation rule could be discussed in more depth in the main text.
- It's not clear if the proposed decorrelation method only has a recurrent formulation, or can be efficiently implemented without recurrence as well.
• The initial sections of the paper provide an insightful overview of Natural Gradient Descent and offer a compelling perspective on how Amari’s theoretical framework can be contextualized within deep learning. While typical applications of Natural Gradients focus on Bayesian inference, where the Fisher information naturally defines the metric, the authors present an innovative interpretation that broadens its relevance.
• The simulations included in the study convincingly demonstrate that impl
The primary issue with this paper is its limited coherence and lack of substantial innovation. The manuscript presents small incremental contributions across several topics and attempts to integrate them into a unified framework. This approach results in a paper that straddles the line between a review and an opinion piece, which does not align well with the expectations for an ICLR submission. Below, I outline the specific areas where the paper falls short in terms of novelty: 1. **Review of Na
I believe this is an original synthesis of concepts resulting in an original and significant modification to approximate gradient methods. This paper is extremely well written, and generally of high clarity and quality throughout.
- It is not clear that the alignment to natural gradients is what is underlying the improved performance.
- No assessment of how successful the decorrelation updates are in aligning regular updates to the natural updates. How big is the contribution of gradient correlations?
- While the improvements in FA and NP are striking, the experimental results are not finely tuned. It is possible with GD to obtain better results than those shown, and it is unknown if FA and NP will match this improvemen
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
Methods: Natural Gradient Descent
