Correlations Are Ruining Your Gradient Descent

Nasir Ahmad

arXiv:2407.10780·cs.LG·August 26, 2025

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper discusses how data correlations hinder gradient descent in neural networks and introduces decorrelation methods that improve training speed and accuracy, especially for approximate and distributed learning systems.
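
To make the claim above concrete, here is a minimal NumPy sketch comparing plain gradient descent on a linear least-squares problem with correlated inputs against the same problem after whitening. It is not taken from the paper or its repository: the toy data, the helper name `gd_loss`, the step size `1 / lambda_max`, and the use of ZCA whitening as a stand-in for decorrelation are all assumptions made for illustration.

```python
import numpy as np

def gd_loss(X, y, steps=50):
    """Plain gradient descent on 0.5 * mean squared error; returns the final loss."""
    n, d = X.shape
    w = np.zeros(d)
    # The Hessian of this loss is the input second-moment matrix C = X^T X / n.
    # Step size 1 / lambda_max is the usual stable choice for a quadratic.
    C = X.T @ X / n
    lr = 1.0 / np.linalg.eigvalsh(C).max()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    return float(0.5 * np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
n, d = 2000, 5

# Toy correlated inputs (assumption): independent noise plus a strong shared
# component, zero-mean by construction.
shared = rng.normal(size=(n, 1))
X = rng.normal(size=(n, d)) + 3.0 * shared
w_true = rng.normal(size=d)
y = X @ w_true  # noiseless targets, so the best achievable loss is 0

# ZCA whitening (a standard technique, swapped in here for illustration):
# X_white has an identity second-moment matrix.
C = X.T @ X / n
evals, evecs = np.linalg.eigh(C)
W_zca = evecs @ np.diag(evals ** -0.5) @ evecs.T
X_white = X @ W_zca

loss_raw = gd_loss(X, y)            # correlated inputs
loss_white = gd_loss(X_white, y)    # decorrelated (whitened) inputs
print(f"raw: {loss_raw:.3e}  whitened: {loss_white:.3e}")
```

With whitened inputs the Hessian is the identity, so gradient descent reaches the least-squares solution almost immediately, while the correlated problem is still visibly short of convergence after the same number of steps.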

Contribution

The paper extends insights from natural gradient descent to show why decorrelation matters at every layer of a neural network, and proposes a novel decorrelation method that improves backpropagation and its approximations.

Findings

01 · Decorrelating layer outputs accelerates training.

02 · Decorrelation improves the accuracy and convergence of approximate backpropagation.

03 · The approach has potential applications in neuromorphic hardware and neuroscience.

Abstract

Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a common discussion. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model's parameters. To solve this requires a method for decorrelating inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a…
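
As a rough illustration of what decorrelating the inputs to a layer can look like, the sketch below iteratively adapts a matrix `R` so that the outputs `x = R z` have vanishing off-diagonal covariance. This is a generic iterative decorrelation rule written for this page, not necessarily the update proposed in the paper; the names `R`, `eta`, and `offdiag_mean`, the toy data, and the step size are all hypothetical.

```python
import numpy as np

def offdiag_mean(C):
    """Mean absolute off-diagonal entry of a square matrix."""
    mask = ~np.eye(C.shape[0], dtype=bool)
    return float(np.abs(C[mask]).mean())

rng = np.random.default_rng(0)
n, d = 5000, 4

# Toy correlated, zero-mean inputs (assumption): noise plus a shared component.
Z = rng.normal(size=(n, d)) + 0.8 * rng.normal(size=(n, 1))

R = np.eye(d)   # hypothetical decorrelating matrix, applied as x = R z
eta = 0.1       # illustrative step size, not taken from the paper

before = offdiag_mean(Z.T @ Z / n)
for _ in range(200):
    X = Z @ R.T                    # decorrelated outputs, one row per sample
    C = X.T @ X / n                # covariance estimate of the outputs
    O = C - np.diag(np.diag(C))    # off-diagonal part: the remaining correlations
    R = R - eta * O @ R            # push the off-diagonal terms toward zero

X = Z @ R.T
after = offdiag_mean(X.T @ X / n)
print(f"mean |off-diagonal| before: {before:.3f}, after: {after:.2e}")
```

The rule targets only the off-diagonal terms, so it decorrelates without forcing unit variance; full whitening would additionally rescale each output.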

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The focus on the input correlation part of natural gradient descent is novel, as is the proposed decorrelation mechanism.
- The observation that this improves performance for BP alternatives is very significant, since these methods have the potential to be more efficient for training.
- The narrative exposition was easy to follow, and the concepts are explained very clearly.
- The (almost) empirical evaluation supports the conclusions of the paper.
- The connections to biology are interesting.

Weaknesses

- The discussion of methods from computational neuroscience is very cursory, and would benefit from more details such as forms of specific learning rules.
- The recurrence aspect of the decorrelation rule could be discussed in more depth in the main text.
- It's not clear if the proposed decorrelation method only has a recurrent formulation, or can be efficiently implemented without recurrence as well.

Reviewer 02 · Rating 3 · Confidence 3

Strengths

• The initial sections of the paper provide an insightful overview of Natural Gradient Descent and offer a compelling perspective on how Amari’s theoretical framework can be contextualized within deep learning. While typical applications of Natural Gradients focus on Bayesian inference, where the Fisher information naturally defines the metric, the authors present an innovative interpretation that broadens its relevance.
• The simulations included in the study convincingly demonstrate that impl

Weaknesses

The primary issue with this paper is its limited coherence and lack of substantial innovation. The manuscript presents small incremental contributions across several topics and attempts to integrate them into a unified framework. This approach results in a paper that straddles the line between a review and an opinion piece, which does not align well with the expectations for an ICLR submission. Below, I outline the specific areas where the paper falls short in terms of novelty: 1. **Review of Na

Reviewer 03 · Rating 5 · Confidence 3

Strengths

I believe this is an original synthesis of concepts resulting in an original and significant modification to approximate gradient methods. This paper is extremely well written, and generally of high clarity and quality throughout.

Weaknesses

- It is not clear that the alignment to natural gradients is what is underlying the improved performance.
- No assessment of how successful the decorrelation updates are in aligning regular updates to the natural updates. How big is the contribution of gradient correlations?
- While the improvements in FA and NP are striking, the experimental results are not finely tuned. It is possible with GD to obtain better results than those shown, and it is unknown if FA and NP will match this improvemen

Code & Models

Repositories

nasiryahm/correlationsruingd
pytorch · Official

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices

Methods: Natural Gradient Descent