Understanding Gradient Descent through the Training Jacobian

Nora Belrose; Adam Scherlis

arXiv:2412.07003·cs.LG·December 12, 2024

TL;DR

This paper investigates the geometry of neural network training using the Jacobian of the trained parameters with respect to their initial values. It finds low-dimensional, data-dependent structure and a three-region singular value spectrum that governs how perturbations to the initialization affect network outputs, chiefly far out-of-distribution.

Contribution

It introduces a detailed analysis of the Jacobian's spectral structure during training, highlighting its low-dimensional nature and implications for model robustness and initialization effects.

Findings

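1. The singular value spectrum of the training Jacobian has three regions: chaotic (values far above one), bulk (values extremely close to one), and stable (values below one).

2. Perturbations to the initialization along bulk directions are carried through training almost unchanged.

3. These perturbations have virtually no effect on in-distribution outputs but do affect predictions far out-of-distribution.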

Abstract

We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a "chaotic" region of values orders of magnitude greater than one, a large "bulk" region of values extremely close to one, and a "stable" region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network's output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian…
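The bulk and stable regions described in the abstract can be reproduced in a toy setting. The sketch below is not the authors' code or experimental setup: it assumes a linear model trained with full-batch gradient descent on mean squared error, and it estimates the Jacobian of the final parameters with respect to the initial ones by finite differences. The names `train` and `training_jacobian`, the problem sizes, and the learning rate are all illustrative assumptions. A linear model has no chaotic region, so only the bulk and stable regions appear here.

```python
import numpy as np

def train(theta0, X, y, lr=0.05, steps=200):
    """Full-batch gradient descent on mean squared error for a linear model."""
    theta = theta0.copy()
    n = X.shape[0]
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n
        theta = theta - lr * grad
    return theta

def training_jacobian(theta0, X, y, eps=1e-4):
    """Finite-difference estimate of d(theta_final) / d(theta_init).

    Training is an affine map of theta0 for this toy model, so forward
    differences are exact up to floating-point error.
    """
    base = train(theta0, X, y)
    d = theta0.size
    J = np.empty((d, d))
    for i in range(d):
        bumped = theta0.copy()
        bumped[i] += eps
        J[:, i] = (train(bumped, X, y) - base) / eps
    return J

rng = np.random.default_rng(0)
n, d = 5, 20  # fewer samples than parameters: the data spans only n directions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta0 = rng.normal(size=d)

J = training_jacobian(theta0, X, y)
sv = np.linalg.svd(J, compute_uv=False)

# Bulk: singular values numerically equal to one (directions the data never touches).
n_bulk = int(np.sum(np.abs(sv - 1.0) < 1e-6))
# Stable: singular values below one (directions contracted by training).
n_stable = int(np.sum(sv < 1.0 - 1e-6))

# Same inputs, different labels: the Jacobian should not change for this model.
J_other_labels = training_jacobian(theta0, X, rng.normal(size=n))
labels_matter = not np.allclose(J, J_other_labels, atol=1e-8)

print("bulk:", n_bulk, "stable:", n_stable, "labels matter:", labels_matter)
```

In this toy case the Jacobian equals a power of (I - lr * XᵀX / n), so it depends on the inputs but not on the labels at all, and the d - n directions orthogonal to the data pass through training unchanged. The paper's claim for neural networks is weaker (label dependence is small rather than absent), and this sketch should not be read as evidence for it.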

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

- Understanding the training dynamics of deep neural networks is interesting and can have lots of potential in accelerating both training and inference of these large models.
- The paper is well written and most of the figures are clear.

Weaknesses

- The biggest weakness is that the experiments are carried out on extremely simple datasets and models. Therefore it is unclear whether the conclusions being made can apply to modern deep learning architectures and datasets. I would suggest the authors either provide some theoretical guarantees or conduct experiments on more realistic models/datasets.
- The other major weakness is that it is hard to understand how this work fits into the current body of literature and what new insights it bring

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The paper is well written. The subject is interesting and this paper sheds light on insightful phenomena.

Weaknesses

- The code is not available.
- The bibliography seems quite light (10 cited papers).
- The authors need to formalize some concepts rigorously. For example, the 'dimensionality of training' is not defined, though the notion seems clear to the authors, as they claim "Clearly (...) the dimensionality of training is equal to (...)" ll. 089, 090.
- I did not understand the part l. 092-094, so I would like to have more precise explanations.
- Typos: "to the the Jacobian of" l. 240

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1) The paper demonstrates empirically three different regions of the singular value spectrum of the training Jacobian around initialization. This is a heavily explored research direction, and this paper provides further empirical evidence on the training characterization of shallow NNs, including random feature models and two-layer NNs in the lazy training regime.
2) They further report that the structure of the input data plays a key role (largely independent of labels) aligning with prior result

Weaknesses

1) The current work addresses a specific direction: exploring neural network training via spectral analysis of the Jacobian of the trained parameters with respect to initialization. This regime has already been shown to perform equivalently to linear models, and the literature already focuses largely on alternative methods that outperform linear models. Hence, the motivation of the current work and how it interplays with the rest of the literature lack fundamental details.
2) The

Reviewer 04 · Rating 3 · Confidence 4

Strengths

The study of training through the 'training Jacobian' appears novel, with many creative experimental methodologies to extract novel relationships between this Jacobian, the 'importance' of eigenvectors, and the predictions. Good presentation with interesting experiments and clear commentary. The insights drawn are computed largely from properties of the network at initialization, which can be powerful to reveal insights of model performance without costly training.

Weaknesses

The study is limited to a very small set of model/architecture/task combinations. While the demarcated subspaces of bulk, chaotic, and stable are loose, the additional phase changes observed in the experiments (e.g. Fig 3, Fig 5, Fig 6) are less clear. The empirical evidence suggests the existence of further structure, where some 'thresholds' appear to be taken to match the evidence with the 3-region characterization provided. While the insights are novel and interesting, the relevance to pract

Reviewer 05 · Rating 3 · Confidence 4

Strengths

- The question is undoubtedly an interesting one; understanding the behaviour of parameters during optimisation is important for designing better methods.
- The authors present their findings in a very clear and engaging way.

Weaknesses

I am unfortunately not convinced of the strength of the findings. My major concerns are as follows:
- There appears to be quite a simple explanation for the main findings. The authors themselves show that the 'parameter function Jacobian' has a very similar structure in its singular values, and it seems entirely to be expected that the Jacobian structure will manifest itself during gradient descent. If the network's predictions are indifferent to parameter changes in a given direction (as measure

Code & Models

Repositories

eleutherai/training-jacobian (Official)

Taxonomy

Topics: Human Resource Development and Performance Evaluation