Does SGD really happen in tiny subspaces?
Minhak Song, Kwangjun Ahn, Chulhee Yun

TL;DR
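This paper investigates whether neural networks can be trained within the dominant subspace of the loss Hessian. It finds that SGD updates projected onto that subspace fail to decrease the training loss further, whereas updates with the dominant component removed train as effectively as standard SGD. The observed gradient alignment is therefore likely spurious, which challenges previous assumptions.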
Contribution
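The study demonstrates that projecting SGD updates onto the dominant subspace stalls the decrease in training loss, while projecting the dominant subspace out is just as effective as the original update. The gradient's alignment with this subspace is therefore misleading and not essential for effective training.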
Findings
SGD updates projected onto the dominant subspace do not decrease the training loss further.
Removing the dominant subspace component does not impair training.
Alignment with the dominant subspace is likely spurious and not causally beneficial.
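One way to write down the three update rules these findings compare is sketched below. The notation is an assumption made for illustration and is not taken from this page: $u_1, \dots, u_k$ denote the top-$k$ eigenvectors of the training loss Hessian at $\theta_t$, $g_t$ the stochastic gradient, $\eta$ the learning rate, and $\chi_k$ the fraction of the squared gradient norm that lies in the dominant subspace.

```latex
\begin{align*}
P_k(\theta_t) &= \sum_{i=1}^{k} u_i u_i^\top,
\qquad
\chi_k(g_t) = \frac{\lVert P_k(\theta_t)\, g_t \rVert_2^2}{\lVert g_t \rVert_2^2} \\
\text{SGD:}\quad \theta_{t+1} &= \theta_t - \eta\, g_t \\
\text{Dom-SGD:}\quad \theta_{t+1} &= \theta_t - \eta\, P_k(\theta_t)\, g_t \\
\text{Bulk-SGD:}\quad \theta_{t+1} &= \theta_t - \eta\, \bigl(I - P_k(\theta_t)\bigr)\, g_t
\end{align*}
```

In this notation, the findings say that Dom-SGD stalls even when $\chi_k$ is close to 1, while Bulk-SGD tracks SGD even though it discards the larger part of $g_t$.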
Abstract
Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of…
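The projection described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names, the diagonal toy Hessian, and the choice k=2 are all assumptions, and the example only shows how a gradient splits into dominant and bulk components. It does not reproduce the paper's training results.

```python
import numpy as np

def dominant_projector(hessian, k):
    """Projector onto the span of the top-k eigenvectors of a symmetric Hessian."""
    # np.linalg.eigh returns eigenvalues in ascending order,
    # so the last k columns are the top-k eigenvectors.
    _, eigvecs = np.linalg.eigh(hessian)
    top = eigvecs[:, -k:]
    return top @ top.T

def split_update(grad, projector):
    """Split a gradient into its dominant-subspace and bulk-subspace components."""
    dom = projector @ grad
    bulk = grad - dom
    return dom, bulk

def alignment(grad, projector):
    """Fraction of the squared gradient norm lying in the dominant subspace."""
    dom = projector @ grad
    return float(dom @ dom) / float(grad @ grad)

# Hypothetical toy setup: quadratic loss 0.5 * theta^T H theta
# with two sharp directions and three flat ones.
hessian = np.diag([100.0, 50.0, 1.0, 0.5, 0.1])
theta = np.ones(5)
grad = hessian @ theta  # gradient of the quadratic at theta

P = dominant_projector(hessian, k=2)
dom, bulk = split_update(grad, P)

lr = 0.01  # assumed learning rate
theta_dom = theta - lr * dom    # step using only the dominant component
theta_bulk = theta - lr * bulk  # step with the dominant component projected out

print("alignment:", alignment(grad, P))
print("dominant step:", theta_dom)
print("bulk step:", theta_bulk)
```

In this toy, nearly all of the gradient norm lies in the dominant subspace, yet the bulk step is the only one that moves the flat coordinates. For a real network the full Hessian is typically too large for `np.linalg.eigh`, so the top eigenvectors would have to be estimated, for example with an iterative method built on Hessian-vector products.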
Peer Reviews
Decision·ICLR 2025 Poster
This paper systematically investigates the phenomenon of gradient-Hessian alignment in various optimization algorithms, including SGD, GD in the Edge of Stability (EoS) regime, and the Sharpness-Aware Minimization (SAM) algorithm, with a particular emphasis on the analysis of SGD. Previous studies have primarily focused on understanding full-batch algorithms like GD or Adaptive GD without stochasticity. Moreover, this work takes an initial step toward understanding the 'spurious' alignment whe
First, the so-called 'spurious alignment' has, in my opinion, long been observed in the EoS literature. For example, Damian et al. [2] showed that in the EoS regime, (1) the gradient alignment happens, and (2) the loss decrement in the EoS regime depends on the constrained trajectory **by projecting out the top eigenspace**. It is exactly the finding listed in section 5.1 of this paper that bulk-GD is as effective as GD. I believe the authors should discuss those related works. The author may argue that "The me
Many recent papers (Blanc et al., 2022; Li et al., 2022) highlight that the dynamics in bulk subspace (i.e., along flat directions) are crucial for SGD/SAM to move to flat minima, thereby improving generalization. In contrast, this paper emphasizes the significant role of the dynamics in bulk subspace (i.e., along flat directions) in relation to optimization.
My primary concerns are (i) the paper does not sufficiently explain the main finding, i.e., why only the dynamics in the bulk subspace are crucial for optimization, and (ii) the novelty of many contexts: - Section 4. This section focuses on the alignment between the stochastic gradient and the sharp directions. However, - This section fails to adequately explain the main finding: why only the dynamics in the bulk subspace are crucial for optimization, except for a very toy model. - Even rega
The paper addresses a question of significant interest in the ML community regarding SGD's effectiveness in low-dimensional subspaces. It helps address potential misconceptions arising from the well-cited work of Gur-Ari et al. (2018), which suggests that gradient descent would occur primarily within a tiny subspace. A convincing quadratic toy model effectively reinforces the authors’ interpretation of the empirical results, lending credibility to their main conclusions. The paper is well-stru
Although the paper includes experiments on three datasets, training is restricted to small subsets (e.g., 5,000 of 50,000 samples of CIFAR10) and primarily uses mean squared error loss instead of cross-entropy, despite focusing on classification tasks. This may limit the generalizability of the findings and should be communicated more clearly. The paper exclusively examines the effects on training loss, with no analysis of test accuracy under Dom-SGD and Bulk-SGD. Including at least one plot of
Taxonomy
Topics: Systemic Sclerosis and Related Diseases
Methods: Sharpness-Aware Minimization · Stochastic Gradient Descent
