Does SGD really happen in tiny subspaces?

Minhak Song; Kwangjun Ahn; Chulhee Yun

arXiv:2405.16002·cs.LG·March 12, 2025

TL;DR

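This paper investigates whether neural networks can be trained within the dominant subspace of the loss Hessian. It finds that projecting SGD updates onto this subspace stops the training loss from decreasing, whereas projecting the dominant subspace out trains as effectively as the original update. The observed gradient alignment is therefore likely spurious, which challenges previous assumptions.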

Contribution

The study demonstrates that SGD updates projected onto the dominant subspace fail to reduce the training loss further, revealing that the gradient's alignment with this subspace is misleading and not essential for effective training.

Findings

01

SGD updates projected onto the dominant subspace do not decrease the training loss further.

02

Removing the dominant subspace component does not impair training.

03

Alignment with the dominant subspace is likely spurious and not causally beneficial.

Abstract

Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of…
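To make the two projections concrete, here is a minimal sketch on a toy problem chosen purely for illustration; it is not the paper's experimental setup. It assumes a two-dimensional quadratic loss with a diagonal Hessian, so the dominant subspace is simply the coordinate axis with the largest curvature. It also replaces stochastic noise with deterministic gradient descent whose step size sits exactly at the stability threshold 2/λ_max. The function names, curvatures, starting point, and step size are all hypothetical.

```python
# Toy illustration (assumed setup, not the paper's code): a quadratic loss
# L(w) = 0.5 * sum_i c_i * w_i^2 whose Hessian is diag(c). For a diagonal
# Hessian the top-k eigenspace is spanned by the k axes of largest curvature.

CURV = [10.0, 0.1]   # assumed curvatures: one sharp direction, one flat
W0 = [1.0, 10.0]     # assumed starting point
LR = 0.2             # assumed step size, equal to 2 / max(CURV)


def loss(w, curv):
    return 0.5 * sum(c * x * x for c, x in zip(curv, w))


def grad(w, curv):
    return [c * x for c, x in zip(curv, w)]


def dominant_mask(curv, k):
    """1.0 on the k coordinates with the largest curvature, 0.0 elsewhere."""
    order = sorted(range(len(curv)), key=lambda i: curv[i], reverse=True)
    top = set(order[:k])
    return [1.0 if i in top else 0.0 for i in range(len(curv))]


def alignment(w, curv, k):
    """Fraction of the squared gradient norm inside the dominant subspace."""
    g = grad(w, curv)
    mask = dominant_mask(curv, k)
    total = sum(gi * gi for gi in g)
    return sum(m * gi * gi for m, gi in zip(mask, g)) / total


def step(w, curv, lr, k, mode):
    """One update: 'full' (plain), 'dom' (projected onto the dominant
    subspace), or 'bulk' (dominant subspace projected out)."""
    g = grad(w, curv)
    mask = dominant_mask(curv, k)
    if mode == "dom":
        g = [m * gi for m, gi in zip(mask, g)]
    elif mode == "bulk":
        g = [(1.0 - m) * gi for m, gi in zip(mask, g)]
    elif mode != "full":
        raise ValueError("mode must be 'full', 'dom', or 'bulk'")
    return [x - lr * gi for x, gi in zip(w, g)]


def run(mode, steps=200):
    """Final loss after `steps` updates of the chosen kind from W0."""
    w = list(W0)
    for _ in range(steps):
        w = step(w, CURV, LR, 1, mode)
    return loss(w, CURV)


if __name__ == "__main__":
    print("initial loss:", loss(W0, CURV))
    print("gradient alignment at start:", alignment(W0, CURV, 1))
    for mode in ("full", "dom", "bulk"):
        print(mode, "final loss:", run(mode))
```

In this constructed example the gradient starts almost entirely inside the dominant subspace, yet the "dom" run only bounces back and forth along the sharp axis and its loss stays at the initial value, while the "bulk" run ends at the same loss as the unprojected update. The toy is built to mirror the qualitative claim in the abstract and says nothing about how large the effect is in real networks.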

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This paper systematically investigates the phenomenon of gradient-Hessian alignment in various optimization algorithms, including SGD, GD in the Edge of Stability (EoS) regime, and the Sharpness-Aware Minimization (SAM) algorithm, with a particular emphasis on the analysis of SGD. Previous studies have primarily focused on understanding full-batch algorithms like GD or Adaptive GD without stochasticity. Moreover, this work takes an initial step toward understanding the 'spurious' alignment whe

Weaknesses

First, the so-called 'spurious alignment' has, in my opinion, long been observed in the EoS literature. For example, Damian et al. [2] showed that in the EoS regime (1) gradient alignment happens and (2) the loss decrement depends on the constrained trajectory obtained **by projecting out the top eigenspace**. This is exactly the finding listed in Section 5.1 of this paper, that bulk-GD is as effective as GD. I believe the authors should discuss those related works. The author may argue that "The me

Reviewer 02 · Rating 3 · Confidence 5

Strengths

Many recent papers (Blanc et al., 2022; Li et al., 2022) highlight that the dynamics in bulk subspace (i.e., along flat directions) are crucial for SGD/SAM to move to flat minima, thereby improving generalization. In contrast, this paper emphasizes the significant role of the dynamics in bulk subspace (i.e., along flat directions) in relation to optimization.

Weaknesses

My primary concerns are (i) the paper does not sufficiently explain the main finding, i.e., why only the dynamics in the bulk subspace are crucial for optimization, and (ii) the novelty of many contexts: - Section 4. This section focuses on the alignment between the stochastic gradient and the sharp directions. However, - This section fails to adequately explain the main finding: why only the dynamics in the bulk subspace are crucial for optimization, except for a very toy model. - Even rega

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The paper addresses a question of significant interest in the ML community regarding SGD's effectiveness in low-dimensional subspaces. It helps address potential misconceptions arising from the well-cited work of Gur-Ari et al. (2018), which suggests that gradient descent would occur primarily within a tiny subspace. A convincing quadratic toy model effectively reinforces the authors’ interpretation of the empirical results, lending credibility to their main conclusions. The paper is well-stru

Weaknesses

Although the paper includes experiments on three datasets, training is restricted to small subsets (e.g., 5,000 of 50,000 samples of CIFAR10) and primarily uses mean squared error loss instead of cross-entropy, despite focusing on classification tasks. This may limit the generalizability of the findings and should be communicated more clearly. The paper exclusively examines the effects on training loss, with no analysis of test accuracy under Dom-SGD and Bulk-SGD. Including at least one plot of

Videos

Does SGD really happen in tiny subspaces? · SlidesLive

Taxonomy

Topics: Systemic Sclerosis and Related Diseases

Methods: Sharpness-Aware Minimization · Stochastic Gradient Descent