Approaching Deep Learning through the Spectral Dynamics of Weights
David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Savarese, Gal Vardi, Karen Livescu, Michael Maire, Matthew R. Walter

TL;DR
This paper introduces an empirical spectral dynamics framework to analyze weight singular values and vectors during training, revealing biases, regularization effects, and differences between memorization and generalization across various deep learning models.
Contribution
It provides a unified spectral perspective on deep learning phenomena, highlighting the role of weight dynamics in optimization, regularization, and network generalization.
Findings
Spectral bias is consistent across diverse tasks and models.
Weight decay influences spectral bias beyond regularization.
Spectral dynamics differentiate memorizing and generalizing networks.
Abstract
We propose an empirical approach centered on the spectral dynamics of weights -- the behavior of singular values and vectors during optimization -- to unify and clarify several phenomena in deep learning. We identify a consistent bias in optimization across various experiments, from small-scale "grokking" to large-scale tasks like image classification with ConvNets, image generation with UNets, speech recognition with LSTMs, and language modeling with Transformers. We also demonstrate that weight decay enhances this bias beyond its role as a norm regularizer, even in practical systems. Moreover, we show that these spectral dynamics distinguish memorizing networks from generalizing ones, offering a novel perspective on this longstanding conundrum. Additionally, we leverage spectral dynamics to explore the emergence of well-performing sparse subnetworks (lottery tickets) and the…
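As a rough illustration of what tracking "spectral dynamics" can look like in practice, the sketch below computes the singular values of saved weight tensors and summarizes each spectrum with an entropy-based effective rank. This is a minimal sketch under stated assumptions, not the authors' code: the function names (`singular_values`, `effective_rank`, `track_spectrum`), the choice to flatten higher-dimensional weights to a matrix of shape (out_dim, -1), and the particular effective-rank definition are all illustrative choices that the summary above does not specify.

```python
import numpy as np


def singular_values(weight):
    """Singular values of a weight tensor.

    Assumption: tensors with more than two axes (e.g. conv kernels)
    are flattened to shape (out_dim, -1) before the SVD.
    """
    matrix = np.asarray(weight, dtype=np.float64)
    matrix = matrix.reshape(matrix.shape[0], -1)
    return np.linalg.svd(matrix, compute_uv=False)


def effective_rank(sigma, eps=1e-12):
    """Entropy-based effective rank of a spectrum.

    Normalizes the singular values to a probability vector and
    returns exp(Shannon entropy). Equal singular values give the
    full rank; a single dominant value gives a number close to 1.
    """
    sigma = np.asarray(sigma, dtype=np.float64)
    p = sigma / (sigma.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))


def track_spectrum(checkpoints):
    """Summarize (step, weight) pairs as (step, sigma, effective rank)."""
    history = []
    for step, weight in checkpoints:
        sigma = singular_values(weight)
        history.append((step, sigma, effective_rank(sigma)))
    return history


if __name__ == "__main__":
    # Hypothetical checkpoints: a dense random matrix, then a rank-2 one.
    rng = np.random.default_rng(0)
    dense = rng.standard_normal((16, 16))
    low_rank = rng.standard_normal((16, 2)) @ rng.standard_normal((2, 16))
    for step, _, erank in track_spectrum([(0, dense), (1000, low_rank)]):
        print(step, round(erank, 2))
```

Applied to checkpoints saved during training, a falling effective rank would be one way to observe a low-rank bias of the kind the abstract describes; the singular vectors (available by dropping `compute_uv=False`) would be needed for any analysis of how they evolve.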
Peer Reviews
Decision: Submitted to ICLR 2025
1) This paper is clearly written and the idea is easy to follow. 2) The authors connect simple settings (grokking) to complex ones (LSTMs, Transformers), and explore tasks from different domains, including image, speech, and language processing. This suggests that rank minimization is a common property that can help people understand deep neural networks.
As this paper is mainly empirical, it would be better to see results on more tasks in each domain to give more evidence for the observed properties, as one task per domain might not be enough to claim that they are general phenomena.
- Spectral analysis of the weights is an informative approach that has yielded several theoretical insights in linear networks and informed empirical algorithms, particularly in recent approaches for low-rank adaptation. In the past, such analyses have been done in isolation on simple setups. So this paper's attempt to extend the scope of the insights into something more general is novel. - The comprehensive experiments touch upon a wide array of observed phenomena in deep learning and generally
I have some critiques concerning the writing, particularly the lack of any solid insights. I elaborate below. - The paper seems to be largely an exercise in plotting the singular value dynamics of several neural networks. While this is interesting in itself, I found that the paper did not extract any conclusive or striking insights from the empirical study, making it difficult to judge the significance of the work. For example, Figure 2 points out that grokking seems to be correlated with drops
The paper introduces several situations in deep learning and explains them through the spectral dynamics of weights. The advantages are: 1) The paper establishes a link between "grokking" and rank minimization. 2) The paper provides different examples of deep learning architectures, including CNN, UNet, LSTM, and Transformer.
1) There is no clear description of the influence of spectral dynamics on deep learning. For example, at line 421, weight decay can produce low-rank behavior, but how much weight decay is needed, and is weight decay related to any performance metric? Figure 6 shows the effect of adding weight decay. However, a metric for the level of weight decay is lacking. The statement "The exact choice of “too much” varies across architectures and tasks." does not explain everything. 2) The theoretic
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
Methods: Weight Decay
