When do spectral gradient updates help in deep learning?
Damek Davis, Dmitriy Drusvyatskiy

TL;DR
This paper introduces a layerwise condition predicting when spectral gradient updates outperform Euclidean ones in deep learning, supported by theoretical analysis and experiments on neural networks and language models.
Contribution
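It proposes a simple layerwise criterion, comparing the squared nuclear-to-Frobenius ratio of each parameter block's gradient with the stable rank of the incoming activations, to identify regimes where spectral methods are advantageous, with theoretical proofs and empirical validation.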
Findings
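Spectral updates are predicted to outperform Euclidean updates when the gradient's squared nuclear-to-Frobenius ratio is large relative to the stable rank of the incoming activations.
Post-activation matrices provably have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks.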
In trained models, activation stable rank remains low, favoring spectral methods.
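For reference, the two ratios named in the findings can be written with their standard definitions. The symbols G (a layer's gradient), A (the matrix of incoming activations), nr, and st are notation chosen here for illustration, and the inequality is a sketch of the criterion as summarized on this page, not the paper's exact statement.

```latex
\[
\mathrm{nr}(G) = \frac{\|G\|_*^2}{\|G\|_F^2}, \qquad
\mathrm{st}(A) = \frac{\|A\|_F^2}{\|A\|_{\mathrm{op}}^2}, \qquad
\text{spectral update favored when } \mathrm{nr}(G) \gtrsim \mathrm{st}(A).
\]
```

Both quantities lie between 1 and the rank of the matrix, so the comparison is between two "effective rank" measures: one for the gradient and one for the activations.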
Abstract
Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient's…
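The layerwise comparison described in the abstract can be sketched numerically. The following is a minimal illustration using NumPy, not the authors' code: the function names `nuclear_rank`, `stable_rank`, and `spectral_update_favored` are hypothetical, and the strict inequality used for the decision is an assumption about how the comparison would be applied.

```python
import numpy as np


def nuclear_rank(G):
    """Squared nuclear-to-Frobenius ratio of a nonzero matrix G."""
    s = np.linalg.svd(G, compute_uv=False)  # singular values, descending
    return float(np.sum(s) ** 2 / np.sum(s ** 2))


def stable_rank(A):
    """Squared Frobenius-to-operator norm ratio of a nonzero matrix A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)


def spectral_update_favored(G, A):
    """Assumed decision rule: gradient nuclear rank exceeds activation stable rank."""
    return bool(nuclear_rank(G) > stable_rank(A))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical layer: post-ReLU activations have a large mean component,
    # which tends to make their stable rank small.
    A = np.maximum(rng.standard_normal((64, 256)), 0.0)
    # Hypothetical gradient block with no dominant singular direction.
    G = rng.standard_normal((32, 64))
    print("nuclear rank of G:", nuclear_rank(G))
    print("stable rank of A:", stable_rank(A))
    print("spectral update favored:", spectral_update_favored(G, A))
```

In practice the check would be run per parameter block, with G taken from backpropagation and A from the activations feeding that block; the random matrices here only stand in for those.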
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Muon and positron interactions and applications · Particle physics theoretical and experimental studies
