Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Andy Zeyi Liu, Elliot Paquette, John Sous

TL;DR
This paper introduces spectral diagnostics based on activation and gradient spectra to analyze internal representations and learning dynamics in language model training, revealing key factors like batch size effects and early predictors of efficiency.
Contribution
It presents a novel spectral measurement protocol using activation covariance and gradient SVD spectra to diagnose and understand LLM training mechanics.
Findings
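Batch size shapes representation geometry: runs that reach equal loss settle into systematically distinct activation spectra.
The tail of the activation covariance spectrum, measured early in training, forecasts downstream token efficiency.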
Spectral changes distinguish learning improvements from execution gains.
Abstract
Training loss and throughput can hide distinct internal representations in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics…
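The two measurements named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' protocol: the function names, array shapes, and the choice of a single linear layer are assumptions made for illustration. It treats the activation spectrum as the eigenvalues of the sample covariance of activations, and the per-sample gradient spectrum as the singular values of one sample's weight gradient for a linear layer.

```python
import numpy as np


def activation_covariance_spectrum(acts):
    """Eigenvalues, sorted descending, of the sample covariance of activations.

    acts: array of shape (n_samples, d). Hypothetical layout; the paper may
    collect activations differently.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (acts.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)  # symmetric input, ascending order
    # Reverse to descending and clip tiny negative values from round-off.
    return np.clip(eigvals[::-1], 0.0, None)


def per_sample_gradient_spectrum(inputs, deltas):
    """Singular values of one sample's weight gradient for a linear layer.

    Assumes y = x @ W.T, so the gradient with respect to W for one sequence
    is deltas.T @ inputs, where inputs has shape (tokens, d_in) and deltas
    (the loss gradient with respect to y) has shape (tokens, d_out).
    """
    grad_w = deltas.T @ inputs  # shape (d_out, d_in)
    return np.linalg.svd(grad_w, compute_uv=False)  # descending order


# Synthetic data only, to show the call pattern.
rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 32))
act_spectrum = activation_covariance_spectrum(acts)

inputs = rng.standard_normal((16, 32))
deltas = rng.standard_normal((16, 24))
grad_spectrum = per_sample_gradient_spectrum(inputs, deltas)
```

In this sketch, the head of `act_spectrum` holds the leading modes and the tail holds the smallest eigenvalues. How the paper summarizes either end of the spectrum is not specified in the text above.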
