Hessian Spectral Analysis at Foundation Model Scale
Diego Granziol, Khurshid Juarev

TL;DR
This paper demonstrates that accurate Hessian spectral analysis of large foundation models is feasible at scale, revealing limitations of common approximations and enabling more principled curvature-based analysis.
Contribution
The paper introduces a scalable method for spectral analysis of the true Hessian on models with up to 100B parameters, extending beyond the sub-10B regime that prior work was limited to.
Findings
Hessian spectra can be accurately estimated at large scale.
Block-diagonal curvature approximations can be significantly inaccurate.
Spectral analysis incurs modest overhead over first-order training.
Abstract
Accurate Hessian spectra of foundation models have remained out of reach, leading most prior work to rely on small models or strong structural approximations. We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. Using shard-local finite-difference Hessian-vector products compatible with Fully Sharded Data Parallelism, we perform stochastic Lanczos quadrature on open-source language models with up to 100B parameters, producing the first large-scale spectral density estimates beyond the sub-10B regime. We characterize the numerical behavior of this pipeline, including finite-difference bias, floating-point noise amplification, and their effect on Krylov stability in fp32 and bf16, and derive practical operating regimes that are validated empirically. We further provide end-to-end runtime and memory scaling laws, showing that full-operator spectral…
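The two ingredients named in the abstract, finite-difference Hessian-vector products and stochastic Lanczos quadrature, can be illustrated on a toy problem. The following is a minimal single-process NumPy sketch, not the authors' implementation: it omits the shard-local FSDP machinery entirely. The central-difference form, the full reorthogonalization, the Gaussian smoothing width `sigma`, the step size `eps`, and every function name here are assumptions chosen for illustration. The toy loss is a quadratic whose Hessian is a known matrix `A`, so the estimates can be checked against the true spectrum.

```python
import numpy as np

def fd_hvp(grad_fn, theta, v, eps=1e-3):
    # Central finite-difference Hessian-vector product (assumed form):
    # H v ~ (g(theta + eps v) - g(theta - eps v)) / (2 eps)
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def lanczos(matvec, dim, num_steps, rng):
    # Lanczos tridiagonalization from a random unit probe vector.
    # Returns the Ritz values and their quadrature weights.
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    basis, alphas, betas = [q], [], []
    for j in range(num_steps):
        w = matvec(basis[j])
        alphas.append(basis[j] @ w)
        # Full reorthogonalization (two passes) against all previous vectors.
        Q = np.stack(basis, axis=1)
        w = w - Q @ (Q.T @ w)
        w = w - Q @ (Q.T @ w)
        beta = np.linalg.norm(w)
        if j == num_steps - 1 or beta < 1e-10:
            break
        betas.append(beta)
        basis.append(w / beta)
    T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    ritz, vecs = np.linalg.eigh(T)
    weights = vecs[0, :] ** 2  # squared first components of eigenvectors of T
    return ritz, weights

def slq_density(matvec, dim, grid, num_probes=8, num_steps=30, sigma=0.05, seed=0):
    # Stochastic Lanczos quadrature: average Gaussian-smoothed Ritz spectra
    # over several independent random probes.
    rng = np.random.default_rng(seed)
    density = np.zeros_like(grid)
    norm = sigma * np.sqrt(2.0 * np.pi)
    for _ in range(num_probes):
        ritz, weights = lanczos(matvec, dim, num_steps, rng)
        for lam, wgt in zip(ritz, weights):
            density += wgt * np.exp(-0.5 * ((grid - lam) / sigma) ** 2) / norm
    return density / num_probes

# Toy problem (hypothetical): loss = 0.5 * theta^T A theta, so the Hessian is A.
rng = np.random.default_rng(0)
dim = 50
eigs = np.concatenate([np.linspace(0.1, 2.0, dim - 1), [5.0]])  # bulk plus one outlier
U, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
A = (U * eigs) @ U.T
theta = rng.standard_normal(dim)
grad_fn = lambda th: A @ th
matvec = lambda v: fd_hvp(grad_fn, theta, v)

grid = np.linspace(-1.0, 6.0, 2000)
density = slq_density(matvec, dim, grid)
mass = np.sum(density) * (grid[1] - grid[0])  # should be close to 1

ritz, _ = lanczos(matvec, dim, 30, np.random.default_rng(1))
top_eig_estimate = ritz.max()  # should be close to the outlier eigenvalue 5.0
```

On a quadratic loss the central difference is exact up to rounding, so this toy does not exhibit the finite-difference bias the abstract discusses. On a real model, `eps` would need to balance truncation bias against floating-point noise, which is the operating-regime question the paper studies.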
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices
