HessFormer: Hessians at Foundation Scale
Diego Granziol

TL;DR
HessFormer is a software tool for distributed Hessian-vector product computation in large-scale deep learning models, enabling spectral analysis of billion-parameter models across multiple GPUs on a single node.
Contribution
We introduce HessFormer, a scalable software package for distributed Hessian computations, and demonstrate its application on a 70-billion-parameter model.
Findings
Distributed Hessian computation is feasible for models with billions of parameters.
Spectral density analysis reveals insights into large model optimization landscapes.
The package integrates with existing transformer frameworks for ease of use.
Abstract
Whilst there have been major advances in first-order optimisation of deep learning models, where state-of-the-art open-source mixture-of-experts models reach hundreds of billions of parameters, methods that rely on Hessian-vector products are still limited to a single GPU and thus cannot work even for models in the billion-parameter range. We release a software package, HessFormer, which integrates with the well-known Transformers package and allows for distributed Hessian-vector computation across a single node with multiple GPUs. Underpinning our implementation is a distributed stochastic Lanczos quadrature algorithm, which we release publicly. Using this package, we investigate the Hessian spectral density of the recent DeepSeek 70bn-parameter model.
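To make the stochastic Lanczos quadrature idea concrete, here is a minimal single-process NumPy sketch. It is not HessFormer's API or the author's distributed implementation: the function names `lanczos` and `slq_nodes_weights`, the probe and step counts, and the toy matrix are all assumptions chosen for illustration. The only thing the routine needs is a `matvec` callable; in a deep learning setting that callable would be a Hessian-vector product obtained by differentiating the gradient a second time, whereas here it is an explicit symmetric matrix so the result can be checked.

```python
import numpy as np


def lanczos(matvec, dim, num_steps, rng):
    """Lanczos with full reorthogonalisation from a random unit start vector.

    Returns the diagonal (alphas) and off-diagonal (betas) entries of the
    tridiagonal matrix T built from the Krylov subspace.
    """
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    basis = [v]
    alphas, betas = [], []
    for k in range(num_steps):
        w = matvec(basis[-1])
        alphas.append(float(basis[-1] @ w))
        # Remove components along every previous Lanczos vector.
        for q in basis:
            w = w - (q @ w) * q
        beta = float(np.linalg.norm(w))
        # Stop at the last step or when the Krylov subspace is exhausted.
        if k == num_steps - 1 or beta < 1e-10:
            break
        betas.append(beta)
        basis.append(w / beta)
    return np.array(alphas), np.array(betas)


def slq_nodes_weights(matvec, dim, num_steps=30, num_probes=10, seed=0):
    """Approximate the spectral density as a weighted set of point masses.

    Hypothetical helper: each probe contributes the eigenvalues of its
    tridiagonal T as nodes, weighted by the squared first components of
    the corresponding eigenvectors, averaged over probes.
    """
    rng = np.random.default_rng(seed)
    nodes, weights = [], []
    for _ in range(num_probes):
        alphas, betas = lanczos(matvec, dim, num_steps, rng)
        T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
        evals, evecs = np.linalg.eigh(T)
        nodes.append(evals)
        weights.append(evecs[0, :] ** 2 / num_probes)
    return np.concatenate(nodes), np.concatenate(weights)


# Toy check on a symmetric matrix with a known spectrum (an assumption
# standing in for a model Hessian).
rng = np.random.default_rng(1)
dim = 50
true_eigs = np.linspace(0.0, 10.0, dim)
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
A = Q @ np.diag(true_eigs) @ Q.T
A = (A + A.T) / 2  # symmetrise against round-off

nodes, weights = slq_nodes_weights(lambda v: A @ v, dim)
mean_est = float(np.sum(weights * nodes))  # estimate of trace(A) / dim
```

The returned nodes and weights can be smoothed (for example with a narrow Gaussian kernel) to plot a spectral density. The sketch keeps every Lanczos vector in memory for reorthogonalisation, which is only practical at toy scale; how the vectors are stored and sharded across GPUs is exactly the part a distributed implementation has to solve.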
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Generative Adversarial Networks and Image Synthesis
