Gathering and Exploiting Higher-Order Information when Training Large Structured Models
Pierre Wolinski

TL;DR
This paper introduces a method for explicitly computing higher-order derivative projections on parameter subspaces in large models, enabling advanced optimization and regularization techniques that consider long-range interactions.
Contribution
It presents a novel approach for exact higher-order derivative computation on parameter partitions, facilitating improved optimization and hyperparameter tuning in large neural networks.
Findings
Enables computation of higher-order derivatives on parameter subsets at reasonable cost.
Allows for per-subset learning rate adaptation for hyperparameter tuning.
Incorporates long-range layer interactions in optimization, improving training dynamics.
Abstract
When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. Therefore, among the second-order optimization methods, it is common to bypass the computation of the Hessian by using first-order information, such as the gradient of the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces relevant for optimization. Namely, for a given partition of the set of parameters, we compute tensors that can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets of the partition remains small. Then, we give some examples of how these tensors can be used. First, we show how to…
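One way to read "higher-order derivatives according to the partition" is sketched below in JAX. This is a minimal illustration, not the paper's implementation. It assumes one scalar step `eta[s]` per parameter subset and one fixed direction per subset (here the gradient, which is an assumption), then differentiates the loss with respect to `eta`. The toy model and the names `loss`, `reduced_loss`, and `partition_derivatives` are hypothetical.

```python
import jax
import jax.numpy as jnp

# Toy two-layer model; the assumed partition is simply (layer 1, layer 2).
def loss(params, x, y):
    w1, w2 = params
    hidden = jnp.tanh(x @ w1)
    pred = hidden @ w2
    return jnp.mean((pred - y) ** 2)

def partition_derivatives(params, directions, x, y):
    """Derivatives of order 1, 2 and 3 of the loss w.r.t. one scalar per subset."""
    num_subsets = len(params)

    def reduced_loss(eta):
        # Move each subset s along its own direction, scaled by eta[s].
        shifted = [params[s] + eta[s] * directions[s] for s in range(num_subsets)]
        return loss(shifted, x, y)

    eta0 = jnp.zeros(num_subsets)
    g_bar = jax.grad(reduced_loss)(eta0)                  # shape (S,)
    h_bar = jax.hessian(reduced_loss)(eta0)               # shape (S, S)
    t_bar = jax.jacfwd(jax.hessian(reduced_loss))(eta0)   # shape (S, S, S)
    return g_bar, h_bar, t_bar

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = [0.5 * jax.random.normal(k1, (4, 8)), 0.5 * jax.random.normal(k2, (8, 1))]
x = jax.random.normal(k3, (16, 4))
y = jax.random.normal(k4, (16, 1))

# Assumption: use the per-layer gradient as the direction for each subset.
grads = jax.grad(loss)(params, x, y)
g_bar, h_bar, t_bar = partition_derivatives(params, grads, x, y)
```

The resulting tensors have S, S², and S³ entries for S subsets, which is why the cost stays manageable only while the partition is small. The off-diagonal entries of `h_bar` carry the interactions between different subsets.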
Peer Reviews
Decision·Submitted to ICLR 2025
- The paper is in general well motivated by the need to capture the interactions between parameters in different layers, which are often ignored by block-diagonal methods. This is operationalized in a natural way by studying the Hessian with suitable contractions.
- The approach of getting layerwise learning rates through their layerwise grouping is neat. This offers a principled extension of Cauchy's steepest descent rule.
- The method could also find utility in studying the behaviour
- The experimental section is quite weak. I understand that the authors themselves pitch it as a proof of concept, but I am not sure it can even be called that. The experiments are on small datasets like CIFAR, and even there none of the methods reaches the typical accuracy of 90% and above; the test accuracy of their method is much worse than that of K-FAC.
- More fundamentally, it is unclear to me where the bigger problem lies: correcting the curvature across layers, or that withi
The authors present a very interesting idea and motivate it thoroughly. It really is a big shortcoming that many related papers on this topic neglect inter-layer interactions when designing optimization algorithms. And finding computationally tractable ways to "summarize" the Hessian is a very strong idea. The discussion of related work is quite comprehensive and clear. Generally the work feels quite thoughtful: considering what are the issues with Newton's method, and how to try to ov
I think the method could potentially be introduced in a more straightforward way, and I want to suggest one way. I think the method could be viewed as a change of variables to a smaller set of local optimization variables. In particular, instead of viewing the loss as a function of general perturbations to all the weight tensors: loss( W_1 + ∆W_1, W_2 + ∆W_2, ..., W_S + ∆W_S) we can view the loss as a function of scalar-parameterized perturbations to each layer: loss( W_1 - η_1 * G_1 , W_2
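The change of variables suggested in this review can be expanded to second order as follows. This is a sketch under assumed notation, not a formula taken from the paper: G_s denotes the vectorized gradient of layer s, H_{st} the Hessian block between layers s and t, and the bars mark the reduced S-dimensional quantities.

```latex
% Assumed notation: G_s = vectorized gradient of layer s,
% H_{st} = Hessian block between layers s and t.
L(\eta) = \mathrm{loss}(W_1 - \eta_1 G_1, \dots, W_S - \eta_S G_S)
        = \mathrm{loss}(W) - \sum_{s=1}^{S} \eta_s \bar{g}_s
          + \frac{1}{2} \sum_{s,t=1}^{S} \eta_s \eta_t \bar{H}_{st}
          + O(\|\eta\|^3),
\qquad
\bar{g}_s = \|G_s\|^2, \quad \bar{H}_{st} = G_s^\top H_{st} G_t .

% If \bar{H} is positive definite, the quadratic model is minimized at
\eta^\star = \bar{H}^{-1} \bar{g} .
```

For S = 1 this reduces to the classical Cauchy step size ‖G‖² / (GᵀHG). The expansion covers only the quadratic model; any regularization or higher-order correction the paper applies on top of it is not reproduced here.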
The paper has several strengths:
1) The interest in efficient computational schemes of higher order information for deep neural networks is significant, and improved methods of estimating the Hessian, as well as higher order terms, could help improve interpretability and shed light on what neural networks learn during optimization.
2) The partitioning scheme is the main contribution of this paper, and seems to be novel as far as I know, with possible applications to many interesting avenues.
In spite of the strengths, the paper has some clear drawbacks in my opinion:
1) The paper is not written well enough, with a substantial lack of a literature survey on properties of Hessians in deep networks, as well as works on second order methods from recent years. In terms of the writing itself, all of the equations on page 5 are unnumbered, making them hard to refer to, and the chosen notation for tensor contraction $A[ u, u... u]$ is not standard and never explained. It must be understood
Taxonomy
Topics: Tensor decomposition and applications · Model Reduction and Neural Networks · Computational Physics and Python Applications
