Gathering and Exploiting Higher-Order Information when Training Large Structured Models

Pierre Wolinski

arXiv:2312.03885·cs.LG·October 1, 2025·1 cites

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces a method for explicitly computing higher-order derivative projections on parameter subspaces in large models, enabling advanced optimization and regularization techniques that consider long-range interactions.

Contribution

It presents a novel approach for exact higher-order derivative computation on parameter partitions, facilitating improved optimization and hyperparameter tuning in large neural networks.

Findings

01. Enables computation of higher-order derivatives on parameter subsets at reasonable cost.

02. Allows for per-subset learning rate adaptation for hyperparameter tuning.

03. Incorporates long-range layer interactions in optimization, improving training dynamics.

Abstract

When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. Therefore, among the second-order optimization methods, it is common to bypass the computation of the Hessian by using first-order information, such as the gradient of the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces relevant for optimization. Namely, for a given partition of the set of parameters, we compute tensors that can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets of the partition remains small. Then, we give some examples of how these tensors can be used. First, we show how to…
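As an illustration of what a second-order "derivative according to the partition" might look like, the sketch below contracts every Hessian block with a direction u, which gives an S×S matrix for a partition into S subsets. It is a toy reading of the abstract, not the author's implementation: the quadratic loss, the matrix `A`, the choice of u as the gradient, and all function names are assumptions made for the example.

```python
# Toy sketch (assumed setup, not the paper's code).
# Loss: f(theta) = 0.5 * theta^T A theta, so gradient = A theta and Hessian = A.

def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def reduced_derivatives(A, theta, partition):
    """Project gradient and Hessian onto a partition of the parameter indices.

    g_bar[s]    = sum over i in I_s of g_i * u_i
    H_bar[s][t] = sum over i in I_s, j in I_t of u_i * A[i][j] * u_j
    The direction u is taken to be the gradient g (an assumption).
    """
    g = matvec(A, theta)
    u = g
    g_bar = [sum(g[i] * u[i] for i in I_s) for I_s in partition]
    H_bar = [
        [sum(u[i] * A[i][j] * u[j] for i in I_s for j in I_t) for I_t in partition]
        for I_s in partition
    ]
    return g_bar, H_bar

# Hypothetical 3-parameter model split into two subsets (e.g. two "layers").
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]
theta = [1.0, -1.0, 2.0]
partition = [[0, 1], [2]]

g_bar, H_bar = reduced_derivatives(A, theta, partition)
print("reduced gradient:", g_bar)
print("reduced Hessian: ", H_bar)
```

The off-diagonal entries of `H_bar` carry the interaction between subsets, and the matrix has only S×S entries regardless of the number of parameters, which is why the cost stays manageable while S is small.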

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper is in general well motivated by the need to capture the interactions between parameters in different layers, which are often ignored by block-diagonal methods. This is operationalized in a natural way by studying the Hessian with suitable contractions.
- The approach of getting layerwise learning rates through their layerwise grouping is neat. This offers a principled extension to Cauchy's steepest descent rule.
- The method could also find utility in studying the behaviour

Weaknesses

- The experimental section is quite weak. I understand that the authors themselves pitch it as a proof of concept, but I am not sure it can even be called that. The experiments are on small datasets like CIFAR, and even there none of the methods reach the typical accuracy of 90% and above; the test accuracy of their method is much worse than that of K-FAC.
- More fundamentally, it is unclear to me where the bigger problem lies: correcting the curvature across layers, or that withi

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The authors present a very interesting idea and motivate it thoroughly. It really is a big shortcoming that many related papers on this topic neglect inter-layer interactions when designing optimization algorithms. And finding computationally tractable ways to "summarize" the Hessian is a very strong idea. The discussion of related work is quite comprehensive and clear. Generally the work feels quite thoughtful: considering what are the issues with Newton's method, and how to try to ov

Weaknesses

I think the method could potentially be introduced in a more straightforward way, and I want to suggest one way. I think the method could be viewed as a change of variables to a smaller set of local optimization variables. In particular, instead of viewing the loss as a function of general perturbations to all the weight tensors: loss(W_1 + ∆W_1, W_2 + ∆W_2, ..., W_S + ∆W_S), we can view the loss as a function of scalar-parameterized perturbations to each layer: loss( W_1 - η_1 * G_1 , W_2
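The change of variables suggested in this excerpt can be tried on a toy problem. The sketch below defines φ(η) as the loss after each subset takes its own step η_s along its gradient block, estimates the gradient and Hessian of φ at η = 0 by central finite differences, and takes one Newton step in η. Everything here is an assumption made for illustration: the quadratic loss, the matrix `A`, the two-subset partition, the step `h`, and the function names are hypothetical, and the finite differences stand in for the exact differentiation that the paper describes.

```python
# Toy sketch of the scalar-per-subset reparameterization (assumed setup).
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]
N = len(A)

def loss(theta):
    """Hypothetical quadratic loss 0.5 * theta^T A theta."""
    return 0.5 * sum(theta[i] * A[i][j] * theta[j] for i in range(N) for j in range(N))

def grad(theta):
    """Gradient of the quadratic loss: A theta."""
    return [sum(A[i][j] * theta[j] for j in range(N)) for i in range(N)]

def make_phi(theta, g, partition):
    """phi(eta) = loss after subset s moves by -eta[s] * (its gradient block)."""
    def phi(eta):
        new = list(theta)
        for s, I_s in enumerate(partition):
            for i in I_s:
                new[i] = theta[i] - eta[s] * g[i]
        return loss(new)
    return phi

def fd_grad_hess(phi, S, h=1e-2):
    """Central finite-difference gradient and Hessian of phi at eta = 0."""
    def at(s, a, t=None, b=0.0):
        eta = [0.0] * S
        eta[s] += a
        if t is not None:
            eta[t] += b
        return phi(eta)
    d1 = [(at(s, h) - at(s, -h)) / (2 * h) for s in range(S)]
    d2 = [[(at(s, h, t, h) - at(s, h, t, -h) - at(s, -h, t, h) + at(s, -h, t, -h))
           / (4 * h * h) for t in range(S)] for s in range(S)]
    return d1, d2

theta = [1.0, -1.0, 2.0]
partition = [[0, 1], [2]]
g = grad(theta)
phi = make_phi(theta, g, partition)
d1, d2 = fd_grad_hess(phi, S=2)

# Newton step in the 2-dimensional eta space: eta = -d2^{-1} d1 (2x2 inverse by hand).
det = d2[0][0] * d2[1][1] - d2[0][1] * d2[1][0]
eta = [-(d2[1][1] * d1[0] - d2[0][1] * d1[1]) / det,
       -(d2[0][0] * d1[1] - d2[1][0] * d1[0]) / det]

# Baseline: a single shared learning rate (Cauchy's steepest descent rule).
Ag = [sum(A[i][j] * g[j] for j in range(N)) for i in range(N)]
eta_c = sum(x * x for x in g) / sum(x * y for x, y in zip(g, Ag))

print("per-subset learning rates:", eta)
print("loss before:", phi([0.0, 0.0]))
print("loss, per-subset rates:", phi(eta))
print("loss, single Cauchy rate:", phi([eta_c, eta_c]))
```

Because the toy loss is quadratic, φ is exactly quadratic in η, so the Newton step lands on the best point reachable with one learning rate per subset; the single-rate step is the special case η_1 = η_2 and cannot do better.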

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The paper has several strengths: 1) The interest in efficient computational schemes of higher order information for deep neural networks is significant, and improved methods of estimating the Hessian, as well as higher order terms could help improve interpretability and shed light on what neural networks learn during optimization. 2) The partitioning scheme is the main contribution of this paper, and seems to be novel as far as I know, with possible applications to many interesting avenues.

Weaknesses

In spite of the strengths, the paper has some clear drawbacks in my opinion: 1) The paper is not written well enough, with a substantial lack of a literature survey on properties of Hessians in deep networks, as well as works on second order methods from recent years. In terms of the writing itself, all of the equations on page 5 are unnumbered, making them hard to refer to, and the chosen notation for tensor contraction $A[u, u, \ldots, u]$ is not standard and never explained. It must be understood

Code & Models

Repositories

p-wol/GroupedNewton
PyTorch · Official

Taxonomy

Topics: Tensor decomposition and applications · Model Reduction and Neural Networks · Computational Physics and Python Applications