Low-Order Explicit Hessian Imitation Method for Large-Scale Supervised Machine Learning
Yunlang Zhu, Lingjun Guo, Zahra Khatti, Xiaoyi Qu, Chia-Yuan Wu, Lara Zebiane, Frank E. Curtis

TL;DR
This paper introduces a novel optimization algorithm for neural network training that uses an auxiliary loss to efficiently approximate second-order derivatives, outperforming Adam in certain scenarios.
Contribution
The paper presents a low-order Hessian imitation method utilizing an auxiliary loss to create efficient second-derivative approximations for large-scale supervised learning.
Findings
The proposed method provides convergence guarantees similar to existing stochastic diagonal-scaling methods.
Numerical experiments show the algorithm can outperform Adam and other optimizers in certain scenarios.
The approach keeps computational cost on par with an Adam-type method while incorporating approximate second-order information.
Abstract
An algorithm is proposed for solving optimization problems arising in neural network training for supervised learning. The unique feature of the algorithm is the use of an auxiliary loss, in addition to the original loss employed for model training. The purpose of the auxiliary loss is to provide a mechanism for creating a low-order Hessian-type approximation for the original loss. The proposed algorithm employs the resulting low-order second-derivative approximation terms in place of the second-order momentum terms (i.e., squared elements of the gradient of the loss function) in an overall scheme that has computational cost on par with an Adam-type approach. Whereas the squared elements of a gradient vector do not necessarily approximate second-order derivatives well, by careful construction of the auxiliary loss, second-order derivative-type approximations for the original loss can be…
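The gap the abstract points to, between squared gradient elements and actual second derivatives, can be seen on a toy quadratic. The sketch below is illustrative only and is not the paper's algorithm: the function names (`grad_quadratic`, `new_state`, `adam_type_step`) are hypothetical, the exact Hessian diagonal stands in for whatever approximation the auxiliary loss would supply (the summary above does not describe that construction), and keeping Adam's square root on the second accumulator is an assumption carried over from Adam.

```python
import math


def grad_quadratic(h, x):
    """Gradient of f(x) = 0.5 * sum(h_i * x_i^2); its Hessian diagonal is h."""
    return [hi * xi for hi, xi in zip(h, x)]


def new_state(n):
    """Fresh accumulators for an Adam-type method (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_type_step(x, grad, scale, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-type update in which `scale` feeds the second accumulator.

    scale = [g * g for g in grad] reproduces the usual Adam step.
    Any other nonnegative per-coordinate estimate gives a generic
    diagonal-scaling step. ASSUMPTION: the square root below is kept from
    Adam; the source does not say how the paper uses its approximation.
    """
    state["t"] += 1
    t = state["t"]
    new_x = []
    for i, (xi, gi, si) in enumerate(zip(x, grad, scale)):
        # Exponential moving averages of the gradient and of the scaling term.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * gi
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * si
        # Bias correction, as in Adam.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_x.append(xi - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_x


h = [100.0, 1.0]   # curvature differs by a factor of 100 across coordinates
x = [0.01, 1.0]
g = grad_quadratic(h, x)
squared_grad = [gi * gi for gi in g]

# Adam-style scaling (squared gradients) versus curvature-based scaling.
x_adam = adam_type_step(x, g, squared_grad, new_state(2))
x_curv = adam_type_step(x, g, h, new_state(2))
```

At this point both coordinates have the same gradient, so the squared-gradient terms are identical even though the curvature differs by a factor of 100. The squared-gradient step therefore moves both coordinates by the same amount, while the curvature-scaled step takes a smaller step along the high-curvature coordinate.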
