SGD with Partial Hessian for Deep Neural Networks Optimization
Ying Sun, Hongwei Yong, Lei Zhang

TL;DR
This paper introduces SGD with Partial Hessian (SGD-PH), a novel optimizer combining second-order channel-wise Hessian information with first-order SGD to improve deep neural network training stability and performance.
Contribution
The paper proposes a compound optimizer, SGD-PH, that updates channel-wise parameters with a second-order step based on a precise partial Hessian matrix and updates all other parameters with first-order SGD.
Findings
SGD-PH outperforms traditional optimizers on image classification tasks.
Partial Hessian information improves convergence stability.
The method maintains good generalization performance.
Abstract
Due to the effectiveness of second-order algorithms in solving classical optimization problems, designing second-order optimizers to train deep neural networks (DNNs) has attracted much research interest in recent years. However, because of the very high dimension of intermediate features in DNNs, it is difficult to directly compute and store the Hessian matrix for network optimization. Most of the previous second-order methods approximate the Hessian information imprecisely, resulting in unstable performance. In this work, we propose a compound optimizer, which is a combination of a second-order optimizer with a precise partial Hessian matrix for updating channel-wise parameters and the first-order stochastic gradient descent (SGD) optimizer for updating the other parameters. We show that the associated Hessian matrices of channel-wise parameters are diagonal and can be extracted…
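The compound idea in the abstract (a second-order step for channel-wise parameters, plain SGD for everything else) can be illustrated on a toy problem. The sketch below is not the authors' implementation: the toy loss, the function names (`loss_and_grads`, `diag_hessian_gamma`, `compound_step`), the step sizes, and the curvature floor are all assumptions made for illustration. It relies only on the abstract's statement that the Hessian of channel-wise parameters is diagonal. For a diagonal Hessian, a Hessian-vector product with the all-ones vector returns the diagonal itself, so the sketch reads it off from a central difference of gradients. Whether the paper extracts the diagonal this way is not stated in the visible part of the abstract.

```python
# Toy sketch (assumed setup, not the paper's code): per-channel scales `gamma`
# get a diagonal-Hessian step; the shared weight `w` gets a plain SGD step.

def loss_and_grads(gamma, w, xs, ys):
    """Loss 0.5 * sum (gamma[c] * w * x - y)^2 with gradients w.r.t. gamma and w."""
    loss, g_gamma, g_w = 0.0, [0.0] * len(gamma), 0.0
    for x_row, y_row in zip(xs, ys):
        for c, (x, y) in enumerate(zip(x_row, y_row)):
            r = gamma[c] * w * x - y          # residual for sample, channel c
            loss += 0.5 * r * r
            g_gamma[c] += r * w * x           # d loss / d gamma[c]
            g_w += r * gamma[c] * x           # d loss / d w
    return loss, g_gamma, g_w


def diag_hessian_gamma(gamma, w, xs, ys, eps=1e-4):
    """Hessian-vector product with the all-ones vector via central differences.

    Channels do not interact in this loss, so the Hessian w.r.t. gamma is
    diagonal and H @ ones equals diag(H).
    """
    _, g_plus, _ = loss_and_grads([g + eps for g in gamma], w, xs, ys)
    _, g_minus, _ = loss_and_grads([g - eps for g in gamma], w, xs, ys)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]


def compound_step(gamma, w, xs, ys, lr_ph=1.0, lr_sgd=0.01, floor=1e-8):
    """One compound update. lr_ph, lr_sgd and floor are illustrative values."""
    _, g_gamma, g_w = loss_and_grads(gamma, w, xs, ys)
    h = diag_hessian_gamma(gamma, w, xs, ys)
    # Second-order step for channel-wise parameters (curvature floored at `floor`).
    new_gamma = [g - lr_ph * gg / max(abs(hh), floor)
                 for g, gg, hh in zip(gamma, g_gamma, h)]
    # First-order SGD step for the remaining parameter.
    new_w = w - lr_sgd * g_w
    return new_gamma, new_w


# Hypothetical data: 3 samples, 2 channels, targets y = k[c] * x.
xs = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.5]]
k = [3.0, -1.5]
ys = [[k[c] * x for c, x in enumerate(row)] for row in xs]

gamma, w = [1.0, 1.0], 1.0
before, _, _ = loss_and_grads(gamma, w, xs, ys)
gamma, w = compound_step(gamma, w, xs, ys)
after, _, _ = loss_and_grads(gamma, w, xs, ys)
print(before, after)
```

In this toy setting the loss is exactly quadratic in each `gamma[c]`, so a single curvature-scaled step nearly solves for the channel-wise parameters, while `w` moves only slightly under SGD. Real networks are not quadratic, so this behavior should not be read as a claim about SGD-PH itself.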
Taxonomy
Topics: Stochastic Gradient Optimization Techniques
Methods: Stochastic Gradient Descent
