Distributed Hessian-Free Optimization for Deep Neural Network
Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy, Martin Takáč

TL;DR
This paper introduces a distributed Hessian-free optimization method for deep neural networks that efficiently utilizes large-scale computing resources, explores negative curvature directions, and accelerates training, especially in large batch scenarios.
Contribution
It develops a distributed Hessian-free optimization algorithm that explores negative curvature directions, enabling faster training and better scaling than SGD.
Findings
Achieves near-linear speed-up on 16 CPU nodes.
Demonstrates faster training on MNIST and TIMIT datasets.
Enables large batch training with robust performance.
Abstract
Training a deep neural network is a high-dimensional and highly non-convex optimization problem. The stochastic gradient descent (SGD) algorithm and its variants are the current state-of-the-art solvers for this task. However, due to the non-convex nature of the problem, SGD has been observed to slow down near saddle points. Recent empirical work claims that detecting and escaping saddle points efficiently is likely to improve training performance. With this objective, we revisit the Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential to SGD, which allows larger computing resources to be utilized more efficiently, thus enabling large models and a faster time to the desired solution. Furthermore, unlike the truncated Newton method (Martens' HF) that ignores negative curvature information by using naïve…
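To make the idea of "Hessian-free" optimization with negative curvature detection concrete, here is a minimal sketch in plain Python. It is not the paper's algorithm: it assumes a toy two-parameter objective with a saddle at the origin, approximates Hessian-vector products by central finite differences of the gradient (so the Hessian is never formed), and uses a textbook conjugate gradient loop that stops as soon as it meets a direction of non-positive curvature. The names `grad`, `hvp`, `dot`, and `cg_with_curvature_check` are hypothetical and chosen only for illustration.

```python
def grad(w):
    # Gradient of the toy objective f(w) = 0.5*w0^2 - 0.5*w1^2 + 0.25*w1^4,
    # which has a saddle point at the origin (assumed example, not from the paper).
    return [w[0], -w[1] + w[1] ** 3]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hvp(grad_fn, w, v, eps=1e-5):
    # Hessian-vector product without forming the Hessian:
    # H v ~= (g(w + eps*v) - g(w - eps*v)) / (2*eps)
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, g_minus = grad_fn(w_plus), grad_fn(w_minus)
    return [(a - b) / (2 * eps) for a, b in zip(g_plus, g_minus)]


def cg_with_curvature_check(grad_fn, w, max_iter=50, tol=1e-10):
    """Approximately solve H p = -g by conjugate gradient.

    Returns (p, neg_dir). neg_dir is None if no non-positive curvature
    was met; otherwise it is a direction d with d^T H d <= 0.
    """
    g = grad_fn(w)
    p = [0.0] * len(w)
    r = [-gi for gi in g]  # residual of H p = -g at p = 0
    d = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr < tol:
            break
        Hd = hvp(grad_fn, w, d)
        curvature = dot(d, Hd)
        if curvature <= 0:
            # Negative (or zero) curvature: hand the direction back to the caller
            # instead of discarding it.
            return p, d
        alpha = rr / curvature
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return p, None
```

Calling `cg_with_curvature_check(grad, [1.0, 2.0])` lands in a region where the Hessian is positive definite, so the loop returns an approximate Newton step and `None`. Calling it at `[1.0, 0.1]`, close to the saddle, returns a direction along which the curvature is non-positive, which an optimizer could follow to move away from the saddle.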
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications
Methods: Stochastic Gradient Descent
