Distributed Hessian-Free Optimization for Deep Neural Network

Xi He; Dheevatsa Mudigere; Mikhail Smelyanskiy; Martin Takáč

arXiv:1606.00511·cs.LG·January 17, 2017·1 cites

Open Access

TL;DR

This paper introduces a distributed Hessian-free optimization method for deep neural networks that efficiently utilizes large-scale computing resources, explores negative curvature directions, and accelerates training, especially in large batch scenarios.

Contribution

It develops a distributed Hessian-free optimization algorithm that explores negative curvature directions, with the aim of faster training and better scaling than SGD.
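The core idea of a Hessian-free step with negative curvature detection can be sketched roughly as follows. This is a minimal single-process illustration, not the paper's distributed algorithm: it assumes a user-supplied gradient function, approximates Hessian-vector products by central finite differences, and uses hypothetical names (`hvp`, `cg_with_negative_curvature`) that do not come from the source.

```python
import numpy as np


def hvp(grad_fn, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H(w) @ v without forming H.

    Assumption: a central finite difference of the gradient is accurate
    enough for illustration; an autodiff framework would normally be used.
    """
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)


def cg_with_negative_curvature(grad_fn, w, max_iter=50, tol=1e-8):
    """Run conjugate gradient on H p = -g using only Hessian-vector products.

    Returns (direction, found_negative_curvature). If a search direction d
    with d @ H @ d <= 0 is met, CG stops and returns d (oriented so that it
    is not an ascent direction) instead of discarding that information.
    """
    g = grad_fn(w)
    p = np.zeros_like(w)
    r = -g.copy()          # residual of H p = -g at p = 0
    d = r.copy()           # first search direction
    rr = r @ r
    for _ in range(max_iter):
        if rr < tol:       # squared residual norm small enough
            break
        Hd = hvp(grad_fn, w, d)
        curvature = d @ Hd
        if curvature <= 0.0:
            # Non-positive curvature along d: hand it back to the caller.
            if g @ d > 0.0:
                d = -d
            return d, True
        alpha = rr / curvature
        p = p + alpha * d
        r = r - alpha * Hd
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return p, False


if __name__ == "__main__":
    # Toy saddle f(w) = 0.5 * (w0^2 - w1^2), whose gradient is [w0, -w1].
    saddle_grad = lambda w: np.array([w[0], -w[1]])
    direction, negative = cg_with_negative_curvature(
        saddle_grad, np.array([0.1, 0.2])
    )
    print(direction, negative)
```

On a convex quadratic the routine reduces to ordinary truncated Newton (CG on the Newton system); on the toy saddle it returns the negative curvature direction, which a caller could follow with a line search. How the paper combines or distributes these steps is not shown here.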

Findings

01 Achieves near-linear speed-up on 16 CPU nodes.

02 Demonstrates faster training on MNIST and TIMIT datasets.

03 Enables large batch training with robust performance.

Abstract

Training deep neural networks is a high-dimensional and highly non-convex optimization problem. The stochastic gradient descent (SGD) algorithm and its variants are the current state-of-the-art solvers for this task. However, due to the non-convex nature of the problem, SGD has been observed to slow down near saddle points. Recent empirical work claims that detecting and escaping saddle points efficiently is likely to improve training performance. With this objective, we revisit the Hessian-free optimization method for deep networks. We also develop its distributed variant and demonstrate superior scaling potential compared with SGD, which allows larger computing resources to be used more efficiently, thus enabling large models and a faster time to the desired solution. Furthermore, unlike the truncated Newton method (Martens' HF), which ignores negative curvature information by using naïve…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications

Methods: Stochastic Gradient Descent