Distributed Newton Methods for Deep Neural Networks

Chien-Chih Wang; Kent Loong Tan; Chun-Ting Chen; Yu-Hsiang Lin; S. Sathiya Keerthi; Dhruv Mahajan; S. Sundararajan; Chih-Jen Lin

arXiv:1802.00130 · stat.ML · February 2, 2018 · 1 citation

TL;DR

This paper proposes a distributed Newton method for training deep neural networks that reduces communication and synchronization costs. The method is reported to be more robust than stochastic gradient methods and may give better test accuracy.

Contribution

The paper proposes a new distributed Newton approach with techniques to reduce communication, memory, and synchronization costs in deep neural network training.

Findings

01. Effective in distributed training scenarios

02. More robust than stochastic gradient methods

03. Potentially better test accuracy

Abstract

Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this paper, we focus on situations where the model is distributedly stored, and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such…
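The idea of using the Jacobian explicitly for matrix-vector products under a variable partition can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a Gauss-Newton approximation of the form J^T J plus a damping term (squared loss), simulates the nodes as a Python list of column blocks, and uses the hypothetical names `partitioned_gauss_newton_matvec` and `conjugate_gradient`. In a real distributed setting the sum over partial products would be a reduce operation across machines.

```python
import numpy as np

def partitioned_gauss_newton_matvec(J_blocks, v_blocks, damping=0.0):
    """Return (J^T J + damping * I) v as a list of per-node slices.

    Assumption: J is split column-wise (by variables), so simulated node p
    holds J_blocks[p] and the matching slice v_blocks[p] of the vector.
    """
    # Each node multiplies its column block by its own slice of v.
    partial = [J_p @ v_p for J_p, v_p in zip(J_blocks, v_blocks)]
    # Summing the partial products stands in for a reduce across nodes.
    Jv = np.sum(partial, axis=0)
    # Each node then forms its slice of the result from local data and Jv.
    return [J_p.T @ Jv + damping * v_p for J_p, v_p in zip(J_blocks, v_blocks)]

def conjugate_gradient(matvec, b, max_iter=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0 initially
    d = r.copy()          # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ad = matvec(d)
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

# Toy usage: 6 outputs, 5 variables, variables split across two simulated nodes.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 5))
gradient = rng.standard_normal(5)
cuts = [2]  # node 0 holds variables 0-1, node 1 holds variables 2-4
J_blocks = np.split(J, cuts, axis=1)

def matvec(u):
    # Scatter u to the nodes, apply the partitioned product, gather the slices.
    slices = partitioned_gauss_newton_matvec(J_blocks, np.split(u, cuts), damping=0.1)
    return np.concatenate(slices)

# Newton-type direction: solve (J^T J + 0.1 I) d = -gradient without forming J^T J.
direction = conjugate_gradient(matvec, -gradient)
```

The point of the sketch is that the curvature matrix is never formed: each node only needs its own block of J, and the only quantity shared between nodes is the vector `Jv`, whose length is the number of outputs rather than the number of weights.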

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM