Distributed Newton Methods for Deep Neural Networks
Chien-Chih Wang, Kent Loong Tan, Chun-Ting Chen, Yu-Hsiang Lin, S. Sathiya Keerthi, Dhruv Mahajan, S. Sundararajan, Chih-Jen Lin

TL;DR
This paper introduces a distributed Newton method for training deep neural networks that reduces communication and synchronization costs. It is reported to be more robust than stochastic gradient methods and may give better test accuracy.
Contribution
The paper proposes a new distributed Newton approach with techniques to reduce communication, memory, and synchronization costs in deep neural network training.
Findings
Effective in distributed training scenarios
More robust than stochastic gradient methods
Potentially better test accuracy
Abstract
Deep learning involves a difficult non-convex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this paper, we focus on situations where the model is distributedly stored, and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions, and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as the memory consumption. First, to reduce the communication cost, we propose a diagonalization method such…
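As a rough illustration of the Jacobian-based matrix-vector products the abstract mentions, the following minimal NumPy sketch may help. It is not the paper's implementation. It assumes a Gauss-Newton style curvature matrix of the form J^T J plus a damping term (an assumption, since the abstract does not give the formula), it simulates the variable partitions inside a single process instead of using real communication, and the function name `partitioned_gauss_newton_matvec` is hypothetical.

```python
import numpy as np


def partitioned_gauss_newton_matvec(jacobian_blocks, vector_blocks, damping=0.0):
    """Compute (J^T J + damping * I) v with J split column-wise into partitions.

    Hypothetical helper for illustration only. Each entry of jacobian_blocks
    holds the columns of J owned by one partition, and the matching entry of
    vector_blocks holds that partition's slice of v.
    """
    # Step 1: every partition computes its local contribution J_p v_p.
    local_products = [J_p @ v_p for J_p, v_p in zip(jacobian_blocks, vector_blocks)]

    # Step 2: summing the contributions stands in for an all-reduce,
    # giving the full product u = J v on every partition.
    u = np.sum(local_products, axis=0)

    # Step 3: every partition forms only its own slice of the result,
    # J_p^T u + damping * v_p, so J^T J is never built explicitly.
    return [J_p.T @ u + damping * v_p
            for J_p, v_p in zip(jacobian_blocks, vector_blocks)]


# Small synthetic example: 6 outputs, 10 variables, 3 simulated partitions.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 10))
v = rng.standard_normal(10)

split_points = [3, 7]  # variables 0-2, 3-6, 7-9
J_blocks = np.split(J, split_points, axis=1)
v_blocks = np.split(v, split_points)

result = np.concatenate(
    partitioned_gauss_newton_matvec(J_blocks, v_blocks, damping=0.1)
)

# Reference computed without any partitioning.
reference = J.T @ (J @ v) + 0.1 * v
```

In this sketch the only quantity shared between partitions is the vector `u`, whose length is the number of outputs and not the number of variables. That is one plausible reason explicit Jacobian products suit a setting where the model is stored in a distributed way.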
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
