Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Liang Luo; Jacob Nelson; Luis Ceze; Amar Phanishayee; Arvind Krishnamurthy

arXiv:1801.09805 · cs.DC · January 22, 2020

TL;DR

This paper introduces PBox, a balanced hardware and software solution for distributed deep neural network training that significantly improves scalability and training speed in cloud environments.

Contribution

The paper presents PBox hardware and PHub software, optimizing parameter server performance to enhance distributed DNN training efficiency in cloud settings.

Findings

01. Achieves up to 3.8x speedup on ImageNet training
02. Balances compute and communication for scalable training
03. Improves existing parameter server frameworks

Abstract

Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PBox, a balanced, scalable central parameter server (PS) hardware that balances compute and communication…
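The abstract's point that the bottleneck shifts from computation to communication can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper: the function names, the single full-duplex link at the parameter server, and the assumption that every worker pushes one full gradient copy per iteration are all simplifications chosen for illustration.

```python
def ps_comm_seconds(model_bytes: float, workers: int, link_gbps: float) -> float:
    """Estimate the time for a central PS to receive one gradient copy per worker.

    Assumption (not from the paper): the PS has a single full-duplex link,
    so inbound pushes and outbound pulls overlap and each direction carries
    workers * model_bytes per iteration.
    """
    link_bytes_per_s = link_gbps * 1e9 / 8  # convert Gbit/s to bytes/s
    return workers * model_bytes / link_bytes_per_s


def is_communication_bound(model_bytes: float, workers: int,
                           link_gbps: float, compute_seconds: float) -> bool:
    """Return True when the estimated PS transfer time exceeds per-iteration compute time."""
    return ps_comm_seconds(model_bytes, workers, link_gbps) > compute_seconds


if __name__ == "__main__":
    # Hypothetical numbers: a 100 MB model, 8 workers, a 10 Gbit/s PS link,
    # and 0.2 s of GPU compute per iteration.
    t = ps_comm_seconds(100e6, 8, 10)
    print(f"estimated PS transfer time: {t:.2f} s")
    print("communication bound:", is_communication_bound(100e6, 8, 10, 0.2))
```

Under these hypothetical numbers the transfer time grows linearly with the worker count while compute time per iteration stays fixed, which is the imbalance a parameter server design with more aggregate bandwidth is meant to address.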

Taxonomy

Topics: Neural Networks and Applications