Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

TL;DR
This paper introduces PBox, a balanced hardware and software solution for distributed deep neural network training that significantly improves scalability and training speed in cloud environments.
Contribution
The paper presents PBox hardware and PHub software, optimizing parameter server performance to enhance distributed DNN training efficiency in cloud settings.
Findings
Achieves up to 3.8x speedup on ImageNet training
Balances compute and communication for scalable training
Improves existing parameter server frameworks
Abstract
Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PBox, a balanced, scalable central PS hardware that balances compute and communication…
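The abstract's claim that the bottleneck shifts from computation to communication can be illustrated with a back-of-envelope estimate. The sketch below is not taken from the paper: it assumes a single central parameter server on a full-duplex link, every worker pushing a full gradient and pulling a full model each iteration, and no overlap between compute and communication. The function names and all numbers (model size, worker count, link speed, compute time) are hypothetical.

```python
def comm_time_s(model_bytes: float, workers: int, link_gbps: float) -> float:
    """Time for all workers to push gradients through one parameter server link.

    Assumption: the link is full duplex, so the pull phase moves the same
    number of bytes in the other direction and is not counted twice.
    """
    link_bytes_per_s = link_gbps * 1e9 / 8  # convert Gbit/s to bytes/s
    return model_bytes * workers / link_bytes_per_s


def bottleneck(compute_s: float, model_bytes: float, workers: int,
               link_gbps: float) -> str:
    """Report which phase dominates one iteration under the assumptions above."""
    comm = comm_time_s(model_bytes, workers, link_gbps)
    return "communication" if comm > compute_s else "computation"


# Hypothetical workload: 100 MB of parameters, 0.2 s of GPU compute per
# iteration, and a 10 Gbit/s link into the parameter server.
MODEL_BYTES = 100e6
COMPUTE_S = 0.2
LINK_GBPS = 10.0

for n in (1, 2, 8):
    print(n, round(comm_time_s(MODEL_BYTES, n, LINK_GBPS), 3),
          bottleneck(COMPUTE_S, MODEL_BYTES, n, LINK_GBPS))
```

Under these made-up numbers a single worker is compute-bound, but eight workers sharing one server link spend more time on communication than on compute. Raising `workers` or lowering `COMPUTE_S` (a faster GPU) pushes the estimate further toward communication.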
Taxonomy
Topics: Neural Networks and Applications
