GaDei: On Scale-up Training As A Service For Deep Learning
Wei Zhang, Minwei Feng, Yunhui Zheng, Yufei Ren, Yandong Wang, Ji Liu, Peng Liu, Bing Xiang, Li Zhang, Bowen Zhou, Fei Wang

TL;DR
GaDei is a scale-up deep learning training system designed for training-as-a-service scenarios, where hyper-parameters are fixed. It is optimized for high communication bandwidth, provides fault tolerance, and outperforms existing parameter-server implementations.
Contribution
We introduce GaDei, a shared-memory based scale-up parameter server system that handles high bandwidth communication, guarantees deadlock-free operation, and offers fault-tolerance for deep learning training-as-a-service.
Findings
GaDei significantly outperforms state-of-the-art parameter-server implementations.
It maintains model accuracy with fixed hyper-parameters across diverse workloads.
GaDei achieves near hardware-limited runtime performance.
Abstract
Deep learning (DL) training-as-a-service (TaaS) is an important emerging industrial workload. The unique challenge of TaaS is that it must satisfy a wide range of customers who have neither the experience nor the resources to tune DL hyper-parameters, and meticulous tuning for each user's dataset is prohibitively expensive. Therefore, TaaS hyper-parameters must be fixed with values that are applicable to all users. The IBM Watson Natural Language Classifier (NLC) service, the most popular IBM cognitive service used by thousands of enterprise-level clients around the globe, is a typical TaaS service. By evaluating the NLC workloads, we show that only the conservative hyper-parameter setup (e.g., small mini-batch size and small learning rate) can guarantee acceptable model accuracy for a wide range of customers. We further justify theoretically why such a setup guarantees better model convergence in…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Advanced Memory and Neural Computing
