Scaling Deep Learning on GPU and Knights Landing clusters

Yang You; Aydin Buluc; James Demmel

arXiv:1708.02983·cs.DC·August 11, 2017

TL;DR

This paper improves distributed deep learning training efficiency on GPU and KNL clusters by redesigning EASGD algorithms and applying system-algorithm codesign, achieving significant speedups and high scalability.

Contribution

It introduces new HPC-optimized variants of EASGD, including Sync EASGD, with system-algorithm codesign techniques to enhance scalability and reduce communication overhead.

Findings

01. Sync EASGD achieves 5.3x speedup over original EASGD.

02. Communication overhead reduced from 87% to 14%.

03. 91.5% weak scaling efficiency on 4253 KNL cores.
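For readers unfamiliar with the term, weak scaling efficiency is commonly computed as the baseline run time divided by the scaled run time when the work per processor is held fixed. The sketch below uses that common definition; the function name and the timings are hypothetical, and the paper's exact measurement protocol is not stated on this page.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Ratio of baseline time to scaled time, with work per processor held fixed.

    A value of 1.0 means run time did not grow as processors (and total work)
    were added; lower values mean overhead such as communication grew.
    """
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_scaled


# Hypothetical timings in seconds, for illustration only (not from the paper).
efficiency = weak_scaling_efficiency(t_base=100.0, t_scaled=125.0)
print(f"{efficiency:.1%}")
```

Under this definition, an efficiency of 91.5% would correspond to the scaled run taking about 1.09 times as long as the baseline run.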

Abstract

The speed of deep neural network training has become a major bottleneck in deep learning research and development. For example, training GoogleNet on the ImageNet dataset with one Nvidia K20 GPU takes 21 days. To speed up training, current deep learning systems rely heavily on hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithmic perspective, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow networks and high fault-tolerance requirements of cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters.…
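As background for the abstract's mention of EASGD, the following is a minimal sketch of the basic elastic averaging update as it is usually stated: each worker takes a gradient step plus a pull toward a shared center variable, and the center moves toward the workers. It is not the paper's Sync EASGD variant, whose details are not given on this page. The helper names (`easgd_round`, `toy_gradient`, `run_toy`), the quadratic toy objective, and the hyperparameter values are all assumptions made for illustration.

```python
def easgd_round(workers, center, grads, lr, rho):
    """One elastic-averaging round in which all workers update together.

    workers: list of local parameter vectors (lists of floats)
    center:  shared center vector
    grads:   one gradient vector per worker, evaluated at that worker's parameters
    lr, rho: learning rate and elastic coupling strength
    """
    # Elastic differences are taken from the values at the start of the round.
    diffs = [[w_j - c_j for w_j, c_j in zip(w, center)] for w in workers]
    # Each worker: gradient step plus a pull toward the center.
    new_workers = [
        [w_j - lr * (g_j + rho * d_j) for w_j, g_j, d_j in zip(w, g, d)]
        for w, g, d in zip(workers, grads, diffs)
    ]
    # Center: moves toward the workers by the summed elastic differences.
    new_center = [
        c_j + lr * rho * sum(d[j] for d in diffs)
        for j, c_j in enumerate(center)
    ]
    return new_workers, new_center


def toy_gradient(w, target):
    # Gradient of 0.5 * ||w - target||^2, a stand-in for a minibatch gradient.
    return [w_j - t_j for w_j, t_j in zip(w, target)]


def run_toy(steps=200, lr=0.1, rho=0.5):
    target = [1.0, -2.0]  # hypothetical optimum of the toy objective
    workers = [[0.0, 0.0], [4.0, 4.0], [-3.0, 1.0]]
    center = [0.0, 0.0]
    for _ in range(steps):
        grads = [toy_gradient(w, target) for w in workers]
        workers, center = easgd_round(workers, center, grads, lr, rho)
    return workers, center
```

In a distributed setting, the sum over elastic differences is the step that requires communication between the workers and the holder of the center variable, which is why the cost of that exchange matters for scaling.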
