ChainerMN: Scalable Distributed Deep Learning Framework

Takuya Akiba; Keisuke Fukuda; Shuji Suzuki

arXiv:1710.11351 · cs.DC · November 1, 2017 · 58 cites

TL;DR

ChainerMN is a distributed deep learning framework that scales training across many GPUs. The authors report scaling ResNet-50 training on ImageNet to 128 GPUs with 90% parallel efficiency.

Contribution

This paper presents the design, implementation, and evaluation of ChainerMN, a distributed deep learning framework built to scale training across multiple GPUs with high parallel efficiency.

Findings

01  Scales ResNet-50 training on ImageNet to 128 GPUs

02  Achieves 90% parallel efficiency at that scale

03  Demonstrates effective distributed training performance

Abstract

A key factor behind deep learning's breakthroughs in various fields has been the use of high computing power, centered on GPUs. Harnessing further computing capacity through distributed processing is essential not only to make deep learning bigger and faster but also to tackle unsolved challenges. We present the design, implementation, and evaluation of ChainerMN, the distributed deep learning framework we have developed. We demonstrate that ChainerMN can scale training of the ResNet-50 model on the ImageNet dataset up to 128 GPUs with a parallel efficiency of 90%.
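To make the 90% figure concrete, the sketch below computes parallel efficiency under the common definition: measured speedup divided by the ideal linear speedup. This summary does not state the baseline the authors used (a single GPU, a single node, or something else), so the single-worker baseline, the function names, and the throughput numbers here are assumptions chosen for illustration, not values from the paper.

```python
def parallel_efficiency(throughput_n, throughput_1, n_workers):
    """Measured speedup divided by ideal linear speedup.

    throughput_n: samples/sec measured with n_workers workers.
    throughput_1: samples/sec measured with the baseline (assumed: 1 worker).
    """
    if n_workers <= 0 or throughput_1 <= 0:
        raise ValueError("n_workers and throughput_1 must be positive")
    speedup = throughput_n / throughput_1
    return speedup / n_workers


def implied_speedup(efficiency, n_workers):
    """Speedup over the baseline implied by a given parallel efficiency."""
    return efficiency * n_workers


# Hypothetical throughputs (not from the paper): 100 images/sec on one GPU
# and 11,520 images/sec on 128 GPUs.
eff = parallel_efficiency(11520.0, 100.0, 128)
print(f"efficiency: {eff:.2f}")

# The reported 90% efficiency at 128 GPUs, under the single-GPU baseline
# assumption, corresponds to this speedup factor.
print(f"implied speedup: {implied_speedup(0.90, 128):.1f}x")
```

Read this way, 90% efficiency on 128 GPUs means the run is roughly 115 times faster than the baseline, rather than the ideal 128 times; the gap is the cost of communication and synchronization.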

Code & Models

Repositories

chainer/chainermn (Official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques