ChainerMN: Scalable Distributed Deep Learning Framework
Takuya Akiba, Keisuke Fukuda, Shuji Suzuki

TL;DR
ChainerMN is a scalable distributed deep learning framework that uses multiple GPUs efficiently, enabling large-scale training such as ResNet-50 on ImageNet with high parallel efficiency.
Contribution
This paper introduces ChainerMN, a new distributed deep learning framework that achieves high scalability and efficiency across multiple GPUs.
Findings
Scales ResNet-50 training on ImageNet up to 128 GPUs
Achieves a parallel efficiency of 90% at that scale
Presents the design, implementation, and evaluation of the framework
Abstract
A key factor behind deep learning's breakthroughs in various fields has been the use of high computing power, centered on GPUs. Harnessing further computing capacity through distributed processing is essential not only to make deep learning bigger and faster but also to tackle unsolved challenges. We present the design, implementation, and evaluation of ChainerMN, the distributed deep learning framework we have developed. We demonstrate that ChainerMN can scale the training of the ResNet-50 model on the ImageNet dataset up to 128 GPUs with a parallel efficiency of 90%.
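To make the 90% figure concrete, here is a minimal sketch of how parallel efficiency is commonly computed: measured speedup divided by the number of workers. The abstract does not state the exact baseline or metric the authors used, so the single-GPU baseline, the throughput numbers, and the function name parallel_efficiency are all illustrative assumptions, not values from the paper.

```python
def parallel_efficiency(throughput_1, throughput_n, n_workers):
    """Return the fraction of ideal linear scaling achieved.

    throughput_1: samples/sec on the baseline (assumed here: one GPU)
    throughput_n: samples/sec on n_workers GPUs
    n_workers:    number of GPUs in the scaled run
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if throughput_1 <= 0:
        raise ValueError("throughput_1 must be positive")
    speedup = throughput_n / throughput_1
    return speedup / n_workers


# Hypothetical throughputs, chosen only so the result matches 90% at 128 GPUs.
baseline_images_per_sec = 100.0     # assumed single-GPU throughput
scaled_images_per_sec = 11520.0     # assumed 128-GPU throughput

efficiency = parallel_efficiency(baseline_images_per_sec,
                                 scaled_images_per_sec,
                                 128)
print(f"parallel efficiency: {efficiency:.2f}")
```

Under this definition, 90% efficiency on 128 GPUs corresponds to a speedup of about 115x over the baseline, compared with 128x for perfectly linear scaling.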
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
