An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

Albert Njoroge Kahira; Truong Thao Nguyen; Leonardo Bautista Gomez; Ryousei Takano; Rosa M Badia; Mohamed Wahib

arXiv:2104.09075 · cs.DC · April 20, 2021

TL;DR

This paper introduces an oracle utility, built on a model-driven analysis of compute, communication, and memory requirements, that detects the limitations and bottlenecks of different parallel training strategies for CNNs at scale.

Contribution

It provides a model-driven analysis framework and an oracle utility to identify limitations in different parallelism approaches for CNN training at scale.

Findings

01  The oracle's predictions reach an average accuracy of about 86.74% when compared to empirical results.

02  Accuracy is as high as 97.57% for data parallelism.

03  Evaluated on six parallelization strategies, four CNN models, and multiple datasets (2D and 3D), on up to 1024 GPUs.

Abstract

Deep Neural Network (DNN) frameworks use distributed training to enable faster time to convergence and alleviate memory capacity limitations when training large models and/or using high dimension inputs. With the steady increase in datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches on performance and scalability. We leverage our model-driven analysis to be the basis for an oracle utility which can help in detecting the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results…
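
To make the idea of a model-driven analysis concrete, the sketch below estimates the per-iteration time of plain data parallelism as compute time plus a ring all-reduce of the gradients, using the standard alpha-beta (latency-bandwidth) cost model. This is not the paper's oracle: the function names, the default latency and bandwidth values, and the assumption that communication does not overlap with compute are all illustrative choices made here.

```python
def ring_allreduce_seconds(num_bytes, num_gpus, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta estimate for a ring all-reduce of num_bytes across num_gpus."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    if num_gpus == 1:
        return 0.0  # a single GPU has nothing to exchange
    steps = 2 * (num_gpus - 1)          # reduce-scatter phase + all-gather phase
    chunk_bytes = num_bytes / num_gpus  # each step moves one chunk per GPU
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


def data_parallel_iteration_seconds(compute_s, num_params, num_gpus,
                                    bytes_per_param=4,
                                    latency_s=5e-6,              # assumed link latency
                                    bandwidth_bytes_per_s=10e9): # assumed link bandwidth
    """Per-iteration time under data parallelism.

    Assumes a fixed per-GPU batch (so compute_s does not change with num_gpus)
    and no overlap between gradient exchange and computation.
    """
    gradient_bytes = num_params * bytes_per_param
    comm_s = ring_allreduce_seconds(gradient_bytes, num_gpus,
                                    latency_s, bandwidth_bytes_per_s)
    return compute_s + comm_s


if __name__ == "__main__":
    # Hypothetical model with 25M parameters and 0.1 s of compute per iteration.
    for gpus in (1, 8, 64, 1024):
        t = data_parallel_iteration_seconds(0.1, 25_000_000, gpus)
        print(f"{gpus:5d} GPUs: {t:.4f} s per iteration")
```

In this toy model the bandwidth term approaches twice the gradient size divided by the link bandwidth as the GPU count grows, while the latency term grows linearly with the number of GPUs. That is one simple way a communication bottleneck can emerge at scale.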

Code & Models

Repositories

billmj/UTEP_PNNL_DeepLearning_Optimization
pytorch