What does fault tolerant Deep Learning need from MPI?
Vinay Amatya, Abhinav Vishnu, Charles Siegel, Jeff Daily

TL;DR
This paper explores the requirements for fault tolerant MPI in deep learning, proposing extensions and evaluating them with large-scale experiments to improve resilience during long training runs.
Contribution
It analyzes fault tolerance needs for MPI in deep learning, evaluates existing MPI fault tolerance proposals, and extends MaTEx-Caffe with ULFM for improved fault resilience.
Findings
ULFM-based MPI effectively handles faults in large-scale DL training
Fault tolerance features are crucial for long-duration DL jobs on large datasets
Extended MaTEx-Caffe demonstrates improved robustness with ULFM
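To make the idea of continuing through a permanent fault concrete, here is a minimal sketch in pure Python. It does not use MPI or ULFM and is not the paper's implementation. It only simulates the general "drop the failed rank and keep averaging over the survivors" pattern that a shrinking communicator makes possible. All function names, the toy loss, and the failure schedule are assumptions made for illustration.

```python
# Toy simulation of data-parallel SGD that survives permanent rank failures.
# Assumption: this mimics the "shrink and continue" pattern conceptually;
# no real MPI/ULFM call is made and every name here is hypothetical.

def allreduce_mean(grads_by_rank, alive):
    """Average per-rank gradient vectors over the ranks that are still alive."""
    survivors = [r for r in sorted(grads_by_rank) if r in alive]
    if not survivors:
        raise RuntimeError("no surviving ranks")
    n = len(survivors)
    dim = len(grads_by_rank[survivors[0]])
    return [sum(grads_by_rank[r][i] for r in survivors) / n for i in range(dim)]


def train(weights, steps, targets_by_rank, fail_at, lr=0.1):
    """Run SGD where rank r minimizes 0.5 * (w - target_r)^2 on its data shard.

    fail_at maps a step index to the list of ranks that fail permanently
    at the start of that step.
    """
    alive = set(targets_by_rank)
    for step in range(steps):
        # A permanent fault: the rank is removed and never rejoins.
        for r in fail_at.get(step, []):
            alive.discard(r)
        # Each surviving rank computes its local gradient.
        grads = {
            r: [w - target for w in weights]
            for r, target in targets_by_rank.items()
            if r in alive
        }
        # The average is renormalized over survivors only.
        g = allreduce_mean(grads, alive)
        weights = [w - lr * gi for w, gi in zip(weights, g)]
    return weights, alive


if __name__ == "__main__":
    final_w, survivors = train(
        weights=[0.0],
        steps=200,
        targets_by_rank={0: 1.0, 1: 3.0, 2: 5.0},
        fail_at={0: [2]},  # rank 2 fails before the first step
    )
    print(final_w, sorted(survivors))
```

In this toy setting, training converges to the optimum of the surviving ranks' data rather than stalling, which is the behavior a fault tolerant training run aims for. The cost is that the failed rank's data shard no longer contributes unless it is redistributed.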
Abstract
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive - even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults - requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of…
Taxonomy
Topics: Advanced Neural Network Applications · Radiation Effects in Electronics · Adversarial Robustness in Machine Learning
