Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA

Mohamed Wahib; Haoyu Zhang; Truong Thao Nguyen; Aleksandr Drozd; Jens Domke; Lingqi Zhang; Ryousei Takano; Satoshi Matsuoka

arXiv:2008.11421·cs.DC·August 27, 2020·1 cites

Open Access

TL;DR

This paper presents KARMA, an out-of-core training strategy that lets large deep learning models scale beyond accelerator memory limits, with an average 1.52x speedup over state-of-the-art out-of-core methods across six models.

Contribution

KARMA combines layer swapping with redundant recomputing, guided by a performance model of out-of-core training behavior, and introduces the first multi-node out-of-core training method, which pipelines gradient exchanges.

Findings

01. Achieves an average 1.52x speedup across six models over state-of-the-art out-of-core methods

02. Outperforms hybrid model parallelism on large models like Megatron-LM

03. Enables efficient multi-node out-of-core training with pipelining

Abstract

The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on…
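To make the swap-versus-recompute trade-off in the abstract concrete, the following is a minimal sketch, not KARMA's performance model. It assumes a toy greedy policy that ignores the overlap of transfers with computation, which is exactly what the paper's concurrency analysis accounts for. The `Layer` fields, `plan_layers`, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    activation_bytes: int   # size of the activation kept for the backward pass
    forward_seconds: float  # estimated time to recompute that activation


def plan_layers(layers, memory_budget_bytes, link_bytes_per_second):
    """Toy greedy policy (an assumption, not the paper's strategy).

    Activations stay on the device while they fit in the budget. For each
    layer that does not fit, pick the cheaper of swapping it to host memory
    or recomputing it during the backward pass.
    """
    plan = {}
    resident_bytes = 0
    for layer in layers:
        if resident_bytes + layer.activation_bytes <= memory_budget_bytes:
            plan[layer.name] = "keep"
            resident_bytes += layer.activation_bytes
            continue
        # Swapping costs two transfers: device to host, then host to device.
        swap_seconds = 2 * layer.activation_bytes / link_bytes_per_second
        if swap_seconds <= layer.forward_seconds:
            plan[layer.name] = "swap"
        else:
            plan[layer.name] = "recompute"
    return plan


# Hypothetical three-layer model with made-up sizes and timings.
example_layers = [
    Layer("conv1", activation_bytes=400, forward_seconds=0.5),
    Layer("conv2", activation_bytes=400, forward_seconds=0.1),
    Layer("fc", activation_bytes=200, forward_seconds=1.0),
]
example_plan = plan_layers(
    example_layers, memory_budget_bytes=500, link_bytes_per_second=1000
)
print(example_plan)
```

In this toy setting, a layer that is cheap to recompute but large to transfer is recomputed, while a layer that is expensive to recompute but small to transfer is swapped. A realistic planner would also model how much of the transfer time is hidden behind computation.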

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Topic Modeling