Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
Mohamed Wahib, Haoyu Zhang, Truong Thao Nguyen, Aleksandr Drozd, Jens Domke, Lingqi Zhang, Ryousei Takano, Satoshi Matsuoka

TL;DR
This paper presents KARMA, an out-of-core training strategy that lets large deep learning models scale beyond accelerator memory limits, averaging a 1.52x speedup over state-of-the-art out-of-core methods.
Contribution
KARMA introduces a combined layer swapping and recomputing approach, including the first multi-node out-of-core training method with pipelined gradient exchange.
Findings
Achieves an average 1.52x speedup across six models over state-of-the-art out-of-core methods
Outperforms hybrid model parallelism on large models such as Megatron-LM
Enables multi-node out-of-core training by pipelining gradient exchanges
Abstract
The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reduce the memory pressure issue, significant modification of the source code and considerations for algorithms are required. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on…
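The idea of combining layer swapping with redundant recomputing can be illustrated with a toy planner. This is a minimal sketch under stated assumptions, not KARMA's performance model: the `Layer` fields, the `plan_layers` function, and the rule "swap when a round trip over the host link is no slower than re-running the forward pass" are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    activation_bytes: int    # activations that backward would need (assumed known)
    forward_seconds: float   # cost of re-running this layer's forward pass


def plan_layers(layers, memory_budget_bytes, link_bytes_per_second):
    """Toy heuristic (an assumption, not the paper's model).

    Keep activations on the device while they fit in the budget. For each
    layer that does not fit, choose between swapping to host memory and
    recomputing, whichever has the lower estimated cost.
    """
    plan = {}
    resident = 0
    for layer in layers:
        if resident + layer.activation_bytes <= memory_budget_bytes:
            resident += layer.activation_bytes
            plan[layer.name] = "keep"
            continue
        # Swapping moves the activations out and back in: two transfers.
        swap_seconds = 2 * layer.activation_bytes / link_bytes_per_second
        if swap_seconds <= layer.forward_seconds:
            plan[layer.name] = "swap"
        else:
            plan[layer.name] = "recompute"
    return plan


# Hypothetical layer sizes and timings, for illustration only.
example_layers = [
    Layer("conv1", 4_000, 0.010),
    Layer("conv2", 8_000, 0.001),
    Layer("fc", 2_000, 0.050),
]
example_plan = plan_layers(example_layers, memory_budget_bytes=5_000,
                           link_bytes_per_second=1_000_000)
```

In this sketch a cheap-to-recompute layer with large activations ends up recomputed, while an expensive layer with small activations is swapped. A real strategy would also account for overlap between transfers and compute, which this toy ignores.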
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Topic Modeling
