Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung Kim

TL;DR
Harmony is a training framework that significantly reduces memory swapping and boosts training speed for massive DNN models on a single commodity server, broadening access to large-scale neural network training.
Contribution
Harmony introduces a novel scheduling and data movement approach that overcomes GPU memory limitations without excessive swapping, allowing efficient training of large models on limited-resource servers.
Findings
Up to 100x reduction in swap load
Up to 7.6x increase in training throughput
Effective on various massive DNN models
Abstract
Deep neural networks (DNNs) have grown exponentially in size over the past decade, leaving only those who have massive datacenter-based resources with the ability to develop and train such models. One of the main challenges for the long tail of researchers who might have only limited resources (e.g., a single multi-GPU server) is limited GPU memory capacity compared to model size. The problem is so acute that the memory requirement of training massive DNN models can often exceed the aggregate capacity of all available GPUs on a single server; this problem only gets worse with the trend of ever-growing model sizes. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data to push the…
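To make the capacity gap described in the abstract concrete, here is a rough back-of-envelope sketch. It is not taken from the paper: the 10-billion-parameter model, the 4 × 16 GiB server, fp32 storage, and an Adam-style optimizer with two states per parameter are all illustrative assumptions, and activation memory is ignored entirely.

```python
GIB = 1024 ** 3


def training_state_bytes(num_params: int,
                         bytes_per_value: int = 4,
                         optimizer_states: int = 2) -> int:
    """Bytes for weights + gradients + optimizer states.

    Assumptions (hypothetical, not from the paper): every value is stored
    at `bytes_per_value` bytes (4 = fp32), and the optimizer keeps
    `optimizer_states` extra values per parameter (2 for Adam-style
    optimizers). Activations and workspace memory are NOT counted.
    """
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer states
    return num_params * bytes_per_value * copies


def fits_in_gpus(num_params: int, num_gpus: int, gpu_mem_bytes: int) -> bool:
    """True if the training state alone fits in aggregate GPU memory."""
    return training_state_bytes(num_params) <= num_gpus * gpu_mem_bytes


if __name__ == "__main__":
    params = 10_000_000_000          # hypothetical 10B-parameter model
    num_gpus, per_gpu = 4, 16 * GIB  # hypothetical 4 x 16 GiB server
    need = training_state_bytes(params)
    have = num_gpus * per_gpu
    print(f"need {need / GIB:.1f} GiB, have {have / GIB:.1f} GiB, "
          f"fits: {fits_in_gpus(params, num_gpus, per_gpu)}")
```

Under these assumptions the model state alone exceeds the server's aggregate GPU memory, which is the situation where swapping to CPU memory becomes necessary.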
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · IoT and Edge/Fog Computing
