MalleTrain: Deep Neural Network Training on Unfillable Supercomputer Nodes

Xiaolong Ma; Feng Yan; Lei Yang; Ian Foster; Michael E. Papka; Zhengchun Liu; Rajkumar Kettimuthu

arXiv:2404.15668 · cs.DC · April 25, 2024

TL;DR

MalleTrain trains deep neural networks on transiently idle supercomputer nodes, dynamically re-allocating resources without needing scalability information in advance, and improves training throughput by up to 22.3%.

Contribution

It introduces a practical system that generalizes previous MILP-based approaches by using online profiling to optimize DNN training on unfilled supercomputer nodes.

Findings

01

Achieves up to 22.3% increase in training throughput.

02

Demonstrates feasibility of utilizing idle supercomputer nodes for DNN training.

03

Works without requiring pre-runtime scalability information.

Abstract

First-come first-serve scheduling can leave a substantial fraction (up to 10%) of supercomputer nodes transiently idle. Recognizing that such unfilled nodes are well-suited for deep neural network (DNN) training, due to the flexible nature of DNN training tasks, Liu et al. proposed that re-scaling DNN training tasks to fit gaps in schedules be formulated as a mixed-integer linear programming (MILP) problem, and demonstrated via simulation the potential benefits of the approach. Here, we introduce MalleTrain, a system that provides the first practical implementation of this approach and that furthermore generalizes it by allowing its use even for DNN training applications for which model information is unknown before runtime. Key to this latter innovation is the use of a lightweight online job profiling advisor (JPA) to collect critical scalability information for DNN jobs -- information…
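
To make the allocation idea concrete, the sketch below splits a fixed number of idle nodes among DNN jobs so that the summed throughput is maximal, given a per-job throughput table of the kind an online profiler might collect. It is an illustration only: it uses a small dynamic program rather than the MILP formulation the abstract refers to, and the function name `allocate_nodes`, the job names, and all throughput numbers are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple


def allocate_nodes(
    profiles: Dict[str, List[float]], idle_nodes: int
) -> Tuple[float, Dict[str, int]]:
    """Split idle nodes among jobs to maximize total throughput.

    profiles[job][k] is the (hypothetical) profiled throughput of `job`
    when it runs on k nodes; index 0 must be 0.0 (job not running).
    """
    # best[n] = (total throughput, allocation) using at most n nodes
    # across the jobs processed so far.
    best: List[Tuple[float, Dict[str, int]]] = [
        (0.0, {}) for _ in range(idle_nodes + 1)
    ]
    for job, table in profiles.items():
        new_best = []
        for n in range(idle_nodes + 1):
            # Option 1: give this job no nodes.
            candidate = (best[n][0], {**best[n][1], job: 0})
            # Option 2: give this job k nodes, the rest to earlier jobs.
            for k in range(1, min(n, len(table) - 1) + 1):
                total = best[n - k][0] + table[k]
                if total > candidate[0]:
                    candidate = (total, {**best[n - k][1], job: k})
            new_best.append(candidate)
        best = new_best
    return best[idle_nodes]


# Illustrative throughput tables (samples/s on 0..4 nodes); made-up values.
example_profiles = {
    "job_a": [0.0, 100.0, 190.0, 270.0, 340.0],
    "job_b": [0.0, 80.0, 150.0, 200.0, 230.0],
}
total, allocation = allocate_nodes(example_profiles, idle_nodes=4)
print(total, allocation)
```

Because both made-up jobs scale sublinearly, the best split of four nodes here is uneven: the job that loses less efficiency at scale gets more nodes. Such a table is the kind of scalability information that has to be known or measured before any allocation can be computed.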
