BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes
Zhengchun Liu, Rajkumar Kettimuthu, Michael E. Papka, Ian Foster

TL;DR
This paper presents BFTrainer, a method to utilize idle supercomputer nodes for deep neural network training by dynamically fitting small training tasks into schedule holes, achieving up to 93% efficiency.
Contribution
It introduces a MILP-based algorithm to efficiently rescale and schedule DNN training tasks in supercomputers' idle slots, optimizing resource utilization.
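To make the allocation problem concrete, here is a toy sketch in Python. It is not BFTrainer's formulation. The job list, the node bounds n_min/n_max, the throughput curves, and the function name allocate are all hypothetical, and an exhaustive search over integer node counts stands in for a real MILP solver. The sketch captures only the core constraint: the nodes handed to jobs cannot exceed the idle nodes available. It omits anything else the paper's model may include, such as the cost of rescaling a running job.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical job record: (name, n_min, n_max, throughput), where throughput(n)
# returns an assumed training rate (e.g. samples/s) when the job runs on n nodes.
Job = Tuple[str, int, int, Callable[[int], float]]


def allocate(jobs: List[Job], idle_nodes: int) -> Tuple[Dict[str, int], float]:
    """Pick integer node counts per job that maximize total throughput.

    Exhaustive search is a toy stand-in for a MILP solver: each job is either
    paused (0 nodes) or runs on n_min..n_max nodes, and the total may not
    exceed the idle nodes currently available.
    """
    choices = [[0] + list(range(n_min, n_max + 1)) for _, n_min, n_max, _ in jobs]
    best_alloc = {name: 0 for name, *_ in jobs}
    best_value = 0.0
    for counts in product(*choices):
        if sum(counts) > idle_nodes:
            continue  # capacity constraint violated
        value = sum(tp(n) for (_, _, _, tp), n in zip(jobs, counts) if n > 0)
        if value > best_value:
            best_value = value
            best_alloc = {name: n for (name, *_), n in zip(jobs, counts)}
    return best_alloc, best_value


# Usage with made-up, linearly scaling jobs and 6 idle nodes.
jobs = [
    ("A", 2, 4, lambda n: 10.0 * n),
    ("B", 1, 4, lambda n: 8.0 * n),
]
print(allocate(jobs, idle_nodes=6))  # ({'A': 4, 'B': 2}, 56.0)
```

Exhaustive search grows exponentially with the number of jobs, so a practical system would hand the same integer program to a MILP solver instead.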
Findings
Achieves up to 93% efficiency in resource utilization.
Effectively fits DNN training into transient supercomputer idle periods.
Validates approach with real scheduler logs and diverse training scenarios.
Abstract
Supercomputer FCFS-based scheduling policies result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node×time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined…
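The idea of a node×time hole can be illustrated with simple accounting. The sketch below assumes a hole is a list of (duration, idle node count) intervals. It also assumes that a malleable training job fills every interval but loses a fixed, hypothetical rescale_cost_s seconds on all of its nodes whenever its node count changes. The efficiency it reports (useful node-seconds divided by offered node-seconds) is an illustrative definition, not necessarily the metric the paper uses.

```python
from typing import List, Tuple

# A hole as a hypothetical list of (duration_s, idle_nodes) intervals.
Hole = List[Tuple[float, int]]


def offered_node_seconds(hole: Hole) -> float:
    """Total idle node-seconds the scheduler leaves unused in this hole."""
    return sum(duration * nodes for duration, nodes in hole)


def useful_node_seconds(hole: Hole, rescale_cost_s: float) -> float:
    """Node-seconds left for training under an assumed overhead model:
    every change in node count stalls all of the job's nodes for
    rescale_cost_s seconds at the start of the new interval."""
    useful = 0.0
    prev_nodes = 0  # the job starts with no nodes, so the first interval is a rescale
    for duration, nodes in hole:
        lost = rescale_cost_s if nodes != prev_nodes else 0.0
        useful += max(duration - lost, 0.0) * nodes
        prev_nodes = nodes
    return useful


def efficiency(hole: Hole, rescale_cost_s: float) -> float:
    """Illustrative ratio of useful to offered node-seconds."""
    offered = offered_node_seconds(hole)
    return useful_node_seconds(hole, rescale_cost_s) / offered if offered else 0.0


hole = [(100.0, 2), (100.0, 2), (50.0, 4)]
print(offered_node_seconds(hole))     # 600.0
print(useful_node_seconds(hole, 10))  # 540.0
print(efficiency(hole, 10))           # 0.9
```

The second interval keeps the same node count, so it incurs no overhead. In this model, short holes whose size changes often are the costly case, because each change pays the fixed rescale cost again.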
