Node-Based Job Scheduling for Large Scale Simulations of Short Running Jobs

Chansup Byun; William Arcand; David Bestor; Bill Bergeron; Vijay Gadepally; Michael Houle; Matthew Hubbell; Michael Jones; Anna Klein; Peter Michaleas; Lauren Milechin; Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Siddharth Samsi; Charles Yee; Jeremy Kepner

arXiv:2108.11359·cs.DC·December 13, 2021

TL;DR

This paper introduces a node-based scheduling method for large-scale simulations that significantly improves scheduler performance, enabling efficient resource utilization for both short and long jobs on supercomputing systems.

Contribution

The paper presents a novel node-based scheduling approach that achieves up to 100 times faster scheduler performance than other state-of-the-art systems when launching and releasing large-scale short running jobs.

Findings

01

Up to 100x faster scheduler performance.

02

Efficient resource utilization for mixed workloads.

03

Applicable to large-scale supercomputing environments.

Abstract

Diverse workloads such as interactive supercomputing, big data analysis, and large-scale AI algorithm development require a high-performance scheduler. This paper presents a novel node-based scheduling approach for large scale simulations of short running jobs on MIT SuperCloud systems. The approach allows the resources to be fully utilized for long running batch jobs while simultaneously providing fast launch and release of large-scale short running jobs. The node-based scheduling approach has demonstrated up to 100 times faster scheduler performance than other state-of-the-art systems.
