Towards Accommodating Real-time Jobs on HPC Platforms
Sam Nickolay, Eun-Sung Jung, Rajkumar Kettimuthu, Ian Foster

TL;DR
This paper formulates the problem of scheduling real-time jobs on HPC platforms, proposes heuristics for it, and evaluates their effect on system performance through simulation. The goal is to improve real-time responsiveness without significantly harming batch jobs.
Contribution
The paper introduces novel scheduling heuristics for real-time jobs on HPC systems and provides a comprehensive simulation-based analysis of their effectiveness.
Findings
Real-time jobs can be scheduled effectively with minimal impact on batch jobs.
Just-in-time checkpointing improves real-time job slowdowns by 35%.
Batch job slowdowns increase by only 10% when real-time jobs make up 10% of the workload.
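
The slowdown figures above are easier to read with a definition in hand. The following is a minimal sketch, assuming the commonly used definition of slowdown as turnaround time (wait time plus run time) divided by run time; this summary does not say which variant the paper uses. The function name `slowdown` and the job numbers are illustrative and are not results from the paper.

```python
def slowdown(wait_time: float, run_time: float) -> float:
    """Turnaround time divided by run time (assumed definition)."""
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    return (wait_time + run_time) / run_time


# Hypothetical batch job: waits 30 minutes, then runs for 60 minutes.
baseline = slowdown(30.0, 60.0)  # (30 + 60) / 60 = 1.5

# A 10% increase in slowdown would take this job from 1.5 to about 1.65.
increased = baseline * 1.10
```

Under this definition a slowdown of 1.0 means the job never waited, so a lower value is better for both batch and real-time jobs.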
Abstract
Increasing data volumes in scientific experiments necessitate the use of high-performance computing (HPC) resources for data analysis. In many scientific fields, the data generated from scientific instruments and supercomputer simulations must be analyzed rapidly. In fact, the requirement for quasi-instant feedback is growing. Scientists want to use results from one experiment to guide the selection of the next or even to improve the course of a single experiment. Current HPC systems are typically batch-scheduled under policies in which an arriving job is run immediately only if enough resources are available; otherwise, it is queued. It is hard for these systems to support real-time jobs. To meet their requirements, real-time jobs must sometimes preempt batch jobs and/or be scheduled ahead of batch jobs that were submitted earlier. Accommodating real-time jobs may…
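
The batch policy and the two real-time options described in the abstract (preempting batch jobs, or jumping ahead of earlier batch submissions) can be sketched as a toy model. This is a sketch under stated assumptions, not the paper's heuristics: the class `ToyCluster`, the `Job` fields, the choice to preempt the most recently started batch jobs first, and the queue ordering are all hypothetical, and checkpointing is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    nodes: int
    real_time: bool = False


@dataclass
class ToyCluster:
    total_nodes: int
    running: list = field(default_factory=list)  # jobs currently holding nodes
    queue: deque = field(default_factory=deque)  # waiting jobs, head runs first

    @property
    def free_nodes(self) -> int:
        return self.total_nodes - sum(j.nodes for j in self.running)

    def submit(self, job: Job) -> str:
        # Batch policy from the abstract: run now only if enough nodes are free.
        if job.nodes <= self.free_nodes:
            self.running.append(job)
            return "started"
        if job.real_time:
            return self._preempt_for(job)
        self.queue.append(job)
        return "queued"

    def _preempt_for(self, job: Job) -> str:
        # Assumption: pick the most recently started batch jobs as victims.
        victims = []
        freed = self.free_nodes
        for cand in reversed(self.running):
            if freed >= job.nodes:
                break
            if not cand.real_time:
                victims.append(cand)
                freed += cand.nodes
        if freed < job.nodes:
            # Preemption cannot free enough nodes: go ahead of waiting jobs.
            self.queue.appendleft(job)
            return "queued-ahead"
        for v in victims:
            self.running.remove(v)
            self.queue.appendleft(v)  # preempted jobs wait to be resumed
        self.running.append(job)
        return "started-after-preemption"


cluster = ToyCluster(total_nodes=8)
cluster.submit(Job("b1", 6))                  # "started"
cluster.submit(Job("b2", 4))                  # "queued" (only 2 nodes free)
cluster.submit(Job("rt", 4, real_time=True))  # "started-after-preemption"
```

In the usage lines, the real-time job displaces `b1`, which returns to the head of the queue ahead of `b2`. A real scheduler would also need to decide how preempted work is saved and resumed, which is where the checkpointing finding applies.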
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
