End-to-End Predictions-Based Resource Management Framework for Supercomputer Jobs
Swetha Hariharan, Prakash Murali, Abhishek Pasari, Sathish Vadhiyar

TL;DR
This paper presents an end-to-end resource management framework for supercomputers that uses adaptive predictions of queue waiting and execution times to optimize job response times, demonstrating significant improvements over existing methods.
Contribution
The paper introduces a framework that adaptively predicts queue waiting and execution times and uses these predictions to optimize resource allocation and job scheduling on supercomputers.
Findings
Significant reduction in job response times in simulations.
Improved prediction accuracy for queue and execution times.
Effective dynamic adjustment of job submission parameters.
Abstract
Job submissions of parallel applications to production supercomputer systems will have to be carefully tuned in terms of the job submission parameters to obtain minimum response times. In this work, we have developed an end-to-end resource management framework that uses predictions of queue waiting and execution times to minimize response times of user jobs submitted to supercomputer systems. Our method for predicting queue waiting times adaptively chooses a prediction method based on the cluster structure of similar jobs. Our strategy for execution time predictions dynamically learns the impact of load on execution times and uses this to predict a set of execution time ranges for the target job. We have developed two resource management techniques that employ these predictions, one that selects the number of processors for execution and the other that also dynamically changes the job…
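As a rough illustration of the first resource management technique (selecting the number of processors), the sketch below picks the candidate processor count with the smallest predicted response time, taken here as predicted queue wait plus predicted execution time. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `pick_processor_count` and the two toy predictors are hypothetical stand-ins for the paper's adaptive queue-wait and execution-time predictors, and the numbers are made up.

```python
from typing import Callable, Iterable, Tuple


def pick_processor_count(
    candidates: Iterable[int],
    predict_wait: Callable[[int], float],
    predict_exec: Callable[[int], float],
) -> Tuple[int, float]:
    """Return (processors, predicted_response) minimizing wait + execution time.

    Both predictors are assumed to return seconds for a given processor count.
    """
    best = None
    for p in candidates:
        response = predict_wait(p) + predict_exec(p)
        if best is None or response < best[1]:
            best = (p, response)
    if best is None:
        raise ValueError("no candidate processor counts given")
    return best


# Hypothetical toy predictors (illustrative only, not from the paper):
def toy_wait(p: int) -> float:
    return 2.0 * p  # assumption: larger requests wait longer in the queue


def toy_exec(p: int) -> float:
    return 3600.0 / p  # assumption: ideal strong scaling


choice = pick_processor_count([16, 32, 64, 128], toy_wait, toy_exec)
print(choice)
```

With these toy predictors the trade-off is visible: very small requests start quickly but run long, while very large requests run quickly but wait long, so an intermediate count wins.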
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
