Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption

Kfir Zvi; Gal Oren

arXiv:2102.12953·cs.DC·February 26, 2021·1 cites

Open Access

TL;DR

This paper introduces a novel memoryless fair-share scheduling method for HPC resources that uses transparent checkpoint-restart preemption to improve system utilization and fairness without additional costs or penalties.

Contribution

It presents a new scheduling approach that enhances resource allocation efficiency and fairness in supercomputing by leveraging transparent checkpoint-restart preemption.

Findings

01. Increased system utilization and fairness.

02. No additional costs or penalties for users.

03. Enhanced resource allocation efficiency.

Abstract

Common resource management methods in supercomputing systems usually include hard divisions, capping, and quota allotment. Despite their 'advantages', these methods have known, serious disadvantages, including suboptimal utilization of an expensive facility and an occasional need to dynamically reschedule and reallocate resources. Consequently, they amount to poor supply-and-demand management rather than a free-market playground that would eventually increase system utilization and productivity. In this work, we propose Optimized Memoryless Fair-Share HPC Resources Scheduling using Transparent Checkpoint-Restart Preemption, in which social welfare increases through a free-of-cost, interchangeable proprietary possession scheme. Accordingly, we permanently keep the status quo regarding the fairness of the resource distribution while maximizing…

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques