TL;DR
This paper investigates the non-technical human factors leading to HPC job termination, quantifies the resulting wasted computation, and suggests that addressing these issues can enhance the overall ROI of HPC clusters.
Contribution
It provides a case study analyzing user-initiated job termination reasons and quantifies associated wasted computation, highlighting non-technical factors impacting HPC efficiency.
Findings
Users terminate jobs for various non-technical reasons.
User-terminated jobs contribute significantly to wasted computation.
Reducing user-initiated job failures can improve HPC ROI.
Abstract
Given the cost of HPC clusters, making the best use of them is crucial to improve infrastructure ROI. Likewise, reducing failed HPC jobs and related waste in terms of user wait times is crucial to improve HPC user productivity (aka human ROI). While most efforts (e.g., debugging HPC programs) explore technical aspects to improve the ROI of HPC clusters, we hypothesize that non-technical (human) aspects are worth exploring to make non-trivial ROI gains; specifically, understanding non-technical aspects and how they contribute to the failure of HPC jobs. In this regard, we conducted a case study in the context of the Beocat cluster at Kansas State University. The purpose of the study was to learn the reasons why users terminate jobs and to quantify wasted computations in such jobs in terms of system utilization and user wait time. The data from the case study helped identify interesting and actionable…
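As a rough illustration of what quantifying waste "in terms of system utilization and user wait time" could look like, here is a minimal sketch. It is not the paper's method: the `JobRecord` fields and the three helper functions are hypothetical, and it assumes, as a simplification, that all core-hours and queue wait of a user-terminated job count as waste.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    # Hypothetical fields for illustration; not the Beocat accounting schema.
    cores: int
    queue_wait_hours: float
    run_hours: float
    user_terminated: bool


def wasted_core_hours(jobs: List[JobRecord]) -> float:
    # Assumption: every core-hour used by a user-terminated job is wasted.
    return sum(j.cores * j.run_hours for j in jobs if j.user_terminated)


def wasted_wait_hours(jobs: List[JobRecord]) -> float:
    # Assumption: time spent queued for a later-terminated job is wasted wait.
    return sum(j.queue_wait_hours for j in jobs if j.user_terminated)


def wasted_fraction(jobs: List[JobRecord]) -> float:
    # Share of all consumed core-hours that went to user-terminated jobs.
    total = sum(j.cores * j.run_hours for j in jobs)
    return wasted_core_hours(jobs) / total if total else 0.0


# Made-up example records, not data from the study.
jobs = [
    JobRecord(cores=16, queue_wait_hours=2.0, run_hours=4.0, user_terminated=True),
    JobRecord(cores=8, queue_wait_hours=0.5, run_hours=10.0, user_terminated=False),
    JobRecord(cores=32, queue_wait_hours=1.0, run_hours=0.5, user_terminated=True),
]

print(wasted_core_hours(jobs), wasted_wait_hours(jobs), wasted_fraction(jobs))
```

A real analysis would likely need to separate jobs that produced partial, usable results from those whose output was discarded, which this sketch does not attempt.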
