The importance and need for system monitoring and analysis in HPC operations and research
Florina M. Ciorba

TL;DR
This paper emphasizes the importance of comprehensive system monitoring and analysis in HPC to improve system understanding, reliability, and efficiency, proposing a holistic model based on extensive data collection.
Contribution
It introduces a vision for a holistic HPC system model derived from extensive monitoring data, aiming to enhance system design, maintenance, and research capabilities.
Findings
Monitoring data collected at various levels of an HPC system can expose the intricate interactions between software and hardware.
A comprehensive model can improve HPC system reliability.
Recommendations for implementing holistic monitoring are provided.
Abstract
In this work, system monitoring and analysis are discussed in terms of their significance and benefits for operations and research in the field of high-performance computing (HPC). HPC systems deliver unique insights to computational scientists from different disciplines. It is argued that research in HPC is also computational in nature, given the massive amounts of monitoring data collected at various levels of an HPC system. The vision of a comprehensive system model developed based on holistic monitoring and analysis is also presented. The goal and expected outcome of such a model is an improved understanding of the intricate interactions between today's software and hardware, and their diverse usage patterns. The associated modeling, monitoring, and analysis challenges are reviewed and discussed. The envisioned comprehensive system model will provide the ability to design future…
