Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers
Thomas Jakobsche, Nicolas Lachiche, Florina M. Ciorba

TL;DR
This paper explores how machine learning can enhance monitoring and data analytics in the design and operation of supercomputers, emphasizing the need for collaboration between HPC and ML communities to improve system efficiency and reliability.
Contribution
It highlights the challenges and opportunities of applying ML to HPC monitoring and analytics, advocating for closer collaboration to unlock data-driven insights.
Findings
ML can improve HPC system efficiency and reliability
Close collaboration between HPC and ML communities is essential
Data-driven analysis can surpass expert-based rules
Abstract
This work examines the challenges and opportunities of Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is employed to gain insights into the behavior of current High Performance Computing (HPC) systems to improve system efficiency, performance, and reliability (e.g. through optimizing cooling infrastructure, job scheduling, and application parameter tuning). In this work, we take the position that QCS in general, and MODA in particular, require close exchange with the ML community to realize the full potential of data-driven analysis for the benefit of existing and future HPC systems. This exchange will facilitate identifying the appropriate ML methods to gain insights into current HPC systems and to go beyond expert-based knowledge and rules of thumb.
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
