DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems
Alessio Netti, Micha Mueller, Carla Guillen, Michael Ott, Daniele Tafani, Gence Ozer, Martin Schulz

TL;DR
Wintermute is a flexible, scalable framework that enables real-time operational data analytics on large HPC systems, addressing the lack of comprehensive solutions for holistic system management.
Contribution
The paper introduces Wintermute, a novel generic framework for online operational data analytics on large-scale HPC installations. Its design is based on a literature survey of common operational requirements, and it is implemented on top of DCDB.
Findings
Wintermute supports diverse ODA applications with configurable options.
It demonstrates low resource usage in practical case studies.
The framework enhances holistic HPC system management.
Abstract
As we approach the exascale era, the size and complexity of HPC systems continue to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize efficiency and effectiveness of system operations. However, while monitoring is a common reality in HPC, there is no well-stated and comprehensive list of requirements, nor matching frameworks, to support holistic and online ODA. This leads to insular ad-hoc solutions, each addressing only specific aspects of the problem. In this paper we propose Wintermute, a novel generic framework to enable online ODA on large-scale HPC installations. Its design is based on the results of a literature survey of common operational requirements. We implement Wintermute on top of the holistic DCDB…
