A Holistic Approach to Log Data Analysis in High-Performance Computing Systems: The Case of IBM Blue Gene/Q
Alina Sîrbu, Ozalp Babaoglu

TL;DR
This paper presents a comprehensive analysis of IBM Blue Gene/Q high-performance computing system logs, revealing diverse correlations among power, temperature, workload, and events to inform future predictive management models.
Contribution
It offers a detailed multi-scale characterization of system logs from multiple data sources, highlighting correlation patterns crucial for predictive modeling in HPC systems.
Findings
Low correlation among components for temperature and power
Higher correlation among components for hardware/software events
Power strongly correlates with temperature, negatively with events
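
The component-level findings above rest on pairwise correlation between per-component time series. The following is a minimal sketch of that kind of measurement, assuming Pearson correlation averaged over all component pairs; the paper's exact method is not stated here. The component names (`rack0`, ...) and all values are synthetic placeholders, not data from the Blue Gene/Q logs.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def mean_pairwise_correlation(series_by_component):
    """Average Pearson correlation over all distinct pairs of components."""
    names = sorted(series_by_component)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    total = sum(
        pearson(series_by_component[a], series_by_component[b]) for a, b in pairs
    )
    return total / len(pairs)


# Synthetic placeholder data (assumption): NOT taken from the Blue Gene/Q logs.
rng = random.Random(0)
shared = [rng.gauss(0, 1) for _ in range(500)]

# Components whose readings vary independently of one another.
independent = {f"rack{i}": [rng.gauss(0, 1) for _ in range(500)] for i in range(4)}

# Components driven by a common signal plus a little local noise.
coupled = {f"rack{i}": [s + 0.3 * rng.gauss(0, 1) for s in shared] for i in range(4)}

low = mean_pairwise_correlation(independent)
high = mean_pairwise_correlation(coupled)
print(f"independent components: {low:.2f}, coupled components: {high:.2f}")
```

In this toy setup the independent series play the role of a metric with low correlation among components, and the coupled series the role of one with higher correlation. Real logs would first need resampling onto a common time grid before the series can be compared.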
Abstract
The complexity and cost of managing high-performance computing infrastructures are on the rise. Automating management and repair through predictive models to minimize human interventions is an attempt to increase system availability and contain these costs. Building predictive models that are accurate enough to be useful in automatic management cannot be based on restricted log data from subsystems but requires a holistic approach to data analysis from disparate sources. Here we provide a detailed multi-scale characterization study based on four datasets reporting power consumption, temperature, workload, and hardware/software events for an IBM Blue Gene/Q installation. We show that the system runs a rich parallel workload, with low correlation among its components in terms of temperature and power, but higher correlation in terms of events. As expected, power and temperature correlate…
Taxonomy
Topics: Cloud Computing and Resource Management · Software System Performance and Reliability · IoT and Edge/Fog Computing
