Robust identification of thermal models for in-production High-Performance-Computing clusters with machine learning-based data selection

Federico Pittino; Roberto Diversi; Luca Benini; Andrea Bartolini

arXiv:1810.01865·cs.LG·November 8, 2018

TL;DR

This paper presents a machine learning approach for selecting data traces suitable for thermal model identification in large-scale, in-production HPC systems. The identified models reach an average error below the sensors' 1°C quantization step despite workload variability and measurement nonidealities such as quantization.

Contribution

The paper introduces a machine learning-based method for selecting the data traces that enable accurate thermal model identification in in-production HPC systems.

Findings

01. The identified thermal models achieve an average error below the sensors' quantization step of 1°C.

02. Deep learning techniques select suitable data traces correctly up to 96% of the time.

03. Not all workloads produce data suitable for accurate thermal modeling.

Abstract

Power and thermal management are critical components of High-Performance-Computing (HPC) systems, due to their high power density and large total power consumption. The assessment of thermal dissipation by means of compact models directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large-scale systems "in production", the accuracy of learned thermal models depends on the dynamics of the power excitation, which in turn depends on the executed workload, and on measurement nonidealities, such as quantization. In this paper we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors' quantization step of 1°C) for a large-scale HPC system on real workloads for very long time…
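The dependence of model accuracy on power excitation and quantization can be illustrated with a toy example. The sketch below is not the identification algorithm used in the paper. It assumes a hypothetical first-order discrete-time model, x[k+1] = a·x[k] + b·P[k] with x the temperature rise over a known ambient, and fits it by ordinary least squares. The parameter values, the synthetic power traces, and the function names `simulate` and `fit_first_order` are all illustrative assumptions.

```python
# Toy illustration only: a first-order thermal model fitted by ordinary
# least squares. Model, parameters, and traces are assumptions, not the
# paper's method or data.

def simulate(power, a, b, t_amb, x0=0.0, step=1.0):
    """Return (true, quantized) temperatures for x[k+1] = a*x[k] + b*P[k]."""
    x, true_t = x0, []
    for p in power:
        true_t.append(t_amb + x)
        x = a * x + b * p
    # Sensor model: readings rounded to the quantization step (1 degree C here).
    quantized = [round(t / step) * step for t in true_t]
    return true_t, quantized


def fit_first_order(temps, power, t_amb, tol=1e-9):
    """Least-squares estimate of (a, b); None if the trace lacks excitation."""
    x = [t - t_amb for t in temps]
    n = len(x) - 1
    sxx = sum(x[k] * x[k] for k in range(n))
    spp = sum(power[k] * power[k] for k in range(n))
    sxp = sum(x[k] * power[k] for k in range(n))
    sxy = sum(x[k] * x[k + 1] for k in range(n))
    spy = sum(power[k] * x[k + 1] for k in range(n))
    # Determinant of the 2x2 normal equations; near zero means the
    # regressors x[k] and P[k] are (almost) collinear.
    det = sxx * spp - sxp * sxp
    if det <= tol * sxx * spp:
        return None
    a = (sxy * spp - spy * sxp) / det
    b = (spy * sxx - sxy * sxp) / det
    return a, b


A, B, T_AMB = 0.95, 0.02, 25.0               # assumed "true" parameters
square = ([50.0] * 40 + [150.0] * 40) * 5    # workload with power swings (W)
flat = [100.0] * 400                         # constant-power workload (W)

true_sq, quant_sq = simulate(square, A, B, T_AMB)
# Start the flat trace at its thermal steady state, b*P/(1-a).
_, quant_flat = simulate(flat, A, B, T_AMB, x0=B * 100.0 / (1 - A))

fit_clean = fit_first_order(true_sq, square, T_AMB)    # ideal sensor
fit_quant = fit_first_order(quant_sq, square, T_AMB)   # 1 degree C sensor
fit_flat = fit_first_order(quant_flat, flat, T_AMB)    # no excitation

print("ideal sensor, varying power:    ", fit_clean)
print("quantized sensor, varying power:", fit_quant)
print("quantized sensor, constant power:", fit_flat)
```

In this toy setting the trace with power swings still yields usable estimates after quantization, while the constant-power trace at steady state produces a flat quantized reading from which the two parameters cannot be separated, so the fit is rejected.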
