Precision-Aware application execution for Energy-optimization in HPC node system
Radim Vavřík, Antoni Portero, Štěpán Kuchař, Martin Golasowski, Simone Libutti, Giuseppe Massari, William Fornaciari, Vít Vondrák

TL;DR
This paper presents a precision-aware energy optimization approach for HPC systems, using a real-time resource manager to balance power consumption, performance, and application precision in disaster management scenarios.
Contribution
It introduces a runtime system that dynamically adjusts application precision and resource allocation based on disaster risk, optimizing energy use without significant performance loss.
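As a minimal sketch of what risk-driven precision selection could look like, the Python below maps a risk estimate to an operating point (iteration count plus core request). It is an illustration under stated assumptions, not the paper's resource manager: the names `OperatingPoint`, `select_operating_point`, and `estimated_runtime`, the `high_risk` label, the 0.5 threshold, and the core counts are all hypothetical. Only the two iteration counts (1e3 and 1e4) are taken from the findings reported on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """One precision/resource configuration (hypothetical structure)."""
    name: str
    iterations: int  # more iterations = higher precision, longer run
    cores: int       # cores requested from the resource manager


# Iteration counts follow the findings above; core counts are assumed.
NORMAL = OperatingPoint("normal", iterations=1_000, cores=4)
HIGH_RISK = OperatingPoint("high_risk", iterations=10_000, cores=16)


def select_operating_point(risk: float, threshold: float = 0.5) -> OperatingPoint:
    """Pick the cheaper point unless the risk estimate reaches the threshold."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be in [0, 1]")
    return HIGH_RISK if risk >= threshold else NORMAL


def estimated_runtime(point: OperatingPoint, core_seconds_per_iteration: float) -> float:
    """Rough runtime estimate, assuming cost is linear in iterations
    and the work divides evenly across cores (both are assumptions)."""
    return point.iterations * core_seconds_per_iteration / point.cores
```

Under the linear-cost assumption, moving from `NORMAL` to `HIGH_RISK` multiplies the work by ten, so the extra cores only partly offset the longer run. That is the kind of trade-off a runtime manager would have to weigh against the energy budget.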
Findings
Model execution improves precision by 65% on average.
Increasing iterations from 1e3 to 1e4 extends execution time by an order of magnitude.
RTOS computes optimal precision-performance trade-offs with less than 10% overhead.
Abstract
Power consumption is a critical consideration in high-performance computing systems, and it is becoming the limiting factor in building and operating Petascale and Exascale systems. When studying the power consumption of existing systems running HPC workloads, we find that power, energy, and performance are closely related, which makes it possible to optimize energy consumption while sacrificing little or no performance. In this paper, we propose an HPC system running a GNU/Linux OS and a Real Time Resource Manager (RTRM) that is aware of and monitors the health of the platform. An application for disaster management runs on the system. The application can run with different QoS depending on the situation. We defined two main situations. Normal execution, when there is no risk of a disaster, even though we still have to run the system to look ahead in the near future if the…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
