Towards Management of Energy Consumption in HPC Systems with Fault Tolerance
Marina Morán, Javier Balladini, Dolores Rexachs, Enzo Rucci

TL;DR
This paper proposes strategies to manage the energy consumption of the processes that keep running when a failure occurs in a high-performance computing system that uses uncoordinated checkpointing for fault tolerance.
Contribution
It introduces an energy model and simulation-based analysis of strategies to reduce energy use during failures in HPC systems with uncoordinated checkpointing.
Findings
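Managing the energy consumption of the processes that keep running after a failure is feasible under uncoordinated checkpointing.
The proposed energy model makes it possible to evaluate and compare the strategies.
Simulations of an application under different configurations and failure times show how the strategies behave.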
Abstract
High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise, and finding ways to limit and/or decrease it is a crucial point in current research. For high-performance MPI applications, there are rollback-recovery-based fault tolerance methods, such as uncoordinated checkpoints. These methods allow only some processes to roll back in the face of failure, while the rest of the processes continue to run. In this article, we focus on the processes that continue execution, and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies and, through simulation, we analyze the behavior of an application under different configurations and failure times. As a result, we show the feasibility of improving energy…
