The RECIPE Approach to Challenges in Deeply Heterogeneous High Performance Systems
Giovanni Agosta, William Fornaciari, David Atienza, Ramon Canal, Alessandro Cilardo, José Flich Cardo, Carles Hernandez Luz, Michal Kulczewski, Giuseppe Massari, Rafael Tornero Gavilá, Marina Zapater

TL;DR
RECIPE introduces a hierarchical resource management framework for heterogeneous HPC systems to optimize energy use, ensure reliability, and meet application time constraints, with predictive models for thermal and reliability management.
Contribution
The paper presents a novel hierarchical management approach that integrates predictive reliability and thermal models to improve performance and hardware longevity in exascale heterogeneous systems.
Findings
Prediction accuracy significantly affects checkpointing overheads.
Hierarchical management improves energy efficiency and reliability.
Application to weather forecasting demonstrates practical benefits.
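To make the first finding concrete, the sketch below uses a textbook first-order checkpointing model, not the model from the paper. It combines Young's approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF), with one strong assumption: a failure predictor with a given recall lets the runtime checkpoint just before every predicted failure, so only unpredicted failures drive periodic checkpointing. False alarms and the cost of proactive checkpoints are ignored, and all function names and numeric values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(checkpoint_cost_s, mtbf_s, recall=0.0):
    """First-order fraction of time lost to checkpoints plus rework.

    Assumption (not from the paper): predicted failures are handled by a
    proactive checkpoint at no modeled cost, so only the unpredicted share
    (1 - recall) of failures matters for periodic checkpointing.
    """
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    # Fewer unpredicted failures means a longer effective MTBF.
    effective_mtbf = mtbf_s / (1.0 - recall)
    interval = young_interval(checkpoint_cost_s, effective_mtbf)
    # Checkpoint cost per interval plus expected rework after a failure.
    return checkpoint_cost_s / interval + interval / (2.0 * effective_mtbf)


if __name__ == "__main__":
    # Hypothetical values: 60 s checkpoint, 24 h mean time between failures.
    for recall in (0.0, 0.5, 0.75, 0.9):
        frac = overhead_fraction(60.0, 24 * 3600.0, recall)
        print(f"recall={recall:.2f}  overhead={frac:.4%}")
```

Under this simplified model the overhead scales with sqrt(1 - recall), so a predictor that catches 75% of failures halves the modeled checkpointing overhead relative to no prediction. A fuller model would also charge for false positives, which is where predictor precision enters.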
Abstract
RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximize hardware lifetime and guarantee…
