Energy-aware checkpointing of divisible tasks with soft or hard deadlines
Guillaume Aupy, Anne Benoit, Rami Melhem, Paul Renaud-Goud, and Yves Robert

TL;DR
This paper investigates energy-efficient checkpointing strategies for divisible workloads under soft and hard deadlines. It optimizes the number of chunks, their sizes, and their execution and re-execution speeds to minimize expected energy consumption.
Contribution
It introduces new models for energy-aware checkpointing with deadlines, providing exact solutions or optimization functions, and compares these models through extensive experiments.
Findings
Optimal checkpointing strategies reduce energy consumption under deadlines.
Different models perform variably depending on deadline strictness.
Proposed solutions effectively balance energy use and resilience.
Abstract
In this paper, we aim to minimize the energy consumption when executing a divisible workload under a bound on the total execution time, while resilience is provided through checkpointing. We discuss several variants of this multi-criteria problem. Given the workload, we need to decide how many chunks to use, what size each chunk should be, and at which speed each chunk is executed. Furthermore, since a failure may occur during the execution of a chunk, we also need to decide at which speed a chunk should be re-executed in the event of a failure. The goal is to minimize the expectation of the total energy consumption while enforcing a deadline on the execution time, which must be met either in expectation (soft deadline) or in the worst case (hard deadline). For each problem instance, we propose either an exact solution or a function that can be optimized numerically. The…
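To make the difference between the soft and hard deadline concrete, here is a minimal Python sketch for a single chunk. It is an illustration under stated assumptions, not the paper's model: power is assumed to be the cube of the speed, failures are assumed exponentially distributed with rate `lam`, at most one failure per chunk is assumed, and a failure is assumed to be detected only at the end of the first attempt. The names `chunk_metrics`, `best_speeds`, `p_ckpt`, and the discrete speed set are all hypothetical.

```python
import math

def chunk_metrics(w, s1, s2, C, lam, p_ckpt=1.0):
    """Energy and time of one chunk of work w (assumed model, not the paper's).

    s1: speed of the first execution, s2: speed of the re-execution,
    C: checkpoint time, lam: failure rate, p_ckpt: assumed checkpoint power.
    """
    t1 = w / s1 + C                        # first attempt plus checkpoint
    t2 = w / s2 + C                        # re-execution plus checkpoint
    p_fail = 1.0 - math.exp(-lam * t1)     # failure during the first attempt
    e1 = w * s1 ** 2 + C * p_ckpt          # power s^3 times duration w/s
    e2 = w * s2 ** 2 + C * p_ckpt
    return {
        "expected_energy": e1 + p_fail * e2,
        "expected_time": t1 + p_fail * t2,  # bounded by a soft deadline
        "worst_time": t1 + t2,              # bounded by a hard deadline
    }

def best_speeds(w, C, lam, deadline, hard, speeds):
    """Exhaustive search over a discrete speed set (a stand-in for the
    exact or numerical optimization mentioned in the abstract)."""
    best = None
    for s1 in speeds:
        for s2 in speeds:
            m = chunk_metrics(w, s1, s2, C, lam)
            t = m["worst_time"] if hard else m["expected_time"]
            if t <= deadline and (best is None or m["expected_energy"] < best[0]):
                best = (m["expected_energy"], s1, s2)
    return best  # (expected energy, s1, s2), or None if infeasible

speeds = [0.5, 1.0, 1.5, 2.0]
soft = best_speeds(w=10, C=1, lam=0.01, deadline=20, hard=False, speeds=speeds)
hard = best_speeds(w=10, C=1, lam=0.01, deadline=20, hard=True, speeds=speeds)
print("soft:", soft)
print("hard:", hard)
```

Under these assumptions, the soft deadline lets the re-execution run slowly because it is weighted by the small failure probability, whereas the hard deadline forces a faster and therefore more energy-hungry re-execution speed.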
Taxonomy
Topics: Distributed systems and fault tolerance · Real-Time Systems Scheduling · Distributed and Parallel Computing Systems
