Heterogeneity-aware Fault Tolerance using a Self-Organizing Runtime System
Mario Kicherer, Wolfgang Karl

TL;DR
This paper presents a self-organizing runtime system that exploits heterogeneity in computing units for fault tolerance through dynamic task mapping, duplication, and majority voting, enhancing reliability in diverse systems.
Contribution
It introduces a novel extension to an online-learning runtime that combines performance optimization with fault tolerance by leveraging heterogeneity-aware strategies and dynamic benefit assessment.
Findings
Improved fault detection and tolerance in heterogeneous systems.
Effective dynamic resource management and data transfer mechanisms.
Enhanced system reliability without significant performance loss.
Abstract
Due to the diversity and implicit redundancy in terms of processing units and compute kernels, off-the-shelf heterogeneous systems offer the opportunity to detect and tolerate faults during task execution in hardware as well as in software. To automatically leverage this diversity, we introduce an extension of an online-learning runtime system that combines the benefits of the existing performance-oriented task mapping with task duplication, a diversity-oriented mapping strategy, and a heterogeneity-aware majority voter. This extension uses a new metric to dynamically rate the remaining benefit of unreliable processing units and a memory management mechanism for automatic data transfers and checkpointing in the host and device memories.
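The combination of task duplication and majority voting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `results_agree` and `majority_vote`, the unit labels, and the tolerance-based comparison are all hypothetical. The tolerance is an assumption made here because different kinds of processing units may return slightly different floating-point results for the same task.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def results_agree(a: Sequence[float], b: Sequence[float],
                  rel_tol: float = 1e-6) -> bool:
    """Treat two result vectors as equal if every element matches within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12) for x, y in zip(a, b)
    )


def majority_vote(results: Dict[str, Sequence[float]],
                  rel_tol: float = 1e-6) -> Tuple[Optional[List[float]], List[str]]:
    """Vote over the results of one task duplicated on several units.

    Returns (winning result, units that disagreed with it), or
    (None, all units) when no strict majority exists.
    """
    if not results:
        raise ValueError("at least one result is required")

    # Group units whose results agree with the first member of a group.
    # Simplification: tolerance-based equality is not transitive.
    groups: List[List[str]] = []
    for unit, value in results.items():
        for group in groups:
            if results_agree(results[group[0]], value, rel_tol):
                group.append(unit)
                break
        else:
            groups.append([unit])

    best = max(groups, key=len)
    if len(best) * 2 <= len(results):
        return None, sorted(results)  # no strict majority: nothing can be trusted

    suspects = sorted(unit for unit in results if unit not in best)
    return list(results[best[0]]), suspects


# Hypothetical example: the same task ran on three different units.
outputs = {
    "cpu": [1.0, 2.0],
    "gpu": [1.0000000001, 2.0],  # tiny floating-point deviation, still agrees
    "fpga": [1.0, 3.0],          # faulty result
}
value, suspects = majority_vote(outputs)
```

In a runtime like the one described, the list of disagreeing units could feed a per-unit reliability rating, but how the paper's metric is actually computed is not stated in the abstract and is not modeled here.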
Taxonomy
Topics: Radiation Effects in Electronics · Advanced Memory and Neural Computing · Distributed systems and fault tolerance
