Heterogeneity-aware Fault Tolerance using a Self-Organizing Runtime System

Mario Kicherer; Wolfgang Karl

arXiv:1405.2912 · cs.OS · May 14, 2014 · 1 cites

TL;DR

This paper presents a self-organizing runtime system that exploits heterogeneity in computing units for fault tolerance through dynamic task mapping, duplication, and majority voting, enhancing reliability in diverse systems.

Contribution

It introduces a novel extension to an online-learning runtime that combines performance optimization with fault tolerance by leveraging heterogeneity-aware strategies and dynamic benefit assessment.

Findings

01. Improved fault detection and tolerance in heterogeneous systems.

02. Effective dynamic resource management and data transfer mechanisms.

03. Enhanced system reliability without significant performance loss.

Abstract

Due to the diversity and implicit redundancy in terms of processing units and compute kernels, off-the-shelf heterogeneous systems offer the opportunity to detect and tolerate faults during task execution in hardware as well as in software. To automatically leverage this diversity, we introduce an extension of an online-learning runtime system that combines the benefits of the existing performance-oriented task mapping with task duplication, a diversity-oriented mapping strategy, and a heterogeneity-aware majority voter. This extension uses a new metric to dynamically rate the remaining benefit of unreliable processing units and a memory management mechanism for automatic data transfers and checkpointing in the host and device memories.
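
The combination of task duplication and majority voting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`run_replicated`, `majority_vote`), the unit labels, and the numeric tolerance `tol` are hypothetical. The tolerance reflects one assumption about what "heterogeneity-aware" could mean, namely that different processing units may return slightly different floating-point results for the same task.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Result = Sequence[float]


def _close(a: Result, b: Result, tol: float) -> bool:
    # Two results agree if they have equal length and match element-wise within tol.
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def majority_vote(
    results: Dict[str, Result], tol: float = 1e-6
) -> Tuple[Optional[Result], List[str]]:
    """Return (agreed result, units that disagreed).

    If no strict majority of units agrees, return (None, all units).
    """
    units = list(results)
    best_group: List[str] = []
    for u in units:
        # Collect every unit whose result agrees with the result of unit u.
        group = [v for v in units if _close(results[u], results[v], tol)]
        if len(group) > len(best_group):
            best_group = group
    if len(best_group) * 2 <= len(units):
        return None, units
    suspects = [u for u in units if u not in best_group]
    return results[best_group[0]], suspects


def run_replicated(
    task_input: Result, kernels: Dict[str, Callable[[Result], Result]]
) -> Tuple[Optional[Result], List[str]]:
    # Hypothetical duplication step: run the same task once per unit, then vote.
    results = {unit: kernel(task_input) for unit, kernel in kernels.items()}
    return majority_vote(results)
```

A caller would pass one kernel variant per processing unit, for example a CPU and a GPU implementation of the same task. The list of disagreeing units returned by the voter is the kind of signal a runtime could feed into a reliability rating of each unit, although the metric the paper actually uses is not specified in this abstract.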

Taxonomy

Topics: Radiation Effects in Electronics · Advanced Memory and Neural Computing · Distributed systems and fault tolerance