On-Demand Redundancy Grouping: Selectable Soft-Error Tolerance for a Multicore Cluster
Michael Rogenmoser, Nils Wistoff, Pirmin Vogel, Frank Gürkaynak, Luca Benini

TL;DR
This paper presents On-Demand Redundancy Grouping (ODRG), a flexible architectural scheme for multicore clusters that lets the cluster switch at run time between soft-error-tolerant operation and full-performance operation, with low area and timing overhead.
Contribution
It introduces a novel run-time configurable fault-tolerance scheme for multicore clusters, allowing dynamic switching between fault-tolerant and high-performance modes.
Findings
The ODRG unit adds less than 11% of one core's area per three-core group, about 1% of total cluster area
Negligible timing increase with the ODRG scheme
2.5× faster fault recovery compared to state-of-the-art
Abstract
With the shrinking of technology nodes and the use of parallel processor clusters in hostile and critical environments, such as space, run-time faults caused by radiation are a serious cross-cutting concern, also impacting architectural design. This paper introduces an architectural approach to run-time configurable soft-error tolerance at the core level, augmenting a six-core open-source RISC-V cluster with a novel On-Demand Redundancy Grouping (ODRG) scheme. ODRG allows the cluster to operate either as two fault-tolerant cores, or six individual cores for high performance, with limited overhead to switch between these modes during run-time. The ODRG unit adds less than 11% of a core's area for a three-core group, or a total of 1% of the cluster area, and shows negligible timing increase, which compares favorably to a commercial state-of-the-art implementation, and is 2.5…
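The two operating modes described in the abstract can be illustrated with a small behavioral sketch. This is not the paper's hardware design. It assumes that each three-core group is resolved by a bitwise 2-of-3 majority vote, that the groups are cores 0-2 and 3-5, and that each core produces one 32-bit output word per step. The names `Mode`, `majority_vote`, and `group_outputs` are hypothetical and chosen only for this illustration.

```python
from enum import Enum

WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit core output words


class Mode(Enum):
    INDEPENDENT = "independent"  # six individual cores
    REDUNDANT = "redundant"      # two fault-tolerant three-core groups


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each result bit takes the value held by at least two inputs."""
    return ((a & b) | (a & c) | (b & c)) & WORD_MASK


def group_outputs(mode: Mode, outputs: list[int]) -> list[tuple[int, list[int]]]:
    """Return (value, mismatching_core_indices) per logical core.

    In INDEPENDENT mode every physical core is its own logical core.
    In REDUNDANT mode cores 0-2 and 3-5 (an assumed grouping) each form
    one logical core whose value is the voted result.
    """
    if len(outputs) != 6:
        raise ValueError("expected one output word per core (six cores)")
    words = [o & WORD_MASK for o in outputs]
    if mode is Mode.INDEPENDENT:
        return [(w, []) for w in words]
    results = []
    for base in (0, 3):
        a, b, c = words[base:base + 3]
        voted = majority_vote(a, b, c)
        # Cores that disagree with the voted value are candidates for recovery.
        bad = [base + i for i, w in enumerate((a, b, c)) if w != voted]
        results.append((voted, bad))
    return results


if __name__ == "__main__":
    # Core 2 has a single flipped bit; the vote masks it and flags the core.
    outs = [0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEB, 1, 1, 1]
    print(group_outputs(Mode.REDUNDANT, outs))
    print(len(group_outputs(Mode.INDEPENDENT, outs)))
```

In redundant mode the six physical cores collapse into two logical results, each paired with a list of disagreeing cores. In independent mode all six outputs pass through unchanged, which corresponds to the high-performance configuration.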
