Tolerating Soft Errors in Processor Cores Using CLEAR (Cross-Layer Exploration for Architecting Resilience)

Eric Cheng, Shahrzad Mirkhani, Lukasz G. Szafaryn, Chen-Yong Cher, Hyungmin Cho, Kevin Skadron, Mircea R. Stan, Klas Lilja, Jacob A. Abraham, Pradip Bose, and Subhasish Mitra

arXiv:1709.09921·cs.AR·September 29, 2017

TL;DR

This paper introduces CLEAR, a framework that systematically explores and optimizes cross-layer resilience techniques to protect processor cores from soft errors with minimal cost, achieving significant reliability improvements.

Contribution

The paper presents the first systematic exploration of cross-layer resilience techniques for soft error tolerance in processor cores, optimizing combinations for cost-effectiveness.

Findings

01

A combination of circuit hardening, parity checking, and recovery yields high resilience at low cost.

02

Selective circuit hardening guided by application analysis provides a cost-effective solution.

03

The approach achieves a 50x improvement in silent data corruption rate with minimal energy overhead.

Abstract

We present CLEAR (Cross-Layer Exploration for Architecting Resilience), a first-of-its-kind framework that overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieving desired resilience targets at minimal cost (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (586 cross-layer combinations in this paper),…
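The idea of exploring combinations of techniques against a resilience target can be illustrated with a minimal sketch. This is not the CLEAR framework itself: the technique names, the improvement factors, the energy costs, and the toy model (improvements multiply, costs add) are all hypothetical placeholders chosen only to show the shape of the search.

```python
from itertools import combinations
from math import prod

# Hypothetical techniques: (name, SDC improvement factor, energy cost fraction).
# All numbers are illustrative placeholders, NOT results from the paper.
TECHNIQUES = [
    ("circuit_hardening", 10.0, 0.020),
    ("logic_parity", 5.0, 0.015),
    ("software_assertions", 1.5, 0.150),
    ("algorithm_checks", 2.0, 0.100),
]


def evaluate(combo):
    """Toy model (an assumption): improvements multiply, energy costs add."""
    improvement = prod(t[1] for t in combo)
    cost = sum(t[2] for t in combo)
    return improvement, cost


def cheapest_combination(techniques, target_improvement):
    """Exhaustively search all non-empty subsets of techniques and return
    (names, improvement, cost) for the lowest-cost subset that meets the
    target, or None if no subset reaches it."""
    best = None
    for r in range(1, len(techniques) + 1):
        for combo in combinations(techniques, r):
            improvement, cost = evaluate(combo)
            if improvement < target_improvement:
                continue
            if best is None or cost < best[2]:
                best = (tuple(t[0] for t in combo), improvement, cost)
    return best


if __name__ == "__main__":
    print(cheapest_combination(TECHNIQUES, 50.0))
```

Exhaustive enumeration is workable here only because the placeholder list is tiny; a realistic evaluation of each combination would be far more expensive than this toy model suggests.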
