A Survey of fault models and fault tolerance methods for 2D bus-based multi-core systems and TSV based 3D NOC many-core systems
Shashikiran Venkatesha, Ranjani Parthasarathi

TL;DR
This survey comprehensively reviews fault models, failure mechanisms, and fault tolerance techniques across 2D bus-based multi-core and 3D TSV-based NOC many-core systems, highlighting recent advances and challenges.
Contribution
It provides an integrated overview of fault tolerance methods from logic to hardware levels for both 2D and 3D multi-core architectures, including novel insights into TSV-based 3D NOC systems.
Findings
Analysis of fault models and failure mechanisms in multi-core systems
Evaluation of fault mitigation techniques at various system layers
Discussion on defect tolerance and diagnosis methods for 3D NOC systems
Abstract
Reliability has taken centre stage in the development of high-performance computing processors. A surge of interest is noticeable in recent times in formulating fault and failure models, understanding failure mechanisms, and devising fault mitigation methods to improve system reliability. The article presents a sequence of concepts, illustrated one after the other, for a better understanding of the damage caused by radiation, the relevant fault models, and the effects of faults. We examine state-of-the-art fault mitigation techniques at the logical layer for digital CMOS-based designs and SRAM-based FPGAs. The CMOS SRAM structure is the same for both digital CMOS and FPGA. An understanding of resilient SRAM-based FPGAs is necessary for developing resilient prototypes, and it facilitates faster integration of digital CMOS designs. At the micro-architectural and architectural layer, error…
Taxonomy
Topics: Advanced Memory and Neural Computing · Integrated Circuits and Semiconductor Failure Analysis · Radiation Effects in Electronics
