A Survey of fault mitigation techniques for multi-core architectures
Shashikiran Venkatesha, Ranjani Parthasarathi

TL;DR
This survey reviews fault mitigation techniques for multi-core architectures, analyzing methods like detection, recovery, and reconfigurability, and discusses their impact on reliability, performance, and future research directions.
Contribution
It provides a comprehensive overview of existing fault-tolerant approaches, critically examines the literature, and suggests new research avenues for both homogeneous and heterogeneous multi-core systems.
Findings
Fault-tolerant methods improve multi-core reliability.
Trade-offs exist between performance, area, and fault coverage.
Analytical models help explain the impact of fault tolerance.
Abstract
Fault tolerance in multi-core architectures has attracted the attention of the research community for the past 20 years. Rapid improvements in CMOS technology have resulted in exponential growth of transistor density, which in turn has made it increasingly challenging to design resilient multi-core architectures at the same pace. The article presents a survey of fault-tolerant methods such as fault detection, recovery, reconfigurability and repair techniques for multi-core architectures. Salvaging at the micro-architectural and architectural levels is also discussed. The gamut of fault-tolerant approaches discussed in this article yields tangible improvements in the reliability of multi-core architectures. Every concept in the seminal articles is examined with respect to relevant metrics such as performance cost, area overhead, fault coverage, level of protection, detection latency and Mean Time To Failure. The…
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Interconnection Networks and Systems
