RepTFD: Replay Based Transient Fault Detection
Lei Li, Tianshi Chen, Yunji Chen, Ling Li, and Ruiyang Wu

TL;DR
RepTFD is a transient fault detection scheme for chip multiprocessors that replays the execution of a group of cores on a redundant group, achieving 100% coverage with 4.76% performance overhead and about 0.83% chip area overhead.
Contribution
It is the first core-level transient fault detection scheme with 100% coverage, achieved by replaying a group of cores as a whole, and it significantly reduces overhead compared to previous methods.
Findings
Achieves 100% transient fault coverage.
Only 4.76% performance overhead observed.
Consumes about 0.83% of chip area.
Abstract
Advances in IC process technology make future chip multiprocessors (CMPs) increasingly vulnerable to transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core separately. As a result, transient faults in the uncore parts, which occupy over 50% of the area of a modern CMP, may escape detection. This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage. Instead of providing redundancy for each core separately, RepTFD provides redundancy for a group of cores as a whole. Specifically, it replays the execution of the checked group of cores on a redundant group of cores. By comparing the execution results of the two groups, all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending period based record-replay approach, which can greatly reduce the…
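The group-level comparison described in the abstract can be illustrated with a toy software model. The Python sketch below is not the RepTFD hardware design: the function names, the dictionary standing in for shared memory, the recorded schedule, and the injected bit flip are all assumptions made for illustration, and the pending period based record-replay approach is not modeled.

```python
import random


def run_group(programs, schedule, fault_at=None):
    """Run a group of cores against one shared memory in a fixed global order.

    programs: one list of (address, multiplier, addend) operations per core.
    schedule: sequence of core ids giving the global order of operations.
    fault_at: optional step index at which one bit of the written value
              is flipped (a hypothetical transient fault).
    """
    memory = {}
    next_op = [0] * len(programs)
    for step, core in enumerate(schedule):
        addr, mul, add = programs[core][next_op[core]]
        next_op[core] += 1
        value = memory.get(addr, 0) * mul + add
        if step == fault_at:
            value ^= 1 << 3  # assumed single-bit upset, for illustration only
        memory[addr] = value
    return memory


def record_schedule(programs, seed):
    """Pick one interleaving of the cores and return it as the recorded order."""
    rng = random.Random(seed)
    schedule = [core for core, ops in enumerate(programs) for _ in ops]
    rng.shuffle(schedule)
    return schedule


def detect(programs, seed=0, fault_at=None):
    """Replay the checked group's recorded order on a redundant group
    and report whether the two groups' final results differ."""
    schedule = record_schedule(programs, seed)
    checked = run_group(programs, schedule, fault_at)  # may contain a fault
    redundant = run_group(programs, schedule)          # fault-free replay
    return checked != redundant


# Two cores, two operations each, touching the same two addresses.
programs = [
    [(0, 2, 1), (1, 3, 0)],
    [(0, 3, 2), (1, 2, 5)],
]
fault_free_mismatch = detect(programs)
faulty_mismatch = detect(programs, fault_at=1)
```

Because the comparison is made on the state produced by the whole group, the sketch does not need to know which component the fault struck. In this toy model every multiplier is nonzero, so an injected flip always propagates to the final state; a real system also has to account for faults that are masked before they affect any result.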
Taxonomy
Topics: Radiation Effects in Electronics · VLSI and Analog Circuit Testing · Interconnection Networks and Systems
