Understanding Soft Errors in Uncore Components
Hyungmin Cho, Chen-Yong Cher, Thomas Shepherd, Subhasish Mitra

TL;DR
This paper investigates the impact of soft errors in uncore components of SoCs, introduces a high-speed simulation platform, and proposes a replay recovery method that significantly enhances system reliability with minimal area and power overhead.
Contribution
This paper is the first to study the system-level impact of soft errors in uncore components and to develop a replay recovery technique for these components in large-scale SoCs.
Findings
Soft errors in uncore components can significantly impact system-level reliability.
The new mixed-mode simulation platform achieves a 20,000x speedup over RTL-only simulation.
The proposed replay recovery technique reduces application failure probability by over 100x.
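The general idea behind replay recovery can be illustrated with a toy sketch: keep a log of the requests a component has accepted, and if an error is detected before they commit, reset the component and replay the log. The sketch below is a generic illustration only, not the authors' implementation. The names `ToyComponent`, `run_with_replay`, and `inject_fault` are hypothetical, and it assumes that an error detector exists and that committed state is protected separately.

```python
from dataclasses import dataclass, field


@dataclass
class ToyComponent:
    """Hypothetical stand-in for an uncore block with in-flight requests."""
    store: dict = field(default_factory=dict)     # committed state (assumed protected)
    inflight: list = field(default_factory=list)  # unprotected working state
    corrupted: bool = False                       # set by an assumed error detector

    def accept(self, addr, value):
        self.inflight.append((addr, value))

    def commit(self):
        # Refuse to commit if the (assumed) detector has flagged an error.
        if self.corrupted:
            raise RuntimeError("soft error detected")
        for addr, value in self.inflight:
            self.store[addr] = value
        self.inflight.clear()

    def reset(self):
        # Discard the possibly corrupted working state.
        self.inflight.clear()
        self.corrupted = False


def run_with_replay(component, requests, inject_fault=False):
    """Process requests; on a detected error, reset and replay the log."""
    log = []
    for addr, value in requests:
        log.append((addr, value))  # record before handing to the component
        component.accept(addr, value)

    if inject_fault and component.inflight:
        # Simulate a bit flip in the first in-flight entry, plus detection.
        addr, value = component.inflight[0]
        component.inflight[0] = (addr, value ^ 1)
        component.corrupted = True

    try:
        component.commit()
    except RuntimeError:
        component.reset()
        for addr, value in log:
            component.accept(addr, value)
        component.commit()
    return component.store
```

Calling `run_with_replay(ToyComponent(), [(0, 10), (1, 20)], inject_fault=True)` commits the original values even though one in-flight entry was corrupted, because the corrupted working state is discarded and rebuilt from the log. In real hardware the cost of such a scheme depends on how much state must be logged, which this sketch does not model.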
Abstract
The effects of soft errors in processor cores have been widely studied. However, little has been published about soft errors in uncore components, such as memory subsystem and I/O controllers, of a System-on-a-Chip (SoC). In this work, we study how soft errors in uncore components affect system-level behaviors. We have created a new mixed-mode simulation platform that combines simulators at two different levels of abstraction, and achieves 20,000x speedup over RTL-only simulation. Using this platform, we present the first study of the system-level impact of soft errors inside various uncore components of a large-scale, multi-core SoC using the industrial-grade, open-source OpenSPARC T2 SoC design. Our results show that soft errors in uncore components can significantly impact system-level reliability. We also demonstrate that uncore soft errors can create major challenges for…
