TL;DR
HARP is an error profiling algorithm for memory chips with on-die ECC. It identifies the bits that are at risk of error even though on-die ECC obfuscates the memory controller's view of those errors, and it reaches full coverage of at-risk bits faster than baseline profiling algorithms.
Contribution
HARP introduces a two-phase profiling approach that combines existing techniques with a secondary ECC to fully identify at-risk bits in on-die ECC memory chips.
Findings
HARP achieves significantly faster full coverage of at-risk bits compared to baseline algorithms.
HARP reduces system bit error rate more effectively than existing methods.
HARP's approach is robust across different error rates and configurations.
Abstract
State-of-the-art techniques for addressing scaling-related main memory errors identify and repair bits that are at risk of error from within the memory controller. Unfortunately, modern main memory chips internally use on-die error correcting codes (on-die ECC) that obfuscate the memory controller's view of errors, complicating the process of identifying at-risk bits (i.e., error profiling). To understand the problems that on-die ECC causes for error profiling, we analytically study how on-die ECC changes the way that memory errors appear outside of the memory chip (e.g., to the memory controller). We show that on-die ECC introduces statistical dependence between errors in different bit positions, raising three key challenges for practical and effective error profiling. To address the three challenges, we introduce Hybrid Active-Reactive Profiling (HARP), a new error profiling…
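The statistical dependence described in the abstract can be illustrated with a toy example. The sketch below is not the paper's code or its ECC configuration: it assumes a textbook Hamming(7,4) single-error-correcting code as a stand-in for on-die ECC, and the function names are hypothetical. It shows that two raw errors which are each harmless alone can together cause the decoder to miscorrect, so an error appears in a data bit that never failed inside the chip.

```python
# Toy model (assumption): Hamming(7,4) stands in for on-die ECC.
# Codeword positions are 1-indexed; parity bits sit at 1, 2, 4.
DATA_POS = (3, 5, 6, 7)
PARITY_POS = (1, 2, 4)

def syndrome(cw):
    """XOR of the 1-indexed positions of all set bits (0 means 'no error seen')."""
    s = 0
    for pos in range(1, 8):
        if cw[pos - 1]:
            s ^= pos
    return s

def encode(data):
    """Place 4 data bits and set parity bits so the syndrome is 0."""
    cw = [0] * 7
    for pos, bit in zip(DATA_POS, data):
        cw[pos - 1] = bit
    s = syndrome(cw)
    for p in PARITY_POS:
        if s & p:
            cw[p - 1] = 1
    return cw

def decode(cw):
    """Single-error correction: flip the bit the syndrome points at, return data bits."""
    cw = list(cw)
    s = syndrome(cw)
    if s:
        cw[s - 1] ^= 1
    return [cw[pos - 1] for pos in DATA_POS]

def post_correction_errors(data, raw_error_positions):
    """Data positions that are wrong as seen from outside the chip (after decoding)."""
    cw = encode(data)
    for pos in raw_error_positions:
        cw[pos - 1] ^= 1          # inject raw (pre-correction) bit flips
    out = decode(cw)
    return [pos for pos, a, b in zip(DATA_POS, data, out) if a != b]

data = [1, 0, 1, 1]
print(post_correction_errors(data, [3]))     # one raw error: corrected
print(post_correction_errors(data, [5]))     # one raw error: corrected
print(post_correction_errors(data, [3, 5]))  # two raw errors: miscorrection adds a third
```

In this toy code, raw errors at positions 3 and 5 are each invisible on their own, but together they produce post-correction errors at positions 3, 5, and 6. Whether position 6 ever appears faulty to the memory controller therefore depends on the joint behavior of other bits, which is the kind of dependence that makes profiling from outside the chip difficult.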
