Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes
Minesh Patel

TL;DR
This dissertation examines how proprietary on-die ECC in memory chips obfuscates CPU-visible error characteristics, and proposes testing techniques that recover the underlying error behavior so that error mitigation can be designed effectively.
Contribution
The dissertation provides a detailed analysis of how on-die ECC obfuscates error characteristics and introduces testing methods that reveal the underlying memory error behavior.
Findings
On-die ECC obfuscates CPU-visible error characteristics in an unpredictable, ECC-dependent manner.
New testing techniques can recover error statistics despite ECC.
Understanding error mechanisms enables better error mitigation strategies.
Abstract
Improvements in main memory storage density are primarily driven by process technology scaling, which negatively impacts reliability by exacerbating various circuit-level error mechanisms. To compensate for growing error rates, both memory manufacturers and consumers use error-mitigation mechanisms that improve manufacturing yield and allow system designers to meet reliability targets. Developing effective error mitigations requires understanding the errors' characteristics (e.g., worst-case behavior, statistical properties). Unfortunately, we observe that proprietary on-die Error-Correcting Codes (ECC) used in modern memory chips introduce new challenges to efficient error mitigation by obfuscating CPU-visible error characteristics in an unpredictable, ECC-dependent manner. This dissertation builds a detailed understanding of how on-die ECC obfuscates the statistical properties of…
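To make the obfuscation effect concrete, here is a minimal sketch. It assumes a textbook Hamming(7,4) single-error-correcting code as a stand-in for on-die ECC; real on-die ECC is proprietary, and its parameters are not given here. All names (`encode`, `decode`, `visible_errors`, `double_error_histogram`) are hypothetical and chosen for illustration only. The sketch injects raw bit errors into a stored codeword and counts the data-bit errors that remain visible after the chip-internal decoder runs.

```python
from collections import Counter
from itertools import combinations

# Illustrative assumption: Hamming(7,4) stands in for the proprietary
# on-die ECC. Positions are 1-indexed within the 7-bit codeword.
DATA_POS = (3, 5, 6, 7)   # positions that hold data bits
PARITY_POS = (1, 2, 4)    # positions that hold parity bits


def syndrome(cw):
    """XOR of the 1-indexed positions of all set bits (0 means no error seen)."""
    s = 0
    for pos in range(1, 8):
        if cw[pos - 1]:
            s ^= pos
    return s


def encode(data):
    """Place 4 data bits and set parity bits so the syndrome becomes zero."""
    cw = [0] * 7
    for pos, bit in zip(DATA_POS, data):
        cw[pos - 1] = bit
    s = syndrome(cw)
    for p in PARITY_POS:
        if s & p:
            cw[p - 1] = 1
    return cw


def decode(cw):
    """Flip the bit the syndrome points at, then return the 4 data bits."""
    cw = list(cw)
    s = syndrome(cw)
    if s:
        cw[s - 1] ^= 1  # a miscorrection when more than one raw error exists
    return [cw[pos - 1] for pos in DATA_POS]


def visible_errors(data, raw_error_positions):
    """Count data-bit errors seen outside the chip for the given raw errors."""
    cw = encode(data)
    for pos in raw_error_positions:
        cw[pos - 1] ^= 1
    read = decode(cw)
    return sum(a != b for a, b in zip(data, read))


def double_error_histogram(data):
    """Histogram of visible error counts over all 21 two-bit raw error patterns."""
    return Counter(
        visible_errors(data, pair) for pair in combinations(range(1, 8), 2)
    )


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    print([visible_errors(data, (p,)) for p in range(1, 8)])
    print(double_error_histogram(data))
```

In this toy code, every single raw error is silently corrected, so the visible error count is zero. Two raw errors always trigger a miscorrection, and the visible count ranges from one to three depending on which bits failed: raw errors in the two parity positions 1 and 2 show up as one data-bit error in a bit that never failed, while raw errors at positions 3 and 5 show up as three data-bit errors. The mapping from raw to visible errors depends entirely on the code's layout, which is why an unknown on-die ECC makes visible error statistics hard to interpret.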
