ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
Ioanna Vavelidou, Subho S. Banerjee, Eric X. Liu, Mike Fuller, Subhasish Mitra, Caroline Trippel

TL;DR
ITHICA automatically generates functional tests for defect-induced silent data corruptions in CPUs by inserting instruction-level error checks into arbitrary programs, and it detects more defective servers than the programs' native checks.
Contribution
The paper introduces an error-checking approach based on instruction duplication and output comparison that turns arbitrary programs into functional tests for CPU defects.
Findings
Detects 39% more defective servers than native checks
Transforms industrial and workload programs into functional tests
Reveals new insights on defect behavior challenging prior studies
Abstract
Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs. We present ITHICA, an approach for automatically generating functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight is that the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and identifies affected instructions upon error detections. We use ITHICA to transform industrial hyperscaler…
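The duplicate-and-compare idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' implementation: ITHICA inserts checks at the level of machine instructions, whereas the Python below merely simulates the principle. The names `FlakyAdder`, `fault_period`, and `checked_sum` are hypothetical, and the call counter is a stand-in for the "execution context" that makes a defect produce inconsistent outputs.

```python
from typing import List, Optional, Tuple


class FlakyAdder:
    """Hypothetical model of an adder with a context-dependent defect.

    Every `fault_period`-th call flips bit 3 of the result. The call
    counter stands in for execution context; None models a healthy unit.
    """

    def __init__(self, fault_period: Optional[int] = None) -> None:
        self.fault_period = fault_period
        self.calls = 0

    def add(self, a: int, b: int) -> int:
        self.calls += 1
        result = a + b
        if self.fault_period and self.calls % self.fault_period == 0:
            result ^= 0b1000  # simulated corruption of one output bit
        return result


# One record per detected mismatch: (left input, right input, first, second)
Mismatch = Tuple[int, int, int, int]


def checked_sum(values: List[int], adder: FlakyAdder,
                mismatches: List[Mismatch]) -> int:
    """Sum `values`, executing every add twice and comparing the outputs."""
    total = 0
    for v in values:
        first = adder.add(total, v)
        second = adder.add(total, v)  # duplicate with identical inputs
        if first != second:
            # The check flags the affected operation. It cannot tell which
            # of the two outputs is the correct one.
            mismatches.append((total, v, first, second))
        total = first
    return total


healthy_log: List[Mismatch] = []
healthy_total = checked_sum([1, 2, 3, 4], FlakyAdder(), healthy_log)

faulty_log: List[Mismatch] = []
faulty_total = checked_sum([1, 2, 3, 4], FlakyAdder(fault_period=3), faulty_log)
```

In this model the healthy adder produces no mismatches, while the faulty one is flagged because its two executions of the same addition disagree. A defect that corrupted both executions identically would pass this comparison, which is why the sketch only covers the inconsistent errors the abstract focuses on.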
