Making Strong Error-Correcting Codes Work Effectively for HBM in AI Inference
Rui Xie, Yunhua Fang, Asad Ul Haq, Linsen Ma, Sanchari Sen, Swagath Venkataramani, Liu Liu, Tong Zhang

TL;DR
This paper introduces REACH, a controller-managed ECC scheme for HBM in AI inference that tolerates much higher raw bit error rates while preserving end-to-end correctness and throughput, without changing the HBM PHY or the fixed 32 B transaction size.
Contribution
REACH employs a two-level Reed-Solomon scheme managed by the controller, enabling high error tolerance and cost reduction in HBM for AI inference.
Findings
REACH maintains 79% of ECC throughput at a raw bit error rate (BER) of zero.
REACH is qualified up to a raw BER of 1e-3.
REACH reduces ECC area by 11.6x and power by 60%.
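To put the 1e-3 figure in context, the short sketch below estimates how often a single 32 B transfer would contain at least one flipped bit at a given raw BER. It assumes independent, uniformly distributed bit errors, which is a simplification introduced here and not a fault model taken from the paper; the function names are illustrative.

```python
# Back-of-the-envelope sketch: per-chunk error exposure at a given raw BER.
# Assumption (not from the paper): bit errors are independent and identically distributed.

CHUNK_BITS = 32 * 8  # one 32 B HBM transaction = 256 bits


def expected_bit_errors(ber: float, bits: int = CHUNK_BITS) -> float:
    """Expected number of flipped bits in one chunk."""
    return ber * bits


def prob_chunk_has_error(ber: float, bits: int = CHUNK_BITS) -> float:
    """Probability that a chunk contains at least one flipped bit."""
    return 1.0 - (1.0 - ber) ** bits


if __name__ == "__main__":
    for ber in (1e-6, 1e-4, 1e-3):
        print(
            f"raw BER {ber:g}: "
            f"expected flips per 32 B chunk = {expected_bit_errors(ber):.4f}, "
            f"P(chunk has >= 1 flip) = {prob_chunk_has_error(ber):.4f}"
        )
```

Under this assumption, roughly one in five 32 B chunks carries at least one error at a raw BER of 1e-3, which is why most faults need to be handled locally per chunk rather than escalated.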
Abstract
LLM inference is increasingly memory bound, and HBM cost per GB dominates system cost. Current HBM stacks include short on-die ECC that tightens binning, raises price, and fixes reliability policy inside the device. This paper asks whether a system can tolerate a much higher raw HBM bit error rate and still keep end-to-end correctness and throughput, without changing the HBM PHY or the fixed 32 B transaction size. We propose REACH, a controller-managed ECC design that keeps the HBM link and 32 B transfers unchanged. REACH uses a two-level Reed-Solomon scheme: each 32 B chunk uses an inner code to check and correct most faults locally, while chunks that cannot be fixed are marked as erasures. An outer code spans kilobytes and runs in erasure-only mode, repairing only flagged chunks and avoiding the expensive locator step. For small random writes, REACH updates outer parity with…
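The two-level flow described in the abstract (a local per-chunk check that flags failures as erasures, followed by an outer code that repairs only the flagged chunks) can be illustrated with a deliberately simplified toy. This is not REACH's implementation: the inner Reed-Solomon code is replaced by a CRC-32 check that only detects errors, and the outer Reed-Solomon code is replaced by a single XOR parity chunk that can repair exactly one erasure. The check values are assumed to be stored error-free, and all function names are hypothetical.

```python
# Toy sketch of "inner check flags erasures, outer code repairs flagged chunks".
# Stand-ins (assumptions, not the paper's codes):
#   inner code -> CRC-32 (detect only, no local correction)
#   outer code -> one XOR parity chunk (repairs at most one erased chunk)
import zlib
from functools import reduce

CHUNK = 32  # bytes per HBM transaction, as stated in the abstract


def xor_chunks(chunks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def encode(data_chunks):
    """Append one outer parity chunk and compute a per-chunk inner check."""
    stored = list(data_chunks) + [xor_chunks(data_chunks)]
    checks = [zlib.crc32(c) for c in stored]
    return stored, checks


def decode(stored, checks):
    """Return (data chunks, indices that were flagged as erasures)."""
    # Inner level: verify each chunk locally and flag failures as erasures.
    erased = [i for i, c in enumerate(stored) if zlib.crc32(c) != checks[i]]
    if len(erased) > 1:
        raise ValueError("toy outer code can repair only one erasure")
    repaired = list(stored)
    if erased:
        # Outer level: the erasure position is already known, so no error
        # locator search is needed; XOR of the survivors rebuilds the chunk.
        idx = erased[0]
        survivors = [c for j, c in enumerate(stored) if j != idx]
        repaired[idx] = xor_chunks(survivors)
    return repaired[:-1], erased


if __name__ == "__main__":
    data = [bytes([i]) * CHUNK for i in range(4)]
    stored, checks = encode(data)
    stored[2] = b"\xff" + stored[2][1:]  # corrupt one byte of chunk 2
    recovered, erased = decode(stored, checks)
    print("erased chunks:", erased, "recovered ok:", recovered == data)
```

The point of the sketch is the division of labor: because the inner level reports where the damage is, the outer level only has to fill in known positions, which is the property the abstract credits with avoiding the locator step. A real Reed-Solomon outer code would tolerate multiple erasures per span, unlike the single-parity stand-in used here.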
Taxonomy
Topics: Distributed systems and fault tolerance · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
