Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure
Rui Xie, Asad Ul Haq, Yunhua Fang, Linsen Ma, Sanchari Sen, Swagath Venkataramani, Liu Liu, Tong Zhang

TL;DR
This paper proposes a system-level, domain-specific ECC framework that eliminates on-die ECC in HBM, significantly reducing costs while maintaining high throughput and accuracy for AI inference workloads.
Contribution
It introduces a novel ECC approach combining large-codeword Reed-Solomon correction with CRC detection and differential parity, enabling cost-effective, tunable reliability for HBM in AI systems.
Findings
Retains over 78% of throughput at raw HBM bit error rates up to 10^{-3}
Maintains 97% of model accuracy at those error rates, relative to ideal error-free HBM
Enables cost reduction by shifting fault management from on-die ECC to the memory controller
Abstract
High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed-Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to 10^{-3}, the system retains over 78% of throughput and 97% of model accuracy compared with systems equipped with ideal error-free HBM. By treating reliability…
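The interplay of fine-grained CRC detection and differential parity updates can be illustrated with a minimal sketch. This is not the paper's implementation: a single XOR parity chunk stands in for Reed-Solomon parity (both are linear codes, so the same "update parity from the data delta" identity applies), and the 32-byte chunk size, the function names, and the use of CRC-32 are all assumptions made for illustration.

```python
import zlib

CHUNK = 32  # bytes per CRC-protected chunk (assumed size, not from the paper)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_codeword(chunks):
    """Compute one CRC per chunk plus a single XOR parity chunk.

    The XOR parity is a stand-in for RS parity over a large codeword.
    """
    crcs = [zlib.crc32(c) for c in chunks]
    parity = bytes(CHUNK)
    for c in chunks:
        parity = xor_bytes(parity, c)
    return crcs, parity

def differential_update(chunks, crcs, parity, idx, new_chunk):
    """Overwrite one chunk and patch the parity from the data delta only.

    No other chunk of the codeword is read, which is what limits
    write amplification for small writes.
    """
    delta = xor_bytes(chunks[idx], new_chunk)
    chunks[idx] = new_chunk
    crcs[idx] = zlib.crc32(new_chunk)
    return xor_bytes(parity, delta)

def read_chunk(chunks, crcs, parity, idx):
    """Fast path: check only the chunk's CRC.

    Slow path: on a CRC mismatch, rebuild the chunk from the parity
    and the remaining chunks (the analogue of invoking RS decoding).
    """
    if zlib.crc32(chunks[idx]) == crcs[idx]:
        return chunks[idx]
    rebuilt = parity
    for i, c in enumerate(chunks):
        if i != idx:
            rebuilt = xor_bytes(rebuilt, c)
    if zlib.crc32(rebuilt) != crcs[idx]:
        raise ValueError("uncorrectable error in codeword")
    return rebuilt

# Usage: write one chunk differentially, then recover it after a bit flip.
chunks = [bytes([i]) * CHUNK for i in range(4)]
crcs, parity = make_codeword(chunks)

new_data = bytes([0xAB]) * CHUNK
parity = differential_update(chunks, crcs, parity, 2, new_data)
assert parity == make_codeword(chunks)[1]  # matches a full recomputation

chunks[2] = bytes([chunks[2][0] ^ 0x01]) + chunks[2][1:]  # inject a bit error
assert read_chunk(chunks, crcs, parity, 2) == new_data
```

In this sketch, error-free reads touch only one chunk and its CRC, while the more expensive codeword-wide correction runs only when detection fails. A real RS code would tolerate multiple corrupted chunks per codeword, which a single XOR parity cannot.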
