Investigating Memory Failure Prediction Across CPU Architectures

Qiao Yu; Wengui Zhang; Min Zhou; Jialiang Yu; Zhenli Sheng; Jasmin Bogatinovski; Jorge Cardoso; Odej Kao

arXiv:2406.05354·cs.AR·December 17, 2024

TL;DR

This paper examines how memory failures, and the ability to predict them, differ across CPU architectures such as X86 and ARM. It applies machine learning to production datasets to improve prediction accuracy and proposes an MLOps framework for deployment.

Contribution

The paper investigates architecture-specific memory failure patterns, improves prediction accuracy with ML, and introduces an MLOps framework for production use.

Findings

1. Up to 15% improvement in F1-score over the existing algorithm.

2. Unique memory failure patterns identified for each CPU architecture.

3. An MLOps framework developed for continuously improving failure prediction.

Abstract

Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we conduct the memory failure prediction in different processors' platforms, achieving up to 15% improvements in F1-score compared to the existing algorithm. Finally, an MLOps (Machine Learning Operations) framework…
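
For readers unfamiliar with the metric behind the reported improvement, the following is a minimal sketch of how an F1-score could be computed for a binary UE predictor. It is not the authors' evaluation code: the function name `f1_score`, the label convention (1 = the DIMM later experienced a UE, 0 = it did not), and the toy labels are all assumptions made for illustration.

```python
def f1_score(y_true, y_pred):
    """Compute the F1-score for binary labels (1 = predicted/actual UE)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: DIMMs flagged as failing that actually failed.
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    # False positives: DIMMs flagged as failing that stayed healthy.
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # False negatives: failing DIMMs the predictor missed.
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        # With no true positives, precision or recall is zero or undefined;
        # returning 0.0 is a common convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight DIMMs (illustrative only, not from the paper).
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(actual, predicted)
```

Because UEs are rare relative to healthy DIMMs, F1 is a natural choice here: it balances missed failures (recall) against false alarms (precision) rather than rewarding a predictor for labeling everything healthy.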
