M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure

Hongyi Xie; Min Zhou; Qiao Yu; Jialiang Yu; Zhenli Sheng; Hong Xie; Defu Lian

arXiv:2507.07144 · cs.DC · July 11, 2025

TL;DR

M$^2$-MFP is a novel hierarchical framework that enhances memory failure prediction in cloud systems by combining multi-level feature extraction and interpretable temporal modeling, significantly outperforming existing methods.

Contribution

The paper introduces M$^2$-MFP, a multi-scale, hierarchical prediction framework that automatically extracts high-order features and employs dual-path temporal modeling for improved reliability.

Findings

01. Outperforms state-of-the-art methods on benchmark datasets.

02. Effective in real-world cloud infrastructure deployment.

03. Significantly higher recall and accuracy in failure prediction.

Abstract

As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate failure prediction through the analysis of memory error logs (i.e., Correctable Errors) imperative. Existing memory failure prediction approaches have notable limitations: rule-based expert models suffer from limited generalizability and low recall rates, while automated feature extraction methods exhibit suboptimal performance. To address these limitations, we propose M$^2$-MFP: a Multi-scale and hierarchical memory failure prediction framework designed to enhance the reliability and availability of cloud infrastructure. M$^2$-MFP converts Correctable Errors (CEs) into multi-level binary matrix representations and introduces a Binary Spatial Feature…
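
To make the idea of a binary matrix representation of CEs concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's method: the `(row, column)` record layout, the helper names `ce_binary_matrix` and `coarsen`, and the OR-pooling used to derive a coarser level are all hypothetical, since this excerpt does not specify the matrix layout or how the levels are defined.

```python
from typing import Iterable, List, Tuple

# Hypothetical CE record: the (row, column) address of one correctable error
# inside a single memory region. This field layout is an assumption.
CE = Tuple[int, int]


def ce_binary_matrix(ces: Iterable[CE], n_rows: int, n_cols: int) -> List[List[int]]:
    """Set a cell to 1 if at least one CE was reported at that (row, column)."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for row, col in ces:
        # Ignore records that fall outside the assumed address range.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            matrix[row][col] = 1
    return matrix


def coarsen(matrix: List[List[int]], factor: int) -> List[List[int]]:
    """Derive a coarser level by OR-pooling factor x factor blocks (assumed scheme)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out_rows = (n_rows + factor - 1) // factor
    out_cols = (n_cols + factor - 1) // factor
    coarse = [[0] * out_cols for _ in range(out_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if matrix[r][c]:
                coarse[r // factor][c // factor] = 1
    return coarse


# Toy example: four CE records, one of them a repeat of the same cell.
ces = [(0, 1), (0, 1), (3, 2), (2, 3)]
fine = ce_binary_matrix(ces, n_rows=4, n_cols=4)
coarse = coarsen(fine, factor=2)
```

Repeated errors at the same address collapse to a single 1, so the matrix records where errors occurred rather than how often. Any count or timing information would need a separate representation.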
