Highly Efficient Memory Failure Prediction using Mcelog-based Data Mining and Machine Learning
Chengdong Yao

TL;DR
This paper presents an efficient machine learning approach for predicting memory failures in data centers using Mcelog-based data mining. It addresses heavy data noise and extreme class imbalance, and its single model placed 14th in the 2nd Alibaba Cloud AIOps Competition.
Contribution
It introduces a novel, fast, and accurate memory failure prediction model that outperforms existing solutions and is suitable for real-time deployment in data centers.
Findings
Placed 14th in the 2nd Alibaba Cloud AIOps Competition, held as part of the 25th PAKDD conference
Model passes the online test in 30 minutes, much faster than competitors
Open-sourced code for community use
Abstract
In the data center, unexpected downtime caused by memory failures can reduce the stability of a server and even of the entire information technology infrastructure, which harms the business. Whether memory failures can be accurately predicted in advance has therefore become one of the most important issues to study in the data center. However, memory failure prediction in a production system must solve technical problems such as heavy data noise and extreme imbalance between positive and negative samples, while also ensuring the long-term stability of the algorithm. This paper compares and summarizes some commonly used techniques and the improvement each can bring. The single model we propose placed 14th in the 2nd Alibaba Cloud AIOps Competition, held as part of the 25th PAKDD conference. It takes only 30 minutes to pass the online…
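The abstract names extreme imbalance between positive and negative samples as a core difficulty. As an illustration only, the sketch below shows two generic ways such imbalance is often handled: downsampling negatives and weighting classes inversely to their frequency. It is not the paper's method; the function names, the `neg_per_pos` ratio, and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def downsample_negatives(samples, labels, neg_per_pos=10, seed=0):
    """Keep every positive and at most `neg_per_pos` negatives per positive.

    Hypothetical helper for illustration; the ratio is an assumed value.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    # Sort the kept indices so the original (e.g. time) order is preserved.
    kept = sorted(pos_idx + rng.sample(neg_idx, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


# Toy data (assumed): 1000 server time windows, 5 of which precede a failure.
labels = [1 if i % 200 == 0 else 0 for i in range(1000)]
samples = [[float(i)] for i in range(1000)]

X, y = downsample_negatives(samples, labels, neg_per_pos=10)
weights = balanced_class_weights(y)
```

The resulting `weights` dictionary could be passed to a classifier that accepts per-class weights, so that the rare failure class contributes more to the training loss than its raw count would suggest.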
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · VLSI and Analog Circuit Testing
