I/O Burst Prediction for HPC Clusters using Darshan Logs
Ehsan Saeedizade, Roya Taheri, Engin Arslan

TL;DR
This paper presents machine learning models trained on Darshan logs to predict system-level I/O bursts in HPC clusters with high accuracy, enabling improved scheduling and reduced application runtimes.
Contribution
It introduces a novel approach to predict I/O bursts at the system level using machine learning on Darshan logs, with practical validation through a burst-aware scheduler.
Findings
Read and write I/O rates fluctuate by more than 100x in all three clusters studied.
Burst predictions achieve over 90% accuracy (F-1 score) five minutes ahead and over 87% two hours ahead.
Burst-aware scheduling reduces application runtime by up to 5x.
Abstract
Understanding cluster-wide I/O patterns of large-scale HPC clusters is essential to minimize the occurrence and impact of I/O interference. Yet most previous work in this area has focused on monitoring and predicting task-level and node-level I/O burst events. This paper analyzes Darshan reports from three supercomputers to extract system-level read and write I/O rates in five-minute intervals. We observe significant (over 100x) fluctuations in read and write I/O rates in all three clusters. We then train machine learning models to estimate the occurrence of system-level I/O bursts 5 to 120 minutes ahead. Evaluation results show that we can predict I/O bursts with more than 90% accuracy (F-1 score) five minutes ahead and more than 87% accuracy two hours ahead. We also show that the ML models attain more than 70% accuracy when estimating the degree of the I/O burst. We believe that high-accuracy…
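The extraction step described in the abstract, turning per-job I/O records into system-level rates over five-minute intervals, can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes each job has already been reduced to a hypothetical (start, end, bytes) tuple with integer timestamps in seconds, spreads each job's bytes uniformly over its runtime, and labels an interval as a burst when its rate exceeds a multiple of the median rate. The function names, the uniform-spread assumption, and the threshold rule are all illustrative choices and are not taken from the paper.

```python
from collections import defaultdict
from statistics import median

BIN_SECONDS = 300  # five-minute intervals, as in the abstract


def system_io_rates(jobs, bin_seconds=BIN_SECONDS):
    """Aggregate per-job I/O into cluster-wide rates per time bin.

    jobs: iterable of (start, end, nbytes) with integer timestamps in seconds
          (a hypothetical record format, not the Darshan log schema).
    Returns {bin_index: bytes_per_second}, sorted by bin index.
    Assumption: each job's bytes are spread uniformly over its runtime.
    """
    totals = defaultdict(float)
    for start, end, nbytes in jobs:
        end = max(end, start + 1)  # treat zero-length jobs as lasting one second
        duration = end - start
        first_bin = start // bin_seconds
        last_bin = (end - 1) // bin_seconds
        for b in range(first_bin, last_bin + 1):
            # Portion of the job's runtime that overlaps this bin.
            lo = max(start, b * bin_seconds)
            hi = min(end, (b + 1) * bin_seconds)
            totals[b] += nbytes * (hi - lo) / duration
    return {b: total / bin_seconds for b, total in sorted(totals.items())}


def label_bursts(rates, factor=10.0):
    """Mark a bin as a burst if its rate exceeds factor * median rate.

    The threshold rule is an illustrative assumption, not the paper's definition.
    """
    threshold = factor * median(rates.values())
    return {b: rate > threshold for b, rate in rates.items()}


# Three toy jobs: two light jobs and one heavy job in the second interval.
example_jobs = [(0, 600, 600), (300, 600, 30000), (600, 900, 300)]
example_rates = system_io_rates(example_jobs)
example_bursts = label_bursts(example_rates)
```

The resulting per-interval burst labels are the kind of binary time series a classifier could then be trained to forecast some number of intervals ahead.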
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
