A Taxonomy of Error Sources in HPC I/O Machine Learning Models
Mihailo Isakov, Mikaela Currier, Eliakin del Rosario, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Glenn K. Lockwood, Michel A. Kinsy

TL;DR
This paper analyzes why machine learning models for HPC I/O throughput underperform in practice, proposing a taxonomy of error sources and tests to improve model robustness and system understanding.
Contribution
It introduces a taxonomy of five error categories in HPC I/O modeling and develops litmus tests to diagnose and address these errors.
Findings
Identified five key error categories affecting I/O model performance.
Developed litmus tests to quantify and diagnose modeling errors.
Provided insights to enhance future HPC I/O modeling tools.
Abstract
I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of the system and the applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed. We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We…
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems
