Predictive Performance Modeling for Distributed Computing using Black-Box Monitoring and Machine Learning
Carl Witt, Marc Bux, Wladislaw Gusew, Ulf Leser

TL;DR
This paper surveys black-box predictive performance modeling techniques for distributed computing, with an emphasis on machine learning approaches that estimate job performance metrics without modifying the workload, in support of scheduling and resource management.
Contribution
It provides a comprehensive classification and comparison of non-intrusive performance prediction methods and highlights open research challenges in the field.
Findings
Non-intrusive methods can predict performance metrics effectively.
Various machine learning techniques are applicable for performance modeling.
The survey identifies key open problems and future research directions.
Abstract
In many domains, the previous decade was characterized by increasing data volumes and growing complexity of computational workloads, creating new demands for highly data-parallel computing in distributed systems. Effective operation of these systems is challenging when facing uncertainties about the performance of jobs and tasks under varying resource configurations, e.g., for scheduling and resource allocation. We survey predictive performance modeling (PPM) approaches to estimate performance metrics such as execution duration, required memory or wait times of future jobs and tasks based on past performance observations. We focus on non-intrusive methods, i.e., methods that can be applied to any workload without modification, since the workload is usually a black-box from the perspective of the systems managing the computational infrastructure. We classify and compare sources of…
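To make the idea of non-intrusive prediction from past observations concrete, here is a minimal sketch. It is not a method taken from the paper: it assumes, purely for illustration, that the only monitored feature is input size and that runtime grows roughly linearly with it. The names `fit_linear`, `predict`, and the `history` records are hypothetical.

```python
from statistics import mean

def fit_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct input sizes")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(model, x):
    """Estimate the metric for a new job with feature value x."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical monitoring records of past jobs:
# (input size in GB, observed runtime in seconds).
# Only externally observable values are used; the job itself is untouched.
history = [(1.0, 12.0), (2.0, 22.0), (4.0, 42.0), (8.0, 82.0)]

model = fit_linear([size for size, _ in history],
                   [runtime for _, runtime in history])
estimate = predict(model, 16.0)  # runtime estimate for a 16 GB job
print(f"estimated runtime: {estimate:.1f} s")
```

A scheduler could use such an estimate to order jobs or size resource requests. Real workloads rarely behave this linearly, which is one reason richer machine learning models are of interest.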
