Towards Advanced Monitoring for Scientific Workflows
Jonathan Bader, Joel Witzke, Soeren Becker, Ansgar Lößer, Fabian Lehmann, Leon Doehler, Anh Duc Vu, and Odej Kao

TL;DR
This paper proposes a four-layer architectural blueprint for monitoring scientific workflows, aiming to improve the analysis of complex, distributed, and highly parallelized task executions by integrating diverse metrics.
Contribution
It introduces a novel four-layer monitoring architecture and evaluates existing workflow systems to facilitate comprehensive performance analysis.
Findings
Four monitoring layers effectively organize metrics and interactions.
Current systems lack integrated multi-layer monitoring capabilities.
The blueprint guides future development of monitoring tools for scientific workflows.
Abstract
Scientific workflows consist of thousands of highly parallelized tasks executed in a distributed environment involving many components. Because the resulting volume of data cannot be analyzed manually, the performance metrics, traces, and behavior of components and tasks must be traced and investigated automatically, giving the end user a suitable level of abstraction. The execution and monitoring of scientific workflows involve many components: the cluster infrastructure, its resource manager, the workflow, and the workflow tasks. Each component in such an execution environment has access to different monitoring metrics and provides them at a different abstraction level. The combination and analysis of metrics observed by different components, and of their interdependencies, remain largely neglected. We specify four different monitoring layers that can serve as an architectural blueprint…
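To make the idea of combining metrics from different components concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that the layers correspond to the four components named in the abstract and that observations can be joined on a shared task identifier. The names `Layer`, `Metric`, `correlate_by_task`, and all metric names and values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    # Hypothetical layer names, borrowed from the components listed in the abstract.
    INFRASTRUCTURE = "cluster infrastructure"
    RESOURCE_MANAGER = "resource manager"
    WORKFLOW = "workflow"
    TASK = "workflow task"


@dataclass(frozen=True)
class Metric:
    layer: Layer
    source: str    # reporting component, e.g. a node, scheduler, or workflow id
    task_id: str   # assumed join key shared by all layers
    name: str
    value: float


def correlate_by_task(metrics):
    """Group metrics from all layers by the task they relate to."""
    grouped = defaultdict(dict)
    for m in metrics:
        grouped[m.task_id][(m.layer, m.name)] = m.value
    return dict(grouped)


# Made-up observations for a single task, one per layer.
observations = [
    Metric(Layer.INFRASTRUCTURE, "node-1", "task-42", "cpu_util_percent", 87.5),
    Metric(Layer.RESOURCE_MANAGER, "scheduler", "task-42", "requested_memory_mb", 4096.0),
    Metric(Layer.WORKFLOW, "wf-7", "task-42", "retries", 1.0),
    Metric(Layer.TASK, "task-42", "task-42", "peak_memory_mb", 1530.0),
]

combined = correlate_by_task(observations)
```

With all four layers joined on one key, a question such as "how much of the requested memory did this task actually use?" becomes a lookup across two layers instead of a manual comparison of separate logs.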
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Business Process Modeling and Analysis
