Towards observability of scientific applications
Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski

TL;DR
This paper introduces a tailored observability solution for scientific applications in HPC environments, utilizing data analysis tools like DataFrames and Jupyter to improve monitoring and troubleshooting of complex scientific workflows.
Contribution
It presents an end-to-end observability framework specifically designed for scientific computing in HPC, addressing challenges like metrics collection, instrumentation, and context propagation.
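To make the context propagation challenge concrete, here is a minimal sketch of one way a trace context could be handed from a parent job step to a child process. It is not the paper's mechanism, which this summary does not describe. It assumes the W3C Trace Context `traceparent` layout (version, trace ID, span ID, flags) and an environment variable named `TRACEPARENT`; the helper names are hypothetical.

```python
import os
import secrets
import subprocess
import sys

# Assumption: the W3C Trace Context "traceparent" layout is used
# (version-traceid-spanid-flags). The variable name is hypothetical.
ENV_VAR = "TRACEPARENT"


def new_traceparent(trace_id=None):
    """Build a traceparent string, reusing trace_id if one is given."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(value):
    """Split a traceparent string into its four fields."""
    version, trace_id, span_id, flags = value.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "flags": flags,
    }


def run_child_step(parent):
    """Launch a child process that inherits the context via its environment."""
    env = dict(os.environ, **{ENV_VAR: parent})
    child_code = f"import os; print(os.environ[{ENV_VAR!r}])"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The parent step creates a context and the child reads it back.
parent = new_traceparent()
seen_by_child = run_child_step(parent)

# The child starts its own span under the same trace ID.
child = new_traceparent(parse_traceparent(seen_by_child)["trace_id"])
```

Passing the context through the environment is chosen here only because batch job steps are often separate processes; a real deployment might carry it in job metadata or message headers instead.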
Findings
Effective application-level metrics collection in HPC
DataFrames and Jupyter enhance analysis of scientific workflows
Solution successfully evaluated on medical scientific pipelines
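As a rough illustration of the DataFrame-based analysis mentioned in the findings, the sketch below loads a few span records into a pandas DataFrame and summarizes them per workflow step. The records, field names, and step names are invented for this example and do not come from the paper's evaluation.

```python
import pandas as pd

# Hypothetical span records; field names are assumptions for this sketch.
spans = [
    {"trace_id": "t1", "step": "preprocess", "start_s": 0.0, "end_s": 4.0},
    {"trace_id": "t1", "step": "simulate", "start_s": 4.0, "end_s": 34.0},
    {"trace_id": "t2", "step": "preprocess", "start_s": 0.0, "end_s": 6.0},
    {"trace_id": "t2", "step": "simulate", "start_s": 6.0, "end_s": 26.0},
]

df = pd.DataFrame(spans)
df["duration_s"] = df["end_s"] - df["start_s"]

# Mean duration per workflow step, slowest first.
per_step = df.groupby("step")["duration_s"].mean().sort_values(ascending=False)

# Share of each run's total time spent in each step.
run_total = df.groupby("trace_id")["duration_s"].transform("sum")
df["share"] = df["duration_s"] / run_total
```

In a Jupyter notebook, `per_step` and `df` can be displayed directly or filtered further, which is the kind of ad hoc questioning a fixed dashboard does not easily support.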
Abstract
As software systems increase in complexity, conventional monitoring methods struggle to provide a comprehensive overview or identify performance issues, often missing unexpected problems. Observability, however, offers a holistic approach, providing methods and tools that gather and analyze detailed telemetry data to uncover hidden issues. Originally developed for cloud-native systems, modern observability is less prevalent in scientific computing, particularly in HPC clusters, due to differences in application architecture, execution environments, and technology stacks. This paper proposes and evaluates an end-to-end observability solution tailored for scientific computing in HPC environments. We address several challenges, including collection of application-level metrics, instrumentation, context propagation, and tracing. We argue that typical dashboards with charts are not…
Taxonomy
Topics: Scientific Computing and Data Management
