LIKWID Monitoring Stack: A flexible framework enabling job specific performance monitoring for the masses
Thomas Röhl, Jan Eitzinger, Georg Hager, Gerhard Wellein

TL;DR
The LIKWID Monitoring Stack is a modular framework that enables job-specific performance monitoring in HPC systems by integrating hardware performance data, system metrics, and application data, enhancing system efficiency and user feedback.
Contribution
It introduces a flexible, modular framework that connects hardware performance monitoring with job information, facilitating job-aware performance analysis in HPC environments.
Findings
Supports integration with existing monitoring tools
Enables detailed, job-specific performance insights
Improves detection of performance issues
Abstract
System monitoring is an established tool to measure the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological cases, provides instant performance feedback to users, offers initial data to judge the optimization potential of applications, and helps to build a statistical foundation about application-specific system usage. The LIKWID monitoring stack is a modular framework built on top of the LIKWID tools library. It aims at enabling job-specific performance monitoring using HPM data, system metrics, and application-level data for small to medium-sized commodity clusters. Moreover, it is designed…
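The core idea in the abstract, connecting node-level measurements to job information, can be illustrated with a minimal Python sketch. It is not the LIKWID monitoring stack's implementation: the `Job` and `Sample` types, the `tag_sample` helper, and the metric name `flops_dp` are hypothetical names chosen for illustration. The sketch assumes each sample carries the host it came from and a timestamp, and that the scheduler can report which hosts a job occupied and when.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass
class Job:
    """Hypothetical job record, as a scheduler might report it."""
    job_id: str
    user: str
    hosts: FrozenSet[str]
    start: float                 # start time in seconds
    end: Optional[float] = None  # None while the job is still running


@dataclass
class Sample:
    """Hypothetical metric sample collected on one node."""
    host: str
    timestamp: float
    metric: str
    value: float
    tags: Dict[str, str] = field(default_factory=dict)


def tag_sample(sample: Sample, jobs: Iterable[Job]) -> Sample:
    """Attach job tags to a sample if a job ran on its host at that time."""
    for job in jobs:
        started = job.start <= sample.timestamp
        not_finished = job.end is None or sample.timestamp <= job.end
        if sample.host in job.hosts and started and not_finished:
            sample.tags.update(job_id=job.job_id, user=job.user)
            break  # assumes exclusive node allocation: one job per host
    return sample


if __name__ == "__main__":
    jobs = [Job("1234", "alice", frozenset({"node01", "node02"}), 100.0, 200.0)]
    tagged = tag_sample(Sample("node01", 150.0, "flops_dp", 1.0e9), jobs)
    print(tagged.tags)
```

Once samples carry a job identifier, a monitoring backend can filter or aggregate them per job rather than per node. The early `break` reflects the assumption of exclusive node allocation; on shared nodes a sample would need to be attributed per core or per process instead.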
