Comprehensive Resource Measurement and Analysis for HPC Systems with TACC_Stats
Charng-Da Lu

TL;DR
This paper introduces TACC_Stats, a job-oriented performance measurement tool for HPC systems, enabling more effective analysis of resource usage and system behavior from diverse data sources.
Contribution
The paper presents TACC_Stats, a novel, job-oriented performance monitoring system that simplifies resource analysis in complex HPC environments.
Findings
TACC_Stats effectively analyzes system performance data from the Ranger supercomputer.
It enables job-oriented analysis from disparate data sources.
Two case studies demonstrate its usefulness.
Abstract
High-performance computing (HPC) systems are a complex combination of software, processors, memory, networks, and storage systems characterized by frequent disruptive technological advances. Anomalous behavior has to be manually diagnosed and remedied with incomplete and sparse data. It has also been effort-intensive for users to assess how effectively they are using the available resources. The data available for system-level analyses come from multiple sources and in disparate formats (from Linux "sysstat" and accounting to scheduler/kernel logs). Sysstat does not resolve its measurements by job, so job-oriented analyses require individual measurements. There are many user-oriented performance instrumentation and profiling tools, but they require extensive system knowledge, code changes, and recompilation, and thus are not widely used. To address this issue, we…
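To make the job-attribution problem concrete, here is a minimal sketch of how host-level samples (the kind sysstat produces) might be joined with accounting records to obtain per-job figures. It does not show how TACC_Stats works. The record layouts, host names, job IDs, and the function names `attribute_samples` and `mean_per_job` are all hypothetical, and the sketch assumes each node runs at most one job at a time.

```python
from collections import defaultdict

# Hypothetical accounting records: (job_id, host, start_time, end_time), epoch seconds.
jobs = [
    ("job-1", "node01", 100, 200),
    ("job-2", "node01", 200, 300),
    ("job-1", "node02", 100, 200),
]

# Hypothetical host-level samples: (host, timestamp, cpu_busy_percent).
samples = [
    ("node01", 110, 90.0),
    ("node01", 190, 70.0),
    ("node01", 250, 20.0),
    ("node02", 150, 50.0),
    ("node02", 400, 5.0),  # no job on the host at this time; left unattributed
]


def attribute_samples(jobs, samples):
    """Group sample values by the job that occupied the host at sample time.

    Assumes exclusive node allocation (one job per host at any instant).
    """
    per_job = defaultdict(list)
    for host, ts, value in samples:
        for job_id, job_host, start, end in jobs:
            # Half-open interval [start, end): a boundary sample maps to one job only.
            if host == job_host and start <= ts < end:
                per_job[job_id].append(value)
                break
    return dict(per_job)


def mean_per_job(per_job):
    """Average the attributed sample values for each job."""
    return {job_id: sum(vals) / len(vals) for job_id, vals in per_job.items()}


per_job = attribute_samples(jobs, samples)
averages = mean_per_job(per_job)
print(averages)
```

The nested loop is quadratic and only suitable for illustration. With realistic data volumes, an interval index or a sorted merge by host and timestamp would be a more reasonable choice.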
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed Systems and Fault Tolerance
