Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool
Sungsoo Ha, Wonyong Jeong, Gyorgy Matyasfalvi, Cong Xie, Kevin Huck, Jong Youl Choi, Abid Malik, Li Tang, Hubertus Van Dam, Line Pouchard, Wei Xu, Shinjae Yoo, Nicholas D'Imperio, Kerstin Kleese Van Dam

TL;DR
Chimbuko is a pioneering real-time, distributed performance analysis framework for high-performance computing workflows, enabling scalable anomaly detection and data reduction to facilitate online performance diagnostics.
Contribution
It introduces the first online, distributed, scalable workflow-level performance trace analysis tool with real-time anomaly detection and visualization capabilities.
Findings
Effective real-time anomaly detection on the Summit system
Significant data volume reduction without losing critical details
Enhanced online performance monitoring for complex workflows
Abstract
Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace data needed to detect potential problems. This work introduces Chimbuko, a performance analysis framework that provides real-time, distributed, in situ anomaly detection. Data volumes are reduced for human-level processing without losing necessary details. Chimbuko supports online performance monitoring via a visualization module that presents the overall workflow anomaly distribution, call stacks, and timelines. Chimbuko also supports the capture and reduction of performance provenance. To…
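The abstract does not say which detection rule Chimbuko applies, so the following is only a minimal sketch of the general idea of streaming anomaly detection combined with data reduction. It assumes, purely for illustration, that each trace event is a (function name, runtime) pair and that an event counts as anomalous when its runtime lies more than k standard deviations from the running mean for that function. The names RunningStats and filter_anomalies, the threshold k, and the warmup parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online mean/variance for one function's runtimes."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Population standard deviation; 0.0 until two samples are seen.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


def filter_anomalies(events, k=3.0, warmup=5):
    """Return only the events flagged as anomalous (assumed k-sigma rule).

    events: iterable of (function_name, runtime) pairs, processed in order.
    warmup: samples required per function before anything can be flagged.
    """
    stats = defaultdict(RunningStats)
    kept = []
    for name, runtime in events:
        s = stats[name]
        sigma = s.std()
        if s.n >= warmup and sigma > 0 and abs(runtime - s.mean) > k * sigma:
            kept.append((name, runtime))
        # Every event updates the statistics, but only outliers are stored.
        s.update(runtime)
    return kept


trace = [("solve", t) for t in (1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 10.0)]
trace.append(("io", 2.0))
reduced = filter_anomalies(trace)
```

In this sketch the statistics are updated in a single pass with constant memory per function, and only the flagged events are retained, which is one simple way a trace stream could be reduced for human inspection. A distributed deployment would additionally need to merge statistics across processes, which is not shown here.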
Taxonomy
Topics: Software System Performance and Reliability · Scientific Computing and Data Management · Anomaly Detection Techniques and Applications
