Understanding Large-Scale HPC System Behavior Through Cluster-Based Visual Analytics
Allison Austin, Shilpika, Yan To Linus Lam, Yun-Hsin Kuo, Venkatram Vishwanath, Michael E. Papka, Kwan-Liu Ma

TL;DR
This paper introduces a scalable visual analytics system for exploring and understanding behaviors of compute nodes in large HPC systems, aiding anomaly detection and interpretation.
Contribution
It presents an integrated analysis workflow combining dimensionality reduction, contrastive learning, and dynamic mode decomposition within an interactive visualization interface.
Findings
Automatically identified meaningful node clusters.
Revealed subtle behavioral differences within and across node groups.
Expert feedback confirmed improved anomaly detection and interpretation.
Abstract
In high-performance computing (HPC) environments, system monitoring data is often unlabeled and high-dimensional, making it difficult to reliably detect and understand anomalous computing nodes. The growing scale and dimensionality of the collected datasets present significant challenges for analysis and visualization tasks. We present a scalable, interactive visual analytics system to support exploration, explanation, and comparison of compute node behaviors in HPC systems. Our approach integrates an analysis workflow combining two-phase dimensionality reduction with contrastive learning and multi-resolution dynamic mode decomposition to capture inter- and intra-cluster variations. These analyses are embedded in an interactive interface that enables users to explore clusters, compare temporal patterns, and iteratively refine hypotheses through customizable visual encodings and…
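As a rough illustration of the decomposition step named in the abstract, the sketch below implements plain (exact) dynamic mode decomposition with NumPy. It is not the authors' implementation: the paper uses a multi-resolution variant, whereas this is the basic single-resolution form. The function name `exact_dmd`, the synthetic two-mode signal, and the choice of rank are all assumptions made for this example.

```python
import numpy as np


def exact_dmd(X, rank):
    """Basic exact DMD (hypothetical helper, not from the paper).

    X is a snapshot matrix of shape (n_features, n_timesteps); column k
    holds all measurements at time step k.
    """
    X1, X2 = X[:, :-1], X[:, 1:]  # snapshots and their one-step successors

    # Rank-truncated SVD of the first snapshot set.
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]

    # Reduced linear operator: U^* X2 V S^{-1}.
    # Dividing by s scales column j by 1 / s[j].
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / s

    # Eigenvalues give per-mode growth/decay and oscillation per step.
    eigvals, W = np.linalg.eig(A_tilde)

    # Exact DMD modes lifted back to the full feature space.
    modes = X2 @ Vh.conj().T / s @ W
    return eigvals, modes


# Synthetic stand-in for node telemetry (an assumption, not real HPC data):
# five metrics driven by two decaying latent signals.
rng = np.random.default_rng(0)
steps = np.arange(30)
latent = np.vstack([0.9 ** steps, 0.7 ** steps])  # shape (2, 30)
mixing = rng.standard_normal((5, 2))              # shape (5, 2)
X = mixing @ latent                               # shape (5, 30)

eigvals, modes = exact_dmd(X, rank=2)
# The recovered eigenvalues are expected to be close to 0.7 and 0.9.
print(np.sort(eigvals.real))
```

In this toy setup each eigenvalue corresponds to one latent decay rate, which is the kind of per-mode temporal summary that DMD-based analyses build on. A multi-resolution variant would apply a similar decomposition recursively over shorter time windows.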
