dynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems
Niyazi Sorkunlu, Duc Thanh Anh Luong, Varun Chandola

TL;DR
This paper introduces dynamicMF, a tensor-based matrix factorization technique that reduces high-dimensional HPC resource data to low-dimensional signals for real-time system monitoring and anomaly detection.
Contribution
The paper presents a novel dynamic matrix factorization method for low-dimensional representation of HPC resource data, enabling efficient anomaly detection.
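The idea of reducing multi-dimensional resource data to a low-dimensional signal per node can be illustrated with a generic low-rank projection. The sketch below is not the dynamicMF algorithm. It substitutes a plain truncated SVD per node and a median/MAD threshold for the paper's method. The function names, the (nodes, resources, time) tensor layout, the threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def node_signals(usage, k=1):
    """Reduce each node's resource readings to a k-dimensional signal over time.

    usage: array of shape (nodes, resources, time) -- an assumed layout.
    Returns an array of shape (nodes, time, k).
    """
    n_nodes, n_res, n_time = usage.shape
    out = np.empty((n_nodes, n_time, k))
    for i in range(n_nodes):
        X = usage[i].T                       # (time, resources)
        Xc = X - X.mean(axis=0)              # center each resource column
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        out[i] = Xc @ Vt[:k].T               # project onto the top-k directions
    return out

def flag_anomalies(signal, threshold=5.0):
    """Flag time steps whose robust z-score (median/MAD) exceeds the threshold."""
    med = np.median(signal)
    mad = np.median(np.abs(signal - med))
    scale = 1.4826 * mad if mad > 0 else 1.0  # MAD scaled to match a normal std
    return np.abs(signal - med) / scale > threshold

# Synthetic data (hypothetical): 4 nodes, 3 resources, 200 time steps.
rng = np.random.default_rng(0)
usage = rng.normal(0.5, 0.05, size=(4, 3, 200))
usage[2, :, 120] += 2.0                      # inject a spike on node 2 at step 120

signals = node_signals(usage)
flags = flag_anomalies(signals[2, :, 0])
print(np.flatnonzero(flags))
```

A static SVD recomputes the projection from the whole history, so it only stands in for the dimensionality-reduction step; the dynamic factorization the paper describes would need a different formulation.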
Findings
Identified anomalies correlate with actual system events.
The method effectively reduces multi-dimensional resource data to low-dimensional signals.
The low-dimensional signals improve real-time monitoring capabilities for HPC systems.
Abstract
High performance computing (HPC) facilities consist of a large number of interconnected computing units (or nodes) that execute highly complex scientific simulations to support scientific research. Monitoring such facilities in real time is essential to ensure that the system operates at peak efficiency. Such systems are typically monitored using a variety of measurement and log data, which capture the state of the various components within the system at regular intervals. As modern HPC systems grow in capacity and complexity, current resource monitoring tools produce data at a scale that analysts can no longer feasibly monitor visually. We propose a method that transforms the multi-dimensional output of resource monitoring tools to a low-dimensional representation that facilitates the understanding of the behavior of a High Performance Computing (HPC)…
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
