Leveraging Hardware Performance Counters for Predicting Workload Interference in Vector Supercomputers
Shubham, Keichi Takahashi, and Hiroyuki Takizawa

TL;DR
This paper presents a machine learning-based approach using hardware performance counters to predict workload interference in heterogeneous vector supercomputers, aiming to improve resource management and system efficiency.
Contribution
It introduces a novel predictive model that leverages hardware performance counters and machine learning to forecast workload interference in SX-AT systems.
Findings
The model accurately predicts performance degradation caused by resource contention.
Hardware performance counters are effective inputs for interference prediction.
These insights support improved job scheduling and resource allocation strategies.
Abstract
In the rapidly evolving domain of high-performance computing (HPC), heterogeneous architectures such as the SX-Aurora TSUBASA (SX-AT) system architecture, which integrate diverse processor types, present both opportunities and challenges for optimizing resource utilization. This paper investigates workload interference within an SX-AT system, with a specific focus on resource contention between Vector Hosts (VHs) and Vector Engines (VEs). Through comprehensive empirical analysis, the study identifies key factors contributing to performance degradation, such as cache and memory bandwidth contention, when jobs with varying computational demands share resources. To address these issues, we develop a predictive model that leverages hardware performance counters (HCs) and machine learning (ML) algorithms to classify and predict workload interference. Our results demonstrate that the model…
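As a rough illustration of the idea described in the abstract (classifying co-run interference from hardware counter readings), the following minimal sketch labels co-runs by measured slowdown and fits a nearest-centroid classifier. The summary does not specify the authors' features, labels, or ML algorithm, so everything here is an assumption: the feature names (`llc_miss_rate`, `mem_bw_util`), the 1.1 slowdown threshold, the nearest-centroid method, and the synthetic numbers are hypothetical stand-ins, not values from the paper.

```python
from statistics import mean

def slowdown_label(solo_time, corun_time, threshold=1.1):
    """Label a co-run as interfering when its slowdown exceeds the threshold.

    The 1.1 threshold is a hypothetical choice, not taken from the paper.
    """
    return "interference" if corun_time / solo_time > threshold else "no_interference"

def fit_centroids(samples, labels):
    """Compute one mean feature vector per class (nearest-centroid training)."""
    centroids = {}
    for label in set(labels):
        rows = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = tuple(mean(col) for col in zip(*rows))
    return centroids

def predict(centroids, sample):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], sample))

# Synthetic, illustrative data only:
# ((llc_miss_rate, mem_bw_util), solo_time_s, corun_time_s)
runs = [
    ((0.05, 0.20), 100.0, 102.0),
    ((0.10, 0.30), 100.0, 105.0),
    ((0.40, 0.80), 100.0, 130.0),
    ((0.50, 0.90), 100.0, 150.0),
]

features = [f for f, _, _ in runs]
labels = [slowdown_label(solo, corun) for _, solo, corun in runs]
model = fit_centroids(features, labels)

# Predict for two unseen (hypothetical) counter profiles.
high_contention = predict(model, (0.45, 0.85))
low_contention = predict(model, (0.08, 0.25))
print(high_contention, low_contention)
```

In practice a scheduler could consult such a prediction before co-locating two jobs, but the choice of counters and classifier would need to follow the paper's actual methodology.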
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
