The MIT Supercloud Workload Classification Challenge

Benny J. Tang; Qiqi Chen; Matthew L. Weiss; Nathan Frey; Joseph McDonald; David Bestor; Charles Yee; William Arcand; Chansup Byun; Daniel Edelman; Matthew Hubbell; Michael Jones; Jeremy Kepner; Anna Klein; Adam Michaleas; Peter Michaleas; Lauren Milechin; Julia Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Andrew Bowne; Lindsey McEvoy; Baolin Li; Devesh Tiwari; Vijay Gadepally; Siddharth Samsi

arXiv:2204.05839·cs.DC·September 12, 2022

TL;DR

This paper introduces a workload classification challenge using the MIT Supercloud Dataset to improve AI and ML workload identification for better resource management in HPC and cloud environments.

Contribution

It provides a labeled dataset and initial results to foster new algorithms for workload classification in heterogeneous datacenter environments.

Findings

01. Initial classification results demonstrate potential for improved accuracy.

02. The dataset enables development of AI-based workload identification methods.

03. Public availability of data and code supports further research.

Abstract

High-Performance Computing (HPC) centers and cloud providers support an increasingly diverse set of applications on heterogeneous hardware. As Artificial Intelligence (AI) and Machine Learning (ML) workloads have become an increasingly large share of the compute workloads, new approaches to optimized resource usage, allocation, and deployment of new AI frameworks are needed. By identifying compute workloads and their utilization characteristics, HPC systems may be able to better match available resources with the application demand. By leveraging datacenter instrumentation, it may be possible to develop AI-based approaches that can identify workloads and provide feedback to researchers and datacenter operators for improving operational efficiency. To enable this research, we released the MIT Supercloud Dataset, which provides detailed monitoring logs from the MIT Supercloud cluster.…
