The MIT Supercloud Dataset
Siddharth Samsi, Matthew L Weiss, David Bestor, Baolin Li, Michael Jones, Albert Reuther, Daniel Edelman, William Arcand, Chansup Byun, John Holodnack, Matthew Hubbell, Jeremy Kepner, Anna Klein, Joseph McDonald, Adam Michaleas, Peter Michaleas, Lauren Milechin, Julia Mullen

TL;DR
The MIT Supercloud Dataset provides detailed logs of HPC and datacenter operations to support AI/ML research on resource management, energy efficiency, and failure prediction in large-scale computing environments.
Contribution
The paper introduces a comprehensive dataset of system metrics and logs from the MIT Supercloud to enable AI-driven analysis of HPC and cloud operations.
Findings
Dataset includes CPU, GPU, memory, and file system logs.
Facilitates development of AI models for resource optimization.
Supports research on failure prediction and policy violations.
Abstract
Artificial intelligence (AI) and machine learning (ML) workloads are an increasingly large share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in deployment approaches of HPC clusters and the commercial cloud, as well as a new focus on approaches to optimized resource usage, allocations and deployment of new AI frameworks, and capabilities such as Jupyter notebooks to enable rapid prototyping and deployment. With these changes, there is a need to better understand cluster/datacenter operations with the goal of developing improved scheduling policies, identifying inefficiencies in resource utilization, energy/power consumption, failure prediction, and identifying policy violations. In this paper we introduce the MIT Supercloud Dataset, which aims to foster innovative AI/ML approaches to the…
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Graph Theory and Algorithms
