Evolving HPC services to enable ML workloads on HPE Cray EX
Stefano Schuppli, Fawzi Mohamed, Henrique Mendonça, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Thomas C. Schulthess, Torsten Hoefler

TL;DR
This paper explores extending HPC services on the Alps infrastructure to better support machine learning workloads, addressing community needs through technological enhancements and infrastructure adaptations.
Contribution
The paper proposes technological enhancements for running ML workloads on Alps, including a user environment that balances performance with flexibility and a utility for rapid performance screening of ML applications during development.
Findings
Enhanced user environments facilitate ML adoption
Performance screening utilities improve application development
New storage and observability tools support large-scale ML workloads
Abstract
The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the early-access phase (2023) of Alps by the Swiss AI community and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development;…
Taxonomy
Topics: Scientific Computing and Data Management · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
