Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers
David Brayford, Sofia Vallecorsa

TL;DR
This paper demonstrates the deployment of a standard machine learning framework on a secure petaflop-scale HPC system to train a complex 3D convolutional GAN for high energy physics simulations.
Contribution
It presents a successful approach to deploying ML frameworks at petaflop scale on secure HPC systems, enabling large-scale scientific AI applications.
Findings
Achieved petaflop performance training of 3DGAN
Demonstrated deployment on secure large-scale HPC systems
Enabled advanced scientific simulations in high energy physics
Abstract
There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with deploying machine learning frameworks on large-scale secure HPC systems, and how we successfully deployed a standard machine learning framework on a secure large-scale HPC production system to train a complex three-dimensional convolutional GAN (3DGAN) with petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.
Taxonomy
Topics: Particle Detector Development and Performance · Advanced Data Storage Technologies · Particle physics theoretical and experimental studies
