Deploying Scientific AI Networks at Petaflop Scale on Secure Large Scale HPC Production Systems with Containers

David Brayford; Sofia Vallercorsa

arXiv:2005.10676 · cs.DC · May 22, 2020

TL;DR

This paper demonstrates the deployment of a standard machine learning framework on a secure petaflop-scale HPC system to train a complex 3D convolutional GAN for high energy physics simulations.

Contribution

It presents a successful approach to deploying ML frameworks at petaflop scale on secure HPC systems, enabling large-scale scientific AI applications.

Findings

01. Achieved petaflop performance when training 3DGAN

02. Demonstrated deployment on secure large-scale HPC systems

03. Enabled advanced scientific simulations in high energy physics

Abstract

There is an ever-increasing need for computational power to train complex artificial intelligence (AI) and machine learning (ML) models to tackle large scientific problems. High performance computing (HPC) resources are required to efficiently compute and scale complex models across tens of thousands of compute nodes. In this paper, we discuss the issues associated with deploying machine learning frameworks on large-scale secure HPC systems, and how we successfully deployed a standard machine learning framework on a secure large-scale HPC production system to train a complex three-dimensional convolutional GAN (3DGAN) with petaflop performance. 3DGAN is an example from the high energy physics domain, designed to simulate the energy pattern produced by showers of secondary particles inside a particle detector on various HPC systems.

Taxonomy

Topics: Particle Detector Development and Performance · Advanced Data Storage Technologies · Particle physics theoretical and experimental studies