Python Workflows on HPC Systems
Dominik Strassel, Philipp Reusch, Janis Keuper

TL;DR
This paper discusses the challenges of running Python workflows on HPC systems, especially for deep learning on GPU clusters, and proposes solutions for managing environments, security, and resource control.
Contribution
It identifies key issues with Python on HPC and offers practical workarounds for environment management, security, and resource containment in multi-user, GPU-accelerated settings.
Findings
Identified challenges of Python in multi-user HPC environments.
Proposed solutions for environment management and security.
Focused on deep learning applications on GPU clusters.
Abstract
The recent successes and widespread application of compute-intensive machine learning and data analytics methods have been boosting the usage of the Python programming language on HPC systems. While Python provides many advantages for users, it was not designed with a focus on multi-user environments or parallel programming, which makes it quite challenging to maintain stable and secure Python workflows on an HPC system. In this paper, we analyze the key problems caused by the use of Python on HPC clusters and sketch appropriate workarounds for efficiently maintaining multi-user Python software environments, securing and restricting the resources of Python jobs, and containing Python processes, with a focus on Deep Learning applications running on GPU clusters.
Taxonomy
Topics: Computational Physics and Python Applications · Scientific Computing and Data Management · Advanced Data Storage Technologies
