Balsam: Automated Scheduling and Execution of Dynamic, Data-Intensive HPC Workflows
Michael A. Salim, Thomas D. Uram, J. Taylor Childers, Prasanna Balaprakash, Venkatram Vishwanath, Michael E. Papka

TL;DR
Balsam is a Python-based service that automates scheduling, execution, and fault tolerance for dynamic, data-intensive workflows on supercomputers, improving resource utilization and reducing scripting overhead.
Contribution
The paper introduces a system that manages complex workflows dynamically without requiring modifications to user applications, improving efficiency and fault tolerance.
Findings
Efficient resource utilization demonstrated in case studies.
Reduced scripting overhead for workflow management.
Supports dynamic creation and termination of tasks at runtime.
Abstract
We introduce the Balsam service to manage high-throughput task scheduling and execution on supercomputing systems. Balsam allows users to populate a task database with a variety of tasks ranging from simple independent tasks to dynamic multi-task workflows. With abstractions for the local resource scheduler and MPI environment, Balsam dynamically packages tasks into ensemble jobs and manages their scheduling lifecycle. The ensembles execute in a pilot "launcher" which (i) ensures concurrent, load-balanced execution of arbitrary serial and parallel programs with heterogeneous processor requirements, (ii) requires no modification of user applications, (iii) is tolerant of task-level faults and provides several options for error recovery, (iv) stores provenance data (e.g., task history, error logs) in the database, (v) supports dynamic workflows, in which tasks are created or killed at…
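To make the launcher idea in the abstract concrete, here is a minimal simulation sketch of a pilot job that packs tasks with different node requirements onto a fixed allocation and retries tasks that fail. It is not Balsam's API or its actual scheduling algorithm: the `Task` fields, the `run_ensemble` function, the largest-first packing rule, and the state names are all hypothetical, chosen only for illustration, and task runtimes and failures are simulated rather than launched as real processes.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; these fields are NOT Balsam's schema."""
    name: str
    nodes: int           # nodes the task needs while running
    runtime: int         # simulated duration, in abstract time units
    fail_times: int = 0  # simulated: how many runs fail before one succeeds
    retries: int = 0     # how many times the launcher has retried the task
    state: str = "READY"


def run_ensemble(tasks, total_nodes, max_retries=1):
    """Simulate a pilot launcher on an allocation of `total_nodes` nodes.

    Returns (makespan, history), where history is a list of
    (time, task_name, event) tuples, a stand-in for provenance data.
    """
    free = total_nodes
    clock = 0
    seq = 0              # tie-breaker so the heap never compares Task objects
    running = []         # min-heap of (finish_time, seq, task)
    pending = list(tasks)
    history = []

    while pending or running:
        # Start every pending task that fits in the free nodes, largest first
        # (an assumed packing rule, not necessarily what Balsam does).
        pending.sort(key=lambda t: -t.nodes)
        waiting = []
        for task in pending:
            if task.nodes <= free:
                free -= task.nodes
                task.state = "RUNNING"
                heapq.heappush(running, (clock + task.runtime, seq, task))
                seq += 1
                history.append((clock, task.name, "RUNNING"))
            else:
                waiting.append(task)
        pending = waiting

        if not running:
            # Nothing is running and nothing fits: these tasks need more
            # nodes than the allocation has, so they can never start.
            for task in pending:
                task.state = "FAILED"
                history.append((clock, task.name, "FAILED"))
            break

        # Advance simulated time to the next task completion.
        clock, _, task = heapq.heappop(running)
        free += task.nodes
        if task.fail_times > 0:
            # Task-level fault: the ensemble keeps going; retry if allowed.
            task.fail_times -= 1
            if task.retries < max_retries:
                task.retries += 1
                task.state = "READY"
                pending.append(task)
                history.append((clock, task.name, "RETRY"))
            else:
                task.state = "FAILED"
                history.append((clock, task.name, "FAILED"))
        else:
            task.state = "DONE"
            history.append((clock, task.name, "DONE"))

    return clock, history
```

For example, calling `run_ensemble` with a 4-node allocation and a mix of 1-, 2-, and 4-node tasks runs the 2-node tasks side by side, and a single failing task is retried without disturbing the others. A real launcher would start processes through the MPI environment and persist state in a database rather than an in-memory list.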
Taxonomy
Topics: Scientific Computing and Data Management · Machine Learning in Materials Science · Cloud Computing and Resource Management
