Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows
Jonathan Bader, Nicolas Zunker, Soeren Becker, and Odej Kao

TL;DR
This paper introduces reinforcement learning methods to optimize resource allocation in scientific workflows, reducing wastage and improving efficiency compared to traditional and existing feedback-based approaches.
Contribution
It presents two novel RL approaches for resource allocation in scientific workflows, implemented in Nextflow, and demonstrates significant improvements over baseline methods.
Findings
Reinforcement learning approaches significantly reduce resource wastage.
The methods decrease CPU hours compared to a feedback-loop baseline.
The approaches outperform the default resource configurations.
Abstract
Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over a large amount of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. In order to optimize the workflow performance, enough resources, e.g., CPU and memory, need to be provisioned for the respective tasks. Typically, workflow systems rely on user resource estimates which are known to be highly error-prone and can result in over- or underprovisioning. While resource overprovisioning leads to high resource wastage, underprovisioning can result in long runtimes or even failed tasks. In this paper, we propose two different reinforcement learning approaches based on gradient bandits and Q-learning, respectively, in order to minimize resource wastage by selecting…
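To make the gradient-bandit idea in the abstract concrete, the following is a minimal sketch of how a softmax preference bandit could pick a memory allocation for one task type. It is not the authors' implementation and does not use Nextflow. The candidate allocations, the fixed peak usage of 3 GB, the reward function (a penalty of -1 for an underprovisioned, failed task, otherwise the negative fraction of wasted memory), and all names such as `GradientBandit` and `train_toy_bandit` are assumptions made for illustration only.

```python
import math
import random


class GradientBandit:
    """Softmax preference bandit (textbook gradient-bandit update)."""

    def __init__(self, n_actions, step_size=0.1, seed=0):
        self.preferences = [0.0] * n_actions  # one preference per action
        self.step_size = step_size
        self.avg_reward = 0.0  # running mean of past rewards (baseline)
        self.steps = 0
        self.rng = random.Random(seed)

    def policy(self):
        """Return softmax probabilities over the action preferences."""
        m = max(self.preferences)  # subtract max for numerical stability
        exps = [math.exp(h - m) for h in self.preferences]
        total = sum(exps)
        return [e / total for e in exps]

    def select(self):
        """Sample an action index according to the current policy."""
        probs = self.policy()
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, action, reward):
        """Shift preferences toward actions that beat the baseline."""
        probs = self.policy()
        advantage = reward - self.avg_reward  # baseline excludes this reward
        for a in range(len(self.preferences)):
            indicator = 1.0 if a == action else 0.0
            self.preferences[a] += self.step_size * advantage * (indicator - probs[a])
        # Fold the new reward into the running mean afterwards.
        self.steps += 1
        self.avg_reward += (reward - self.avg_reward) / self.steps


# Hypothetical setup: candidate memory allocations and a fixed peak usage.
ALLOCATIONS_GB = [1, 2, 4, 8]
PEAK_USAGE_GB = 3


def toy_reward(allocated_gb):
    """Assumed reward: -1 on failure, else negative fraction of wasted memory."""
    if allocated_gb < PEAK_USAGE_GB:
        return -1.0  # underprovisioned: the task would fail
    return -(allocated_gb - PEAK_USAGE_GB) / max(ALLOCATIONS_GB)


def train_toy_bandit(n_steps=2000):
    """Run the bandit against the toy reward and return it."""
    bandit = GradientBandit(n_actions=len(ALLOCATIONS_GB))
    for _ in range(n_steps):
        action = bandit.select()
        bandit.update(action, toy_reward(ALLOCATIONS_GB[action]))
    return bandit


if __name__ == "__main__":
    trained = train_toy_bandit()
    for gb, p in zip(ALLOCATIONS_GB, trained.policy()):
        print(f"{gb} GB: probability {p:.3f}")
```

In this toy setting the smallest allocation that still covers the peak usage receives the highest reward, so the policy should concentrate on it over time. A real deployment would need rewards derived from measured task usage and failures rather than a fixed constant.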
Taxonomy
Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
