Leveraging Reinforcement Learning for Task Resource Allocation in Scientific Workflows

Jonathan Bader, Nicolas Zunker, Soeren Becker, and Odej Kao

arXiv:2211.12076·cs.DC·July 19, 2023

Open Access

TL;DR

This paper introduces reinforcement learning methods to optimize resource allocation in scientific workflows, reducing wastage and improving efficiency compared to traditional and existing feedback-based approaches.

Contribution

It presents two novel RL approaches for resource allocation in scientific workflows, implemented in Nextflow, and demonstrates significant improvements over baseline methods.

Findings

01 Reinforcement learning approaches significantly reduce resource wastage.

02 The proposed methods reduce CPU hours compared to the feedback-loop baseline.

03 Both approaches outperform the default resource configurations.

Abstract

Scientific workflows are designed as directed acyclic graphs (DAGs) and consist of multiple dependent task definitions. They are executed over a large amount of data, often resulting in thousands of tasks with heterogeneous compute requirements and long runtimes, even on cluster infrastructures. In order to optimize the workflow performance, enough resources, e.g., CPU and memory, need to be provisioned for the respective tasks. Typically, workflow systems rely on user resource estimates which are known to be highly error-prone and can result in over- or underprovisioning. While resource overprovisioning leads to high resource wastage, underprovisioning can result in long runtimes or even failed tasks. In this paper, we propose two different reinforcement learning approaches based on gradient bandits and Q-learning, respectively, in order to minimize resource wastage by selecting…
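
To make the gradient-bandit idea named in the abstract concrete, the following is a minimal sketch of a textbook gradient bandit that picks a memory allocation for a single task type. It is not the authors' implementation or their Nextflow integration. The action set (1, 2, 4, 8 GB), the fixed 3 GB peak usage, the `toy_reward` function, and all class and function names are assumptions chosen for illustration only.

```python
import math
import random


class GradientBandit:
    """Textbook gradient bandit: softmax policy over learned action preferences."""

    def __init__(self, actions, step_size=0.1, seed=0):
        self.actions = list(actions)          # hypothetical allocation choices
        self.step_size = step_size
        self.preferences = [0.0] * len(self.actions)
        self.baseline = 0.0                   # running average of past rewards
        self.steps = 0
        self.rng = random.Random(seed)

    def policy(self):
        # Numerically stable softmax over the preferences.
        m = max(self.preferences)
        exps = [math.exp(h - m) for h in self.preferences]
        total = sum(exps)
        return [e / total for e in exps]

    def select(self):
        # Sample an action index according to the current policy.
        probs = self.policy()
        return self.rng.choices(range(len(self.actions)), weights=probs)[0]

    def update(self, index, reward):
        probs = self.policy()
        advantage = reward - self.baseline
        for i in range(len(self.preferences)):
            indicator = 1.0 if i == index else 0.0
            self.preferences[i] += self.step_size * advantage * (indicator - probs[i])
        # Update the baseline after using it, so it excludes the current reward.
        self.steps += 1
        self.baseline += (reward - self.baseline) / self.steps


def toy_reward(allocated_gb, peak_gb):
    """Assumed reward: -1 if the task would fail, else minus the wasted fraction."""
    if allocated_gb < peak_gb:
        return -1.0                           # underprovisioned: task fails
    return -(allocated_gb - peak_gb) / allocated_gb


def run_toy_simulation(steps=2000, peak_gb=3.0):
    bandit = GradientBandit(actions=[1, 2, 4, 8])
    for _ in range(steps):
        idx = bandit.select()
        bandit.update(idx, toy_reward(bandit.actions[idx], peak_gb))
    return bandit


if __name__ == "__main__":
    trained = run_toy_simulation()
    for gb, p in zip(trained.actions, trained.policy()):
        print(f"{gb} GB: probability {p:.3f}")
```

In this toy setting the policy should shift toward the smallest allocation that still covers the peak usage, since both failures and large unused allocations are penalized. A real workflow system would observe peak usage only after a task runs and would need a reward that reflects its own cost model.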

Taxonomy

Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems