Lotaru: Locally Estimating Runtimes of Scientific Workflow Tasks in Heterogeneous Clusters

Jonathan Bader; Fabian Lehmann; Lauritz Thamsen; Jonathan Will; Ulf Leser; Odej Kao

arXiv:2205.11181 · cs.DC · May 24, 2022

TL;DR

Lotaru is an online method based on Bayesian regression that quickly estimates the runtimes of scientific workflow tasks on the nodes of a heterogeneous cluster, giving schedulers the per-node estimates they need without extensive historical data.

Contribution

The paper introduces a novel online approach that combines microbenchmark profiling of cluster nodes with Bayesian regression to estimate task runtimes locally for heterogeneous clusters.

Findings

01 Outperforms baseline methods in prediction accuracy.

02 Effective in both homogeneous and heterogeneous cluster environments.

03 Provides robust uncertainty estimates for scheduling.

Abstract

Many scientific workflow scheduling algorithms need to be informed about task runtimes a priori to conduct efficient scheduling. In heterogeneous cluster infrastructures, this problem is aggravated because these runtimes are required for each task-node pair. Using historical data is often not feasible, as logs are typically not retained indefinitely and workloads as well as infrastructure change. In contrast, online methods, which predict task runtimes on specific nodes while the workflow is running, have to cope with the lack of example runs, especially during start-up. In this paper, we present Lotaru, a novel online method for locally estimating task runtimes in scientific workflows on heterogeneous clusters. Lotaru first profiles all nodes of a cluster with a set of short-running and uniform microbenchmarks. Next, it runs the workflow to be scheduled on the user's local…
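
To make the combination of Bayesian regression and microbenchmark profiling more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a task's runtime is roughly linear in its input size, fits a textbook Bayesian linear regression (Gaussian prior, known noise precision) to a few hypothetical local measurements, and then rescales the local prediction by the ratio of benchmark scores between the local machine and a target node. The function names, the sample data, the benchmark scores, and the inverse-proportional scaling rule are all illustrative assumptions; the text above does not specify these details.

```python
from math import sqrt

def fit_bayes_linear(xs, ys, alpha=1e-6, beta=1.0):
    """Posterior over (intercept, slope) for y = w0 + w1 * x.

    Assumes a zero-mean Gaussian prior with precision alpha on the weights
    and Gaussian observation noise with known precision beta.
    """
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # Posterior precision matrix A = alpha * I + beta * Phi^T Phi (2x2).
    a11, a12, a22 = alpha + beta * n, beta * sx, alpha + beta * sxx
    det = a11 * a22 - a12 * a12
    # Posterior covariance S = A^{-1}.
    s11, s12, s22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean m = beta * S * Phi^T y.
    m0 = beta * (s11 * sy + s12 * sxy)
    m1 = beta * (s12 * sy + s22 * sxy)
    return (m0, m1), (s11, s12, s22)

def predict(model, x, beta=1.0):
    """Predictive mean and standard deviation of the runtime at input size x."""
    (m0, m1), (s11, s12, s22) = model
    mean = m0 + m1 * x
    # Noise variance plus parameter uncertainty phi^T S phi with phi = (1, x).
    var = 1.0 / beta + s11 + 2.0 * s12 * x + s22 * x * x
    return mean, sqrt(var)

def scale_to_node(local_runtime, local_score, target_score):
    """Assumed rule: runtime is inversely proportional to the benchmark score."""
    return local_runtime * local_score / target_score

# Hypothetical local measurements: input size (GB) -> runtime (s).
sizes = [1.0, 2.0, 3.0, 4.0]
runtimes = [12.0, 22.0, 32.0, 42.0]

model = fit_bayes_linear(sizes, runtimes)
local_mean, local_std = predict(model, 10.0)

# Hypothetical benchmark scores (higher means faster).
node_mean = scale_to_node(local_mean, local_score=100.0, target_score=200.0)
node_std = scale_to_node(local_std, local_score=100.0, target_score=200.0)
```

The predictive standard deviation grows as the queried input size moves away from the measured sizes, which is the kind of uncertainty signal a scheduler could take into account when only a few example runs are available.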
