# rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks

**Authors:** Ali Mohammed, Aurelien Cavelan, and Florina M. Ciorba

arXiv: 1905.08073 · 2019-10-07

## TL;DR

This paper introduces rDLB, a proactive, fault-tolerant dynamic load balancing method for scientific applications with independent tasks. On HPC systems under perturbations, it improves the robustness of self-scheduling techniques by up to 30 times and reduces execution time by up to 7 times.

## Contribution

The paper presents a novel proactive load balancing approach, rDLB, that does not rely on failure detection and enhances fault tolerance and performance in HPC environments.

## Key findings

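- rDLB tolerates up to (P-1) processor failures, where P is the number of processors executing the application.
- Under perturbations, rDLB increases the robustness of DLS techniques by up to 30 times compared to the same techniques without rDLB.
- Under perturbations, rDLB reduces application execution time by up to 7 times compared to the same techniques without rDLB.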

## Abstract

Scientific applications often contain large and computationally intensive parallel loops. Dynamic loop self-scheduling (DLS) is used to achieve a balanced load execution of such applications on high performance computing (HPC) systems. Large HPC systems are vulnerable to processor or node failures and to perturbations in the availability of resources. Most self-scheduling approaches either do not consider fault-tolerant scheduling, or depend on failure or perturbation detection and react by rescheduling failed tasks. In this work, a robust dynamic load balancing (rDLB) approach is proposed for the robust self-scheduling of independent tasks. The proposed approach is proactive and does not depend on failure or perturbation detection. The theoretical analysis of the proposed approach shows that it is linearly scalable and that its cost decreases quadratically as the system size increases. rDLB is integrated into an MPI DLS library to evaluate its performance experimentally with two computationally intensive scientific applications. Results show that rDLB enables the tolerance of up to (P-1) processor failures, where P is the number of processors executing an application. In the presence of perturbations, rDLB boosted the robustness of DLS techniques up to 30 times and decreased application execution time up to 7 times compared to their counterparts without rDLB.
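To make the idea of proactive, detection-free self-scheduling more concrete, the following is a minimal toy simulation, not the authors' MPI implementation. It assumes one plausible mechanism that this summary does not spell out: once every task has been handed out, idle workers are given tasks that were scheduled but never reported as finished. The function `run_self_scheduling`, its parameters, and the round-based timing model are all hypothetical and chosen only for illustration.

```python
from collections import deque


def run_self_scheduling(num_tasks, num_workers, failed_workers, robust):
    """Toy round-based model of self-scheduling independent tasks.

    Assumptions (illustrative only, not taken from the paper):
    - each task takes one round on a healthy worker;
    - a failed worker grabs one task, then dies silently;
    - with robust=True, idle workers are re-issued tasks that were
      scheduled but not yet finished, without any failure detection.
    Returns (number of finished tasks, number of rounds used).
    """
    unscheduled = deque(range(num_tasks))
    assigned = []      # tasks handed out at least once, in order
    finished = set()
    alive = [w for w in range(num_workers) if w not in failed_workers]

    # Each failed worker takes one task and never reports back.
    for _ in sorted(failed_workers):
        if unscheduled:
            assigned.append(unscheduled.popleft())

    rounds = 0
    while len(finished) < num_tasks:
        progressed = False
        for _ in alive:
            if unscheduled:
                task = unscheduled.popleft()
                assigned.append(task)
            elif robust:
                # Proactively duplicate a scheduled-but-unfinished task.
                pending = [t for t in assigned if t not in finished]
                if not pending:
                    break
                task = pending[0]
            else:
                continue  # nothing left to hand out; worker stays idle
            finished.add(task)
            progressed = True
        if not progressed:
            break  # no worker can make progress: the run would hang
        rounds += 1
    return len(finished), rounds


if __name__ == "__main__":
    # 4 workers, 3 of them fail (P-1 failures), 20 independent tasks.
    print(run_self_scheduling(20, 4, {1, 2, 3}, robust=True))   # (20, 20)
    print(run_self_scheduling(20, 4, {1, 2, 3}, robust=False))  # (17, 17)
```

In this toy model, the run with re-issuing completes all 20 tasks with a single surviving worker, while the run without it leaves the three tasks held by failed workers unfinished. The model ignores communication costs and chunk-size calculation, so it says nothing about the performance figures reported in the abstract.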

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.08073/full.md

## Figures

48 figures with captions in the complete paper: https://tomesphere.com/paper/1905.08073/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/1905.08073/full.md

---
Source: https://tomesphere.com/paper/1905.08073