Driving asynchronous distributed tasks with events
Nick Brown, Oliver Thomson Brown, J. Mark Bull

TL;DR
This paper introduces an event-driven approach to task-based parallelism for distributed computing. Programmers explicitly control how tasks interact through events while still working at a high level of abstraction, and the approach improves performance for large-scale applications.
Contribution
The paper presents a novel approach where programmers explicitly control task interactions via events in distributed systems, combining flexibility with abstraction.
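To make the idea of tasks driven by events concrete, here is a minimal single-process sketch in Python. It is not the paper's library or its API: the class `EventScheduler` and the methods `submit_task`, `fire_event`, and `run` are hypothetical names invented for illustration, and a real distributed implementation would send events between processes rather than through an in-memory queue. The sketch assumes that a task becomes runnable only once every event it depends on has fired, and that each fired event is consumed by at most one task.

```python
from collections import defaultdict, deque


class EventScheduler:
    """Toy scheduler (hypothetical, illustration only): a task runs once
    all of the named events it depends on have been fired."""

    def __init__(self):
        self._pending_events = defaultdict(deque)  # event id -> unconsumed payloads
        self._waiting = []     # tasks that still have unmet event dependencies
        self._ready = deque()  # tasks whose dependencies are all satisfied

    def submit_task(self, fn, *event_ids):
        """Register fn to run once every event in event_ids has fired."""
        task = {"fn": fn, "needs": list(event_ids), "got": {}}
        # Consume any matching events that fired before the task was submitted.
        for event_id in list(task["needs"]):
            if self._pending_events[event_id]:
                task["got"][event_id] = self._pending_events[event_id].popleft()
                task["needs"].remove(event_id)
        if task["needs"]:
            self._waiting.append(task)
        else:
            self._ready.append(task)

    def fire_event(self, event_id, payload=None):
        """Deliver an event to the first waiting task that needs it,
        or store it until such a task is submitted."""
        for task in self._waiting:
            if event_id in task["needs"]:
                task["needs"].remove(event_id)
                task["got"][event_id] = payload
                if not task["needs"]:
                    self._waiting.remove(task)
                    self._ready.append(task)
                return
        self._pending_events[event_id].append(payload)

    def run(self):
        """Run ready tasks until none remain; tasks may fire further events."""
        while self._ready:
            task = self._ready.popleft()
            task["fn"](self, task["got"])


log = []


def produce(sched, events):
    log.append("produce")
    sched.fire_event("data", 21)


def consume(sched, events):
    log.append("consume")
    sched.fire_event("result", events["data"] * 2)


def report(sched, events):
    log.append(f"report {events['result']}")


sched = EventScheduler()
# Submission order does not matter: the events determine execution order.
sched.submit_task(report, "result")
sched.submit_task(consume, "data")
sched.submit_task(produce)
sched.run()
print(log)
```

In this sketch the programmer states the interactions between tasks explicitly, as named events, and the scheduler decides when each task runs. The tasks are submitted in reverse order here, yet they execute in dependency order because each one waits for the event that the previous one fires.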
Findings
Improved performance at large core counts for benchmark and atmospheric modeling applications.
Demonstrated effectiveness of event-driven task management in real-world use cases.
Provided an open-source library applicable to diverse parallel applications.
Abstract
Open-source matters, not just to the current cohort of HPC users but also to potential new HPC communities, such as machine learning, themselves often rooted in open-source. Many of these potential new workloads are, by their very nature, far more asynchronous and unpredictable than traditional HPC codes and open-source solutions must be found to enable new communities of developers to easily take advantage of large scale parallel machines. Task-based models have the potential to help here, but many of these either entirely abstract the user from the distributed nature of their code, placing emphasis on the runtime to make important decisions concerning scheduling and locality, or require the programmer to explicitly combine their task-based code with a distributed memory technology such as MPI, which adds considerable complexity. In this paper we describe a new approach where the…
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
