Harnessing the Power of Many: Extensible Toolkit for Scalable Ensemble Applications
Vivek Balasubramanian, Matteo Turilli, Weiming Hu, Matthieu Lefebvre, Wenjie Lei, Guido Cervone, Jeroen Tromp, Shantenu Jha

TL;DR
The paper presents EnTK, a scalable, fault-tolerant toolkit designed to efficiently execute diverse ensemble applications across heterogeneous infrastructures, demonstrated through seismic inversion and analog ensemble use cases.
Contribution
EnTK introduces novel abstractions and scalable architecture for ensemble application execution, supporting heterogeneity and fault tolerance, with demonstrated performance at large scale.
Findings
EnTK scales efficiently to O(10^4) tasks, as characterized in strong and weak scaling experiments.
EnTK supports heterogeneous computing infrastructures.
EnTK enables fault-tolerant ensemble executions.
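As an illustration of what fault-tolerant ensemble execution can mean in practice, the following is a minimal sketch using only the Python standard library. It is not EnTK's API or implementation: the names `run_with_retries`, `run_ensemble`, and `flaky_square` are hypothetical, and the retry-on-failure policy is an assumption chosen for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retries(task, arg, max_attempts=3):
    """Run task(arg), retrying on failure; return (result, attempts_used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg), attempt
        except Exception as err:  # assumption: any exception is treated as transient
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def run_ensemble(task, members, max_workers=4, max_attempts=3):
    """Run one task per ensemble member concurrently; results keep member order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_with_retries, task, member, max_attempts)
            for member in members
        ]
        return [future.result() for future in futures]


# Hypothetical ensemble member: even inputs fail once, then succeed.
_already_failed = set()


def flaky_square(x):
    if x % 2 == 0 and x not in _already_failed:
        _already_failed.add(x)
        raise RuntimeError("simulated transient failure")
    return x * x


results = run_ensemble(flaky_square, range(6))
values = [value for value, _ in results]
attempts = [count for _, count in results]
```

Here each ensemble member is an independent task, and a failed member is resubmitted without affecting the others. A production toolkit would additionally have to handle resource acquisition, heterogeneous platforms, and persistent state, none of which this sketch attempts.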
Abstract
Many scientific problems require multiple distinct computational tasks to be executed in order to achieve a desired solution. We introduce the Ensemble Toolkit (EnTK) to address the challenges of scale, diversity and reliability they pose. We describe the design and implementation of EnTK, characterize its performance and integrate it with two distinct exemplar use cases: seismic inversion and adaptive analog ensembles. We perform nine experiments, characterizing EnTK overheads, strong and weak scalability, and the performance of two use case implementations, at scale and on production infrastructures. We show how EnTK meets the following general requirements: (i) implementing dedicated abstractions to support the description and execution of ensemble applications; (ii) support for execution on heterogeneous computing infrastructures; (iii) efficient scalability up to O(10^4) tasks; and…
