User Experiences with MPI RMA and ULFM in a Resilient Key-Value Store Implementation
Claudia Fohry, Rainer Fink

TL;DR
This paper describes the implementation of a resilient key-value store with MPI RMA and ULFM, and reports the challenges encountered and the workarounds developed while building fault-tolerant, MPI-based storage.
Contribution
It presents a novel MPI-based resilient key-value store design utilizing RMA and ULFM, detailing implementation challenges and practical workarounds.
Findings
Implementation was difficult because ULFM's support for RMA is incomplete.
Workarounds were developed to address missing ULFM features.
The store keeps redundant data copies, which lets the program recover after node failures.
Abstract
As hardware failures such as node losses become increasingly common, MPI programmers may want to save vulnerable data in a resilient store. While third-party storage solutions such as Redis or the Hazelcast IMap exist, a tailored, MPI-based store may be easier to integrate and can be optimized for particular application needs. This paper considers the implementation of such a store, which is intended as a component in a resilient task-based runtime system written in MPI. The store holds redundant data copies as key-value pairs in the main memories of multiple processes. Since store access operations, such as reads and writes, are naturally one-sided, we implemented the store with passive target MPI RMA functions. Process aborts are detected with the user-level failure mitigation (ULFM) extension of Open MPI. After failures, the program recovers on the surviving processes and continues…
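The redundancy idea in the abstract can be illustrated with a toy model. The sketch below is plain Python with no MPI, and it is not the authors' implementation: the class name `ReplicatedStore`, the CRC-based placement, the choice of two copies per key, and the re-placement step after a failure are all assumptions made for illustration. Each "rank" is a dictionary standing in for the memory that one process would expose for one-sided access.

```python
import zlib


class ReplicatedStore:
    """Toy in-process model of a store with redundant key-value copies.

    Hypothetical illustration only: placement and recovery are assumed,
    not taken from the paper.
    """

    def __init__(self, num_ranks, copies=2):
        self.copies = copies
        self.alive = list(range(num_ranks))
        # One dict per rank, standing in for that process's exposed memory.
        self.memory = {rank: {} for rank in self.alive}

    def owners(self, key):
        # Assumed placement: a hash picks a start position, and the copies
        # go to consecutive surviving ranks.
        start = zlib.crc32(key.encode()) % len(self.alive)
        count = min(self.copies, len(self.alive))
        return [self.alive[(start + i) % len(self.alive)] for i in range(count)]

    def put(self, key, value):
        # Write the value to every owner (one-sided in spirit: owners do nothing).
        for rank in self.owners(key):
            self.memory[rank][key] = value

    def get(self, key):
        # Read from the first owner that holds the key.
        for rank in self.owners(key):
            if key in self.memory[rank]:
                return self.memory[rank][key]
        raise KeyError(key)

    def fail(self, rank):
        # Drop the failed rank's memory, then re-place all surviving entries
        # so that each key is back to the configured number of copies.
        del self.memory[rank]
        self.alive.remove(rank)
        survivors = {}
        for entries in self.memory.values():
            survivors.update(entries)
        for entries in self.memory.values():
            entries.clear()
        for key, value in survivors.items():
            self.put(key, value)


if __name__ == "__main__":
    store = ReplicatedStore(num_ranks=4)
    for i in range(20):
        store.put(f"task{i}", i)
    store.fail(1)  # simulate the loss of one process
    print(all(store.get(f"task{i}") == i for i in range(20)))
```

With two copies on distinct ranks, any single failure leaves at least one copy of every key, which is why the re-placement step can rebuild full redundancy on the survivors. A real MPI version would replace the dictionary accesses with passive target RMA calls and would learn about the failed rank through ULFM rather than an explicit `fail` call.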
