CXL Shared Memory Programming: Barely Distributed and Almost Persistent
Yi Xu, Suyash Mahar, Ziheng Liu, Mingyao Shen, and Steven Swanson

TL;DR
This paper introduces a failure model for CXL shared memory systems, addressing unique data and process failures, and proposes tailored mitigation techniques inspired by PMEM solutions.
Contribution
It is the first work to define a failure model for CXL shared memory and to propose specific solutions for its unique failure modes.
Findings
Defined a failure model for CXL shared memory
Proposed new failure mitigation techniques for data failures
Compared CXL failures with PMEM failure models
Abstract
While Compute Express Link (CXL) enables support for cache-coherent shared memory among multiple nodes, it also introduces new types of failures: processes can fail before data does, or data might fail before a process does. The lack of a failure model for CXL-based shared memory makes it challenging to understand and mitigate these failures. To solve these challenges, in this paper, we describe a model that categorizes and handles failures in CXL-based shared memory as data failures and process failures. Data failures in CXL-based shared memory render data inaccessible or inconsistent for a currently running application. We argue that such failures are unlike data failures in distributed storage systems and require CXL-specific handling. To address this, we look into traditional data failure mitigation techniques like erasure coding and replication and propose new solutions to better handle…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Computability, Logic, AI Algorithms
