Towards CXL Resilience to CPU Failures
Antonis Psistakis, Burak Ocalan, Chloe Alverti, Fabien Chaix, Ramnatthan Alagappan, and Josep Torrellas

TL;DR
This paper introduces ReCXL, an extension to the CXL specification that adds resilience to node failures through replication and logging, enabling fault-tolerant shared-memory computing at a reported slowdown of 30%.
Contribution
ReCXL extends the CXL standard to support fault tolerance against node failures by implementing replication and logging mechanisms for recovery.
Findings
ReCXL achieves fault tolerance with only a 30% slowdown.
The system effectively recovers application state after node failures.
ReCXL maintains data consistency through replication and logging.
Abstract
Compute Express Link (CXL) 3.0 and beyond allows the compute nodes of a cluster to share data with hardware cache coherence and at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but introduces new resilience challenges: a node failure leads to the loss of the dirty data in its caches, corrupting application state. Unfortunately, the CXL specification does not consider processor failures. Moreover, when a component fails, the specification tries to isolate it and continue application execution; there is no attempt to bring the application to a consistent state. To address these limitations, this paper extends the CXL specification to be resilient to node failures, and to correctly recover the application after node failures. We call the system ReCXL. To handle the failure of nodes, ReCXL augments the coherence transaction of a write with…
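The failure mode described in the abstract, and the general idea of countering it with replication and logging, can be illustrated with a toy model. This is only a sketch under stated assumptions: the abstract is truncated before it describes ReCXL's actual protocol, so nothing here should be read as the paper's design. The names `Node`, `ToyCluster`, `write`, `fail`, `recover`, and `run` are hypothetical, and replicating every write to every other node is a simplification chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node with a write-back cache."""
    name: str
    dirty: dict = field(default_factory=dict)        # addr -> value not yet written back
    replica_log: list = field(default_factory=list)  # (owner, addr, value) entries from peers


class ToyCluster:
    """Toy model of nodes sharing one memory; not the ReCXL protocol."""

    def __init__(self, names, replicate):
        self.memory = {}                              # shared memory: addr -> value
        self.nodes = {n: Node(n) for n in names}
        self.replicate = replicate

    def write(self, writer, addr, value):
        # The new value lives only in the writer's cache (dirty line).
        self.nodes[writer].dirty[addr] = value
        if self.replicate:
            # Assumption: log the write on every other node.
            for name, node in self.nodes.items():
                if name != writer:
                    node.replica_log.append((writer, addr, value))

    def fail(self, name):
        # The node disappears, taking its dirty lines with it.
        del self.nodes[name]

    def recover(self, failed):
        # Replay, in order, the failed node's writes from one survivor's log.
        survivor = next(iter(self.nodes.values()))
        for owner, addr, value in survivor.replica_log:
            if owner == failed:
                self.memory[addr] = value


def run(replicate):
    cluster = ToyCluster(["A", "B", "C"], replicate)
    cluster.memory[0x40] = 1          # old value in shared memory
    cluster.write("A", 0x40, 2)       # A holds the only up-to-date copy
    cluster.fail("A")
    cluster.recover("A")
    return cluster.memory[0x40]


print(run(replicate=False))  # stale value survives: the write is lost
print(run(replicate=True))   # the logged write is replayed
```

Without replication the shared memory keeps the stale value after node A fails; with the replicated log, a survivor can replay A's write. The model ignores ordering across nodes, log truncation, and coherence states, all of which a real design would have to address.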
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
