MODC: Resilience for disaggregated memory architectures using task-based programming
Kimberly Keeton, Sharad Singhal, Haris Volos, Yupu Zhang, Ramesh Chandra Chaurasiya, Clarete Riana Crasta, Sherin T George, Nagaraju K N, Mashood Abdulla K, Kavitha Natarajan, Porno Shome, Sanish Suresh

TL;DR
MODC introduces a task-based programming framework for disaggregated memory architectures, enhancing resilience by leveraging failure independence and outperforming traditional checkpointing methods.
Contribution
This paper presents MODC, a novel framework that adapts task-based programming and fault tolerance techniques to disaggregated memory architectures for improved resilience.
Findings
MODC outperforms checkpoint-based resilience methods in experiments.
Disaggregated memory architectures benefit from task-based resilience strategies.
MODC demonstrates effective fault tolerance in disaggregated systems.
Abstract
Disaggregated memory architectures provide benefits to applications beyond traditional scale out environments, such as independent scaling of compute and memory resources. They also provide an independent failure model, where computations or the compute nodes they run on may fail independently of the disaggregated memory; thus, data that's resident in the disaggregated memory is unaffected by the compute failure. Blind application of traditional techniques for resilience (e.g., checkpoints or data replication) does not take advantage of these architectures. To demonstrate the potential benefit of these architectures for resilience, we develop Memory-Oriented Distributed Computing (MODC), a framework for programming disaggregated architectures that borrows and adapts ideas from task-based programming models, concurrent programming techniques, and lock-free data structures. This framework…
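To make the independent failure model concrete, here is a minimal single-process sketch in Python. It is not MODC's API: the abstract is truncated before it describes the framework's mechanisms, so every name below (`SharedMemory`, `run_worker`, `recover`, `WorkerCrash`) is hypothetical. The sketch assumes only what the abstract states, namely that data resident in disaggregated memory survives a compute failure. Task state and results are kept in a structure that outlives the worker, so after a simulated crash only the interrupted task is re-run rather than everything since a checkpoint.

```python
# Illustrative sketch only; all names are hypothetical, not MODC's API.

class SharedMemory:
    """Stands in for disaggregated memory: it outlives any single worker."""
    def __init__(self, tasks):
        # Task status and results live here, not in worker-local memory.
        self.state = {t: "pending" for t in tasks}
        self.results = {}

class WorkerCrash(Exception):
    """Simulates the failure of a compute node."""

def run_worker(mem, fn, crash_on=None):
    """Run every pending task; optionally simulate a crash on one task."""
    for task, status in list(mem.state.items()):
        if status != "pending":
            continue
        mem.state[task] = "running"
        if task == crash_on:
            raise WorkerCrash(task)   # worker dies; shared memory survives
        mem.results[task] = fn(task)  # write keyed by task, safe to repeat
        mem.state[task] = "done"

def recover(mem):
    """Return tasks left 'running' by a dead worker to the pending pool."""
    for task, status in list(mem.state.items()):
        if status == "running":
            mem.state[task] = "pending"

calls = []  # records which tasks were actually computed

def square(x):
    calls.append(x)
    return x * x

mem = SharedMemory(tasks=[1, 2, 3, 4])
try:
    run_worker(mem, square, crash_on=3)  # first worker fails on task 3
except WorkerCrash:
    recover(mem)
run_worker(mem, square)                  # a replacement worker finishes
```

In this toy run, tasks 1 and 2 are computed once by the first worker and are not repeated after the crash; the replacement worker picks up only tasks 3 and 4. A real system would also need atomic task claiming across workers, which this single-threaded sketch omits.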
Taxonomy
Topics: Distributed systems and fault tolerance · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
