Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs
Xiang Fu, Shiman Meng, Weiping Zhang, Luanzheng Guo, Kento Sato, Dong H. Ahn, Ignacio Laguna, Gregory L. Lee, Martin Schulz

TL;DR
This paper introduces two recording schemes, Distributed Clock (DC) and Distributed Epoch (DE), that make record-and-replay of multi-threaded OpenMP programs more efficient by eliminating excessive thread synchronization.
Contribution
The paper presents new distributed recording schemes that eliminate excessive synchronization in OpenMP replay, enabling scalable and efficient deterministic replay of multi-threaded applications.
Findings
2-5x more efficient than traditional approaches that synchronize on every shared-memory access (evaluated with ReOMP on representative HPC applications)
Can be integrated with MPI replay tools with minimal overhead
Successfully applied to HPC applications and MPI+OpenMP scenarios
Abstract
After all these years and all these other shared-memory programming frameworks, OpenMP is still the most popular one. However, its greater levels of non-deterministic execution make debugging and testing more challenging. The ability to record and deterministically replay the program execution is key to addressing this challenge. However, scalably replaying OpenMP programs is still an unresolved problem. In this paper, we propose two novel techniques that use Distributed Clock (DC) and Distributed Epoch (DE) recording schemes to eliminate excessive thread synchronization for OpenMP record and replay. Our evaluation on representative HPC applications with ReOMP, which we used to realize DC and DE recording, shows that our approach is 2-5x more efficient than traditional approaches that synchronize on every shared-memory access. Furthermore, we demonstrate that our approach can be easily…
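The abstract does not spell out how clock-based recording works, so the following is only a minimal illustrative sketch of the general idea, not ReOMP's implementation. It assumes that in record mode each thread takes a ticket from a shared logical clock at every shared access and stores it in its own per-thread log, and that in replay mode each thread waits until the clock reaches its next recorded ticket. The names `ClockRecorder`, `ClockReplayer`, and `run` are hypothetical, and a Python lock stands in for an atomic fetch-and-add.

```python
import threading


class ClockRecorder:
    """Record mode (illustrative): per-thread logs of logical clock tickets."""

    def __init__(self, num_threads):
        self.clock = 0
        self.lock = threading.Lock()  # stands in for an atomic fetch-and-add
        self.logs = [[] for _ in range(num_threads)]

    def access(self, tid, action):
        with self.lock:
            ticket = self.clock
            self.clock += 1
            action()  # the shared access, ordered by its ticket
        # Logging happens outside the lock, into a thread-private list.
        self.logs[tid].append(ticket)


class ClockReplayer:
    """Replay mode (illustrative): a thread proceeds only on its recorded ticket."""

    def __init__(self, logs):
        self.clock = 0
        self.cond = threading.Condition()
        self.iters = [iter(log) for log in logs]

    def access(self, tid, action):
        ticket = next(self.iters[tid])
        with self.cond:
            self.cond.wait_for(lambda: self.clock == ticket)
            action()
            self.clock += 1
            self.cond.notify_all()


def run(gate, num_threads, accesses_per_thread):
    """Run worker threads through `gate` and return the observed access order."""
    order = []

    def worker(tid):
        for _ in range(accesses_per_thread):
            gate.access(tid, lambda: order.append(tid))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


recorder = ClockRecorder(num_threads=4)
recorded = run(recorder, 4, 50)
replayed = run(ClockReplayer(recorder.logs), 4, 50)
```

In this sketch the replayed access order matches the recorded one even though the thread schedule differs between runs. The epoch-based (DE) scheme is not sketched here because the truncated abstract does not describe it in enough detail.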
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Advanced Data Storage Technologies
