Distributed asynchronous convergence detection without detection protocol
Guillaume Gbikpi-Benissan, Frederic Magoules

TL;DR
This paper studies convergence detection for asynchronous parallel iterative processes. It analyzes the snapshot-based approach and reports experiments suggesting that, on a stable single-site supercomputer, a reliable global residual error may be obtained with plain non-blocking reduction operations, without a dedicated detection protocol.
Contribution
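It formulates convergence detection as a global solution identification problem, analyzes snapshot-based computation of the global residual error, and experimentally investigates how reliable a global residual error is when computed without any prior detection mechanism, using a recently developed approximate snapshot protocol as the reliable reference.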
Findings
The snapshot-based approach is the only one that allows exact computation of the global residual error.
Non-blocking reduction operations save time in stable environments.
Single-site high-performance computing platforms may be stable enough to support simplified convergence detection.
Abstract
In this paper, we address the problem of detecting the moment when an ongoing asynchronous parallel iterative process can be terminated to provide a sufficiently precise solution to a fixed-point problem being solved. Formulating the detection problem as a global solution identification problem, we analyze the snapshot-based approach, which is the only one that allows for exact global residual error computation. From a recently developed approximate snapshot protocol providing a reliable global residual error, we experimentally investigate here, as well, the reliability of a global residual error computed without any prior particular detection mechanism. Results on a single-site supercomputer successfully show that such high-performance computing platforms possibly provide computational environments stable enough to allow for simply resorting to non-blocking reduction operations for…
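The idea of computing a global residual without a detection protocol can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the tridiagonal test system, the activity probability, and the posting window are all hypothetical. Each simulated process owns one component of a Jacobi fixed-point iteration, updates it at its own pace, and contributes its latest local residual to a reduction at an arbitrary moment, so the summed residual mixes values from different iterations (in a real MPI code this role would typically be played by a non-blocking reduction such as `MPI_Iallreduce`). The exact residual is then evaluated at the detection time for comparison.

```python
import math
import random


def jacobi_component(i, x, A, b):
    """Jacobi fixed-point map for component i of the system A x = b."""
    off_diag = sum(A[i][j] * x[j] for j in range(len(x)) if j != i)
    return (b[i] - off_diag) / A[i][i]


def exact_residual(x, A, b):
    """Euclidean norm of x - T(x), evaluated on one consistent global state."""
    return math.sqrt(
        sum((x[i] - jacobi_component(i, x, A, b)) ** 2 for i in range(len(x)))
    )


def simulate(n=8, tol=1e-8, window=3, seed=0, max_ticks=10_000):
    """Simulate asynchronous iterations with an unprotected residual reduction.

    Hypothetical setup: a strictly diagonally dominant tridiagonal matrix,
    for which asynchronous Jacobi iterations are known to converge.
    Returns (tick, estimated_residual, exact_residual_at_detection).
    """
    rng = random.Random(seed)
    A = [
        [4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
        for i in range(n)
    ]
    b = [1.0] * n
    x = [0.0] * n
    # Squared size of each process's most recent update (its local residual).
    last_sq = [math.inf] * n
    # Contributions already posted to the reduction currently in flight.
    posted = {}

    for tick in range(max_ticks):
        # Each process is active with probability 0.7 and updates in place.
        for i in range(n):
            if rng.random() < 0.7:
                new_value = jacobi_component(i, x, A, b)
                last_sq[i] = (new_value - x[i]) ** 2
                x[i] = new_value

        # Processes join the reduction at different ticks, so the sum is
        # built from an inconsistent (non-snapshot) global state.
        for i in range(n):
            if i not in posted and rng.random() < 1.0 / window:
                posted[i] = last_sq[i]

        # The reduction completes once every process has contributed.
        if len(posted) == n:
            estimate = math.sqrt(sum(posted.values()))
            posted = {}
            if estimate < tol:
                return tick, estimate, exact_residual(x, A, b)

    raise RuntimeError("convergence was not detected within max_ticks")


if __name__ == "__main__":
    tick, estimate, exact = simulate()
    print(f"detected at tick {tick}: estimate={estimate:.3e}, exact={exact:.3e}")
```

In this toy setting the iteration is a contraction and all processes progress at similar rates, which is the kind of stable environment the abstract refers to. With strongly unbalanced processes or long communication delays, the unprotected estimate could drift further from the exact residual, which is the situation snapshot protocols are designed to handle.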
