Prime Collective Communications Library -- Technical Report
Michael Keiblinger, Mario Sieg, Jack Min Ong, Sami Jaghouar, Johannes Hagemann

TL;DR
The Prime Collective Communications Library (PCCL) is a fault-tolerant, dynamic, and efficient collective communication library designed for distributed machine learning over the internet, supporting peer churn, high bandwidth, and concurrent operations.
Contribution
PCCL introduces a novel fault-tolerant programming model with dynamic peer management, efficient collective operations, and support for concurrent communications in distributed ML workloads.
Findings
Achieves up to 45 Gbit/s bandwidth across Europe.
Successfully handles peer churn with exact state parity.
Supports concurrent collective operations with minimal overhead.
Abstract
This report presents the Prime Collective Communications Library (PCCL), a novel fault-tolerant collective communication library designed for distributed ML workloads over the public internet. PCCL introduces a new programming model that enables dynamic peer joining and failure recovery. The library implements efficient collective operations like all-reduce while providing robust fault tolerance mechanisms that allow the system to continue operating even when peers fail or join during ongoing operations. We demonstrate that PCCL's design enables practical solutions to dynamic membership challenges in workloads with repeated operations and deterministic state advancement. Our implementation passes extensive stress tests across all major operating systems, showing reliable operation even under rapid peer churn and concurrent collective operations. By dispatching to multiple connections,…
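To make the all-reduce operation named in the abstract concrete, here is a minimal single-process sketch of a sum all-reduce over a ring of peers. It is an illustration only: this excerpt does not say which algorithm PCCL uses, and none of the names below (`ring_all_reduce`, `all_reduce_surviving`) come from the PCCL API. The recovery policy shown, which restarts the reduction from the original inputs of the peers still alive, is likewise an assumption chosen for simplicity.

```python
from typing import Dict, List, Set


def ring_all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over a ring of len(buffers) peers.

    Hypothetical helper for illustration; no networking is involved.
    Returns one result buffer per peer; all results are identical.
    """
    n = len(buffers)
    if n == 0:
        return []
    size = len(buffers[0])
    if any(len(b) != size for b in buffers):
        raise ValueError("all peers must contribute buffers of equal length")

    data = [list(b) for b in buffers]  # copy so inputs stay untouched
    # Chunk c covers indices [bounds[c], bounds[c + 1]).
    bounds = [(c * size) // n for c in range(n + 1)]

    # Phase 1, reduce-scatter: in step s, peer r sends chunk (r - s) % n
    # to peer (r + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - s) % n
            lo, hi = bounds[c], bounds[c + 1]
            sends.append(((r + 1) % n, lo, data[r][lo:hi]))
        for dst, lo, payload in sends:
            for i, value in enumerate(payload):
                data[dst][lo + i] += value

    # Peer r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: in step s, peer r forwards chunk (r + 1 - s) % n
    # to peer (r + 1) % n, which overwrites its copy of that chunk.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - s) % n
            lo, hi = bounds[c], bounds[c + 1]
            sends.append(((r + 1) % n, lo, data[r][lo:hi]))
        for dst, lo, payload in sends:
            data[dst][lo:lo + len(payload)] = payload

    return data


def all_reduce_surviving(
    buffers: Dict[str, List[float]], dropped: Set[str]
) -> Dict[str, List[float]]:
    """Assumed recovery policy: rerun the reduce over the peers still alive."""
    alive = [peer for peer in buffers if peer not in dropped]
    results = ring_all_reduce([buffers[peer] for peer in alive])
    return dict(zip(alive, results))


if __name__ == "__main__":
    inputs = {
        "a": [1, 2, 3, 4, 5],
        "b": [10, 20, 30, 40, 50],
        "c": [100, 200, 300, 400, 500],
    }
    print(all_reduce_surviving(inputs, dropped=set()))
    print(all_reduce_surviving(inputs, dropped={"c"}))
```

Each peer sends and receives roughly 2(n-1)/n times the buffer size in total, which is why ring-style schedules are a common choice when per-link bandwidth is the bottleneck. A real implementation would also have to detect failures mid-operation and agree on the surviving peer set, which this sketch does not model.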
Taxonomy
Topics: Teaching and Learning Programming
