Best-Effort Communication Improves Performance and Scales Robustly on Conventional Hardware
Matthew Andres Moreno, Charles Ofria

TL;DR
This study demonstrates that fully-asynchronous, best-effort communication can enhance performance and scalability on commercial HPC hardware, maintaining stable quality of service even at high process counts and in the presence of faults.
Contribution
The paper provides empirical evidence that best-effort communication strategies improve HPC performance and scalability, along with a detailed analysis of quality-of-service metrics at scale.
Findings
Best-effort communication improves computational throughput at high CPU counts.
Quality of service remains stable across scale and under high communication load.
Faulty nodes have minimal impact on overall performance and quality of service.
Abstract
Here, we test the performance and scalability of fully-asynchronous, best-effort communication on existing, commercially-available HPC hardware. A first set of experiments tested whether best-effort communication strategies can benefit performance compared to the traditional perfect communication model. At high CPU counts, best-effort communication improved both the number of computational steps executed per unit time and the solution quality achieved within a fixed-duration run window. Under the best-effort model, characterizing the distribution of quality of service across processing components and over time is critical to understanding the actual computation being performed. Additionally, a complete picture of scalability under the best-effort model requires analysis of how such quality of service fares at scale. To answer these questions, we designed and measured a suite of…
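To make the best-effort idea concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes a bounded mailbox in which sends never block and are simply dropped when the buffer is full, while the receiver takes only the newest message it finds. The names `BestEffortChannel`, `try_send`, `recv_latest`, `delivery_fraction`, and `simulate` are hypothetical, and the delivery fraction is just one illustrative quality-of-service measure, not necessarily one of the metrics used in the paper.

```python
from collections import deque


class BestEffortChannel:
    """Hypothetical bounded mailbox: sends never wait, and may be dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.attempted = 0
        self.dropped = 0

    def try_send(self, message):
        """Enqueue if there is room; otherwise drop the message and move on."""
        self.attempted += 1
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(message)
        return True

    def recv_latest(self):
        """Drain the buffer and return the newest message, or None if empty."""
        latest = None
        while self.buffer:
            latest = self.buffer.popleft()
        return latest

    def delivery_fraction(self):
        """Fraction of attempted sends that were accepted (1.0 if none yet)."""
        if self.attempted == 0:
            return 1.0
        return 1.0 - self.dropped / self.attempted


def simulate(steps, sends_per_step, recv_every, capacity):
    """Sender posts every step; receiver polls only every `recv_every` steps."""
    channel = BestEffortChannel(capacity)
    last_seen = None
    for step in range(steps):
        for i in range(sends_per_step):
            channel.try_send((step, i))  # never blocks the compute loop
        if step % recv_every == 0:
            newest = channel.recv_latest()
            if newest is not None:
                last_seen = newest
    return channel.delivery_fraction(), last_seen
```

Calling `simulate(steps=4, sends_per_step=1, recv_every=2, capacity=1)` models a receiver that polls half as often as the sender posts. The sender never stalls, at the cost of some messages being discarded, which is the trade-off that quality-of-service measurements are meant to characterize.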
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
