Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
Stefan Behfar, Richard Mortier

TL;DR
This paper introduces AW-PSP, an extension of Probabilistic Synchronous Parallel (PSP) that dynamically adjusts node sampling in federated learning to improve robustness and fairness under correlated device failures.
Contribution
It proposes a novel availability-weighted sampling method that accounts for correlated failures, enhancing fairness and robustness in federated learning systems.
Findings
AW-PSP improves robustness to failures.
It increases label coverage in training.
It reduces fairness variance among nodes.
Abstract
Probabilistic Synchronous Parallel (PSP) is a technique used in distributed learning systems to reduce synchronization bottlenecks by sampling a subset of participating nodes per round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes that device behavior is static and that different devices are independent. This can lead to unfair distributed synchronization: highly available nodes dominate training, while nodes that are often unavailable rarely participate, so their data may be missed. If both data distribution and node availability are correlated with the device, then both PSP and standard FL algorithms will suffer from persistent under-representation of certain classes or groups, resulting in…
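To make the fairness problem described in the abstract concrete, here is a minimal Python sketch. It is not the paper's AW-PSP algorithm, whose details are not given in this summary. It assumes a toy model in which each node is online independently with a fixed probability, and it compares uniform sampling among online nodes with a hypothetical inverse-availability weighting. The function names (`sample_round`, `participation_counts`), the weighting rule, and all parameter values are assumptions made for illustration only.

```python
import random
import statistics


def sample_round(availability, k, weighted, rng):
    """Pick up to k nodes from those online in this round.

    availability maps node id -> probability of being online (toy assumption:
    nodes fail independently, which the paper itself argues is unrealistic).
    """
    online = [n for n, p in availability.items() if rng.random() < p]
    if len(online) <= k:
        return online
    if not weighted:
        # Baseline: uniform sampling among online nodes.
        return rng.sample(online, k)
    # Hypothetical weighting: weight w = 1 / availability, so rarely-online
    # nodes are favoured when they do show up. Weighted sampling without
    # replacement via Efraimidis-Spirakis keys u ** (1 / w) = u ** availability.
    keys = {n: rng.random() ** availability[n] for n in online}
    return sorted(online, key=keys.get, reverse=True)[:k]


def participation_counts(availability, k, rounds, weighted, seed=0):
    """Count how often each node is selected over a number of rounds."""
    rng = random.Random(seed)
    counts = {n: 0 for n in availability}
    for _ in range(rounds):
        for n in sample_round(availability, k, weighted, rng):
            counts[n] += 1
    return counts


if __name__ == "__main__":
    # Five highly available nodes and five rarely available ones (made-up values).
    availability = {i: 0.9 for i in range(5)}
    availability.update({i: 0.3 for i in range(5, 10)})
    for weighted in (False, True):
        counts = participation_counts(availability, k=3, rounds=5000, weighted=weighted)
        label = "weighted" if weighted else "uniform"
        print(label, counts, "variance:", statistics.pvariance(counts.values()))
```

In this toy setting, the variance of per-node participation counts serves as a rough stand-in for the "fairness variance" mentioned in the findings. The sketch does not model correlated failures or label coverage, both of which are central to the paper.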
