FedQueue: Queue-Aware Federated Learning for Cross-Facility HPC Training
Yijiang Li, Emon Dey, Zilinghan Li, Krishnan Raghavan, Ravi Madduri, Kibaek Kim

TL;DR
FedQueue is a queue-aware federated learning protocol for HPC environments that accounts for batch-scheduler delays and facility heterogeneity to improve training efficiency and convergence.
Contribution
FedQueue introduces queue-delay prediction, cutoff-based admission, and staleness-aware aggregation, backed by a convergence proof and measured improvements in a real-world deployment.
Findings
20.5% improvement over baseline algorithms in real deployment.
34% reduction in time to reach target accuracy under high queue variance.
Bounded staleness achieved with high probability despite queue prediction errors.
Abstract
Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers that dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation. FedQueue (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove convergence for non-convex objectives under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. Real-world cross-facility deployment of FedQueue…
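The three components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the exponential-moving-average predictor, the 1 / (1 + staleness) weighting, and every name here (Update, QueueDelayPredictor, admit, aggregate) are assumptions chosen only to make the idea concrete.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Update:
    facility: str
    delta: list[float]  # model update as a flat vector (hypothetical format)
    base_round: int     # global round the facility started training from
    arrival: float      # seconds after the round opened


class QueueDelayPredictor:
    """Online queue-delay estimate via an exponential moving average (assumed)."""

    def __init__(self, alpha: float = 0.3, initial: float = 60.0):
        self.alpha = alpha
        self.estimate = initial

    def observe(self, delay: float) -> None:
        # Blend the newly observed delay into the running estimate.
        self.estimate = (1 - self.alpha) * self.estimate + self.alpha * delay

    def local_steps(self, round_budget: float, step_time: float, max_steps: int) -> int:
        # Budget local work from the time left after the predicted queue wait.
        remaining = max(0.0, round_budget - self.estimate)
        return max(1, min(max_steps, int(remaining // step_time)))


def admit(updates: list[Update], cutoff: float) -> tuple[list[Update], list[Update]]:
    """Split updates into those that made the cutoff and those buffered for later."""
    on_time = [u for u in updates if u.arrival <= cutoff]
    late = [u for u in updates if u.arrival > cutoff]
    return on_time, late


def aggregate(model: list[float], updates: list[Update], current_round: int) -> list[float]:
    """Apply a weighted average of updates, down-weighting stale ones (assumed rule)."""
    if not updates:
        return list(model)
    weights = [1.0 / (1 + current_round - u.base_round) for u in updates]
    total = sum(weights)
    new_model = list(model)
    for w, u in zip(weights, updates):
        for i, d in enumerate(u.delta):
            new_model[i] += (w / total) * d
    return new_model
```

In this sketch, updates returned in the late list would be carried into the next round's call to aggregate, where their larger staleness lowers their weight. The cutoff limits how long a round waits, which is one way to keep staleness bounded.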
