Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers
Yijiang Li, Zilinghan Li, Kyle Chard, Ian Foster, Todd Munson, Ravi Madduri, Kibaek Kim

TL;DR
This paper develops a scalable federated learning framework for scientific models across multiple supercomputers, demonstrating practical deployment, analyzing heterogeneity impacts, and fine-tuning a large language model for chemistry applications.
Contribution
It introduces a comprehensive cross-facility federated learning framework tailored for heterogeneous HPC environments, addressing deployment challenges and performance considerations.
Findings
Federated learning across HPC facilities is practically feasible.
Heterogeneity significantly impacts training performance.
Scheduler-aware algorithms are crucial for future HPC federated learning deployments.
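The link between heterogeneity and scheduler awareness can be illustrated with a minimal sketch. It assumes a synchronous FL round, in which the server waits for every facility before aggregating, so the round lasts as long as the slowest facility's queue wait plus local training time. The `Facility` class, the site names, and all timing values are hypothetical illustrations, not measurements or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class Facility:
    name: str            # hypothetical site label
    queue_wait_s: float  # time the job waits in the batch scheduler queue
    compute_s: float     # local training time for one round


def synchronous_round_time(facilities):
    """A synchronous round ends only when the slowest facility reports back."""
    return max(f.queue_wait_s + f.compute_s for f in facilities)


def idle_time(facilities):
    """Seconds each facility spends waiting for the slowest one."""
    round_s = synchronous_round_time(facilities)
    return {f.name: round_s - (f.queue_wait_s + f.compute_s) for f in facilities}


# Illustrative numbers only: site_b computes fastest but waits longest in queue.
facilities = [
    Facility("site_a", queue_wait_s=30.0, compute_s=120.0),
    Facility("site_b", queue_wait_s=600.0, compute_s=90.0),
    Facility("site_c", queue_wait_s=5.0, compute_s=200.0),
]

round_s = synchronous_round_time(facilities)
idle = idle_time(facilities)
print(f"round time: {round_s:.0f} s")
for name, seconds in idle.items():
    print(f"{name} idle for {seconds:.0f} s")
```

In this toy setting the facility with the shortest compute time still determines the round length because of its queue wait, which is the kind of effect a scheduler-aware algorithm would need to account for.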
Abstract
Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond those of cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We…
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Big Data and Digital Economy · Stochastic Gradient Optimization Techniques
