Worldwide Federated Training of Language Models
Alex Iacob, Lorenzo Sani, Bill Marino, Preslav Aleksandrov, William F. Shen, and Nicholas Donald Lane

TL;DR
This paper introduces WorldLM, a federated learning system for training language models globally that handles heterogeneity and privacy concerns, outperforming standard federated methods and approaching the performance of fully local (personalized) models.
Contribution
The paper proposes a novel federated learning framework with federations of federations, partial model localization, and adaptive information sharing for global language model training.
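The "federations of federations" idea with partial model localization can be illustrated with a minimal sketch. This is not the authors' implementation: plain sample-weighted averaging stands in for the paper's adaptive information sharing, and every name here (`Node`, `aggregate`, `broadcast`, the `backbone` and `local` parameter groups) is hypothetical. The sketch only shows shared parameters being averaged up a tree of federations and pushed back down, while localized parameters never leave their owner.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter-group name -> flat weight vector


@dataclass
class Node:
    """A client (leaf) or a federation (inner node). Hypothetical structure."""
    name: str
    num_samples: int = 0  # set for leaves, computed for federations
    params: Params = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def weighted_average(vectors: List[List[float]], weights: List[int]) -> List[float]:
    """Element-wise average of equally sized vectors, weighted by sample count."""
    total = float(sum(weights))
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


def aggregate(node: Node, shared_keys: List[str]) -> int:
    """Bottom-up pass: each federation averages its children's shared groups.

    Groups not listed in shared_keys are treated as localized and left untouched.
    Returns the number of samples in the subtree rooted at node.
    """
    if not node.children:
        return node.num_samples
    counts = [aggregate(child, shared_keys) for child in node.children]
    for key in shared_keys:
        node.params[key] = weighted_average(
            [child.params[key] for child in node.children], counts
        )
    node.num_samples = sum(counts)
    return node.num_samples


def broadcast(node: Node, shared_keys: List[str]) -> None:
    """Top-down pass: copy the shared groups to every descendant."""
    for child in node.children:
        for key in shared_keys:
            child.params[key] = list(node.params[key])
        broadcast(child, shared_keys)


def build_example() -> Node:
    """Two regional federations under one root; all values are made up."""
    a = Node("client_a", 100, {"backbone": [1.0, 1.0], "local": [0.1]})
    b = Node("client_b", 300, {"backbone": [3.0, 5.0], "local": [0.2]})
    c = Node("client_c", 100, {"backbone": [5.0, 0.0], "local": [0.3]})
    region_1 = Node("region_1", children=[a, b])
    region_2 = Node("region_2", children=[c])
    return Node("root", children=[region_1, region_2])


if __name__ == "__main__":
    root = build_example()
    aggregate(root, shared_keys=["backbone"])
    broadcast(root, shared_keys=["backbone"])
    print(root.params["backbone"])
    print(root.children[0].children[0].params)
```

In this toy setup each regional federation could choose its own `shared_keys`, which is one simple way to picture the per-federation autonomy described in the abstract; the paper's actual mechanism may differ.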
Findings
WorldLM outperforms standard federations by up to 1.91x in language modeling tasks.
It approaches the performance of fully local models.
It maintains these advantages under privacy-preserving techniques.
Abstract
The reliance of language model training on massive amounts of computation and vast datasets scraped from potentially low-quality, copyrighted, or sensitive data has come into question practically, legally, and ethically. Federated learning provides a plausible alternative by enabling previously untapped data to be voluntarily gathered from collaborating organizations. However, when scaled globally, federated learning requires collaboration across heterogeneous legal, security, and privacy regimes while accounting for the inherent locality of language data; this further exacerbates the established challenge of federated statistical heterogeneity. We propose a Worldwide Federated Language Model Training (WorldLM) system based on federations of federations, where each federation has the autonomy to account for factors such as its industry, operating jurisdiction, or competitive…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. This work conducts extensive experiments to show its effectiveness.
1. Section 3.1 mentions that the children under root $q$ have much more critical data heterogeneity than those under parent $p$. I think the authors should justify that this really happens. In my opinion, the setting in clustered FL usually assumes that the clients within a cluster have very similar data distributions that are quite distinct from those of clients outside the cluster [1]. However, in this case, clients may not have such information, i.e., they don't know the data distribution with each other
1. To address data heterogeneity, the authors propose attention-based aggregation and residual embeddings, which are very suitable for language model training.
2. The experiments are well designed, and different aspects (such as FL vs. centralized training, privacy, and local task performance) are covered. These results are promising.
1) Scale can be large. It is great that the authors proposed an FL method designed specifically for language models (LMs). As we know, an LM becomes useful when its scale is very large, i.e., an LLM. Thus, whether the proposed method can be scaled up is really important. Currently, the authors conduct experiments with a largest model size of 400M parameters, which is still small. As experiments at large scale might be difficult, could the authors discuss more details when the size of LM reaches billion scale, gi
- The idea seems to make sense, although its practical application could be better motivated
- I would appreciate it if the authors could motivate the problem more concretely, rather than in a high-level way.
- I am a little confused by the comparison and the key insights we can get from these results. (see Questions)
* The paper addresses an important practical problem in distributed LM training.
* The attention-based aggregation mechanism is an interesting approach to handling heterogeneous data (though it does not seem to be a complete solution, it suggests an interesting direction).
* Experimental design and evaluation strategy:
  * Evaluation across multiple model sizes (75M-400M parameters). Scaling experiments show competitive performance with standard FL.
  * Comprehensive testing combining perplexit
1. Presentation and Motivation: The paper's introduction and related work sections attempt to cover both technical and policy aspects of federated learning, but in doing so, fail to provide a clear technical foundation. While the data regulation context is interesting to learn about, it comes at the expense of a precise technical exposition. At times, the paper mentions low-level technical concepts (e.g., RingAllReduce, local SGD) without proper explanation. The presentation would benefit from a
1. The federations-of-federations approach allows for adaptable collaboration across various jurisdictions, making it feasible to integrate global, region-specific, or industry-specific data in a way that respects privacy constraints.
2. The backbone with personalized key layers effectively captures and adapts to local variations in data, enhancing performance in heterogeneous settings.
3. WorldLM is robust in applying differential privacy, even where traditional federated learning might strug
1. The method shows diminished effectiveness when data within a federation lacks inherent similarity, suggesting a need for improved aggregation techniques for highly diverse datasets.
2. While WorldLM works well on medium-sized language models, scaling to larger models could be resource-intensive, especially for smaller organizations with limited computational resources.
Taxonomy
Topics: Natural Language Processing Techniques · Robotics and Automated Systems · Topic Modeling
