Worldwide Federated Training of Language Models

Alex Iacob, Lorenzo Sani, Bill Marino, Preslav Aleksandrov, William F. Shen, and Nicholas Donald Lane

arXiv:2405.14446 · cs.LG · May 28, 2024 · 1 cites

Open Access 5 Reviews

TL;DR

This paper introduces WorldLM, a federated learning system for training language models globally that handles heterogeneity and privacy concerns, outperforming standard federated methods and approaching personalized models.

Contribution

The paper proposes a novel federated learning framework with federations of federations, partial model localization, and adaptive information sharing for global language model training.
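To make "federations of federations" with partial model localization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain sample-weighted average at every level of the tree and a hypothetical split of each model into a shared `backbone` and a `local_head` that never leaves its owner; the names `Params`, `average`, `count_samples`, and `aggregate_federation` are illustrative only.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average(models: List[Params], weights: List[float], shared: List[str]) -> Params:
    """Weighted average over the shared parameter groups only."""
    total = sum(weights)
    out: Params = {}
    for name in shared:
        size = len(models[0][name])
        out[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights)) / total
            for i in range(size)
        ]
    return out


def count_samples(node: dict) -> int:
    """Total number of training samples below a node of the federation tree."""
    if "children" not in node:
        return node["n"]
    return sum(count_samples(child) for child in node["children"])


def aggregate_federation(node: dict, shared: List[str]) -> Params:
    """Aggregate a tree of federations bottom-up.

    A leaf is {"params": ..., "n": ...}; an inner node is {"children": [...]}.
    Parameter groups not listed in `shared` stay local and are never averaged.
    """
    if "children" not in node:
        return {name: node["params"][name] for name in shared}
    child_models = [aggregate_federation(c, shared) for c in node["children"]]
    child_sizes = [float(count_samples(c)) for c in node["children"]]
    return average(child_models, child_sizes, shared)


# Toy tree: a root federation holding one sub-federation and one direct client.
sub_federation = {
    "children": [
        {"params": {"backbone": [1.0, 2.0], "local_head": [9.0]}, "n": 1},
        {"params": {"backbone": [3.0, 4.0], "local_head": [7.0]}, "n": 3},
    ]
}
direct_client = {"params": {"backbone": [5.0, 6.0], "local_head": [8.0]}, "n": 4}
root = {"children": [sub_federation, direct_client]}

global_backbone = aggregate_federation(root, shared=["backbone"])
```

In this toy tree only the `backbone` entry reaches the root, while every `local_head` stays with its client. The adaptive information sharing mentioned above is not modeled here.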

Findings

01. WorldLM outperforms standard federations by up to 1.91x in language modeling tasks.

02. It approaches the performance of fully local models.

03. It maintains its advantages under privacy-preserving techniques.

Abstract

The reliance of language model training on massive amounts of computation and vast datasets scraped from potentially low-quality, copyrighted, or sensitive data has come into question practically, legally, and ethically. Federated learning provides a plausible alternative by enabling previously untapped data to be voluntarily gathered from collaborating organizations. However, when scaled globally, federated learning requires collaboration across heterogeneous legal, security, and privacy regimes while accounting for the inherent locality of language data; this further exacerbates the established challenge of federated statistical heterogeneity. We propose a Worldwide Federated Language Model Training (WorldLM) system based on federations of federations, where each federation has the autonomy to account for factors such as its industry, operating jurisdiction, or competitive…

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 2

Strengths

1. This work conducts extensive experiments to show its effectiveness.

Weaknesses

1. Section 3.1 mentions that the children under root $q$ have much more critical data heterogeneity than those under parent $p$. I think the authors should justify that this really happens. In my opinion, the setting in clustered FL usually assumes that the clients within a cluster have very similar data distributions while being quite distinct from those not in the same cluster [1]. However, in this case, clients may not have such information, i.e., they do not know each other's data distributions

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. To address data heterogeneity, the authors propose attention-based aggregation and residual embeddings, which are very suitable for language model training.
2. The experiments are well designed and cover different aspects (such as FL vs. centralized training, privacy, and local task performance). These results are promising.
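The review names attention-based aggregation without giving its form. As a rough illustration only, the sketch below uses generic scaled dot-product attention over flattened parameter vectors: the parent's current vector acts as the query and each child's vector acts as both key and value. This formulation, and the names `softmax` and `attention_aggregate`, are assumptions made for the example and are not taken from the paper.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(
    query: List[float], children: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Merge child vectors, weighting each by its similarity to the query.

    Assumption: scaled dot-product scores between the parent's vector (query)
    and each child's vector (key); the child vectors also serve as values.
    """
    dim = len(query)
    scores = [
        sum(q * c for q, c in zip(query, child)) / math.sqrt(dim)
        for child in children
    ]
    weights = softmax(scores)
    merged = [
        sum(w * child[i] for w, child in zip(weights, children))
        for i in range(dim)
    ]
    return merged, weights


# Toy usage: the first child is aligned with the parent, the second is orthogonal.
parent_vector = [1.0, 0.0]
child_vectors = [[1.0, 0.0], [0.0, 1.0]]
merged_vector, attention_weights = attention_aggregate(parent_vector, child_vectors)
```

Compared with a fixed sample-count average, a similarity-based weighting of this kind lets a parent lean more on children whose parameters resemble its own, which is one plausible way to dampen the effect of heterogeneous children.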

Weaknesses

1) Scale can be large. It is great that the authors propose an FL method designed specifically for language models (LMs). As we know, LMs become useful when the scale is very large, i.e., LLMs. Thus, whether the proposed method can be scaled up is really important. Currently, the authors conduct experiments with a largest model size of 400M parameters, which is still small. As experiments at larger sizes might be difficult, could the authors discuss in more detail what happens when the LM reaches the billion-parameter scale, gi

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- The idea seems to make sense, although its practical application could be better motivated

Weaknesses

- I would appreciate it if the authors could motivate the problem more concretely, rather than in a high-level way.
- I am a little confused by the comparison and the key insights we can get from these results. (see Questions)

Reviewer 04 · Rating 5 · Confidence 2

Strengths

* The paper addresses an important practical problem in distributed LM training.
* The attention-based aggregation mechanism is an interesting approach to handling heterogeneous data (though it does not seem to be a complete solution, it suggests an interesting direction).
* Experimental design and evaluation strategy:
  * Evaluation across multiple model sizes (75M-400M parameters). Scaling experiments show competitive performance with standard FL.
  * Comprehensive testing combining perplexit

Weaknesses

1. Presentation and Motivation: The paper's introduction and related work sections attempt to cover both technical and policy aspects of federated learning, but in doing so, fail to provide a clear technical foundation. While the data regulation context is interesting to learn about, it comes at the expense of a precise technical exposition. At times, the paper mentions low-level technical concepts (e.g., RingAllReduce, local SGD) without proper explanation. The presentation would benefit from a

Reviewer 05 · Rating 5 · Confidence 3

Strengths

1. The federations-of-federations approach allows for adaptable collaboration across various jurisdictions, making it feasible to integrate global, region-specific, or industry-specific data in a way that respects privacy constraints.
2. The backbone with personalized key layers effectively captures and adapts to local variations in data, enhancing performance in heterogeneous settings.
3. WorldLM is robust in applying differential privacy, even where traditional federated learning might strug

Weaknesses

1. The method shows diminished effectiveness when data within a federation lacks inherent similarity, suggesting a need for improved aggregation techniques for highly diverse datasets.
2. While WorldLM works well on medium-sized language models, scaling to larger models could be resource-intensive, especially for smaller organizations with limited computational resources.

Taxonomy

Topics: Natural Language Processing Techniques · Robotics and Automated Systems · Topic Modeling