Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer

TL;DR
The paper introduces Branch-Train-Merge (BTM), a parallel training algorithm for large language models that trains specialized experts independently and merges them, reducing synchronization and improving efficiency.
Contribution
BTM enables embarrassingly parallel training of expert LLMs across multiple domains, eliminating multi-node synchronization and allowing scalable, efficient model updates.
Findings
BTM improves perplexity over GPT-style models at similar training costs.
Expert domain specialization is crucial for BTM's effectiveness.
Scaling BTM to 64 domains yields a model that performs as well as a Transformer LM trained with about 2.5 times more compute.
Abstract
We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and…
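The branch, train, and merge steps described in the abstract, along with the two ways of combining ELMs (parameter averaging and ensembling), can be sketched on toy data. This is a minimal illustration and not the authors' implementation: each model is reduced to a flat list of floats, `train_on_domain` is a hypothetical placeholder for real training, and the mixture weights are assumed to be supplied by the caller.

```python
from typing import Dict, List

Params = List[float]  # toy stand-in for a model's flattened parameters (assumption)


def average_params(experts: Dict[str, Params], weights: Dict[str, float]) -> Params:
    """Weighted average of expert parameters; weights are normalized to sum to 1."""
    total = sum(weights.values())
    size = len(next(iter(experts.values())))
    merged = [0.0] * size
    for name, w in weights.items():
        for i, value in enumerate(experts[name]):
            merged[i] += (w / total) * value
    return merged


def train_on_domain(params: Params, domain_shift: float) -> Params:
    """Hypothetical placeholder for further training on the new domain's data."""
    return [p + domain_shift for p in params]


def branch_train_merge(
    experts: Dict[str, Params],
    weights: Dict[str, float],
    new_domain: str,
    domain_shift: float,
) -> Dict[str, Params]:
    """Branch from a mixture of current experts, train, then merge into the set."""
    seed = average_params(experts, weights)            # branch
    new_expert = train_on_domain(seed, domain_shift)   # train, independent of other experts
    updated = dict(experts)
    updated[new_domain] = new_expert                   # merge back into the expert set
    return updated


def ensemble_next_token(
    expert_probs: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Mix per-expert next-token distributions using normalized weights."""
    total = sum(weights.values())
    vocab = len(next(iter(expert_probs.values())))
    return [
        sum(weights[n] / total * expert_probs[n][t] for n in weights)
        for t in range(vocab)
    ]


# Toy usage: two existing experts, one new domain branched from their equal mixture.
experts = {"science": [1.0, 2.0], "legal": [3.0, 4.0]}
experts = branch_train_merge(experts, {"science": 1.0, "legal": 1.0}, "medical", 0.5)
single_lm = average_params(experts, {name: 1.0 for name in experts})
```

Because each call to `branch_train_merge` reads only the current expert set and the new domain's data, several new experts could be trained at the same time without exchanging gradients, which is the property the abstract calls embarrassingly parallel.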
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization
