Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Margaret Li; Suchin Gururangan; Tim Dettmers; Mike Lewis; Tim Althoff; Noah A. Smith; Luke Zettlemoyer

arXiv:2208.03306 · cs.CL · August 8, 2022 · 25 cites

TL;DR

The paper introduces Branch-Train-Merge (BTM), a parallel training algorithm for large language models that trains specialized experts independently and merges them, reducing synchronization and improving efficiency.

Contribution

BTM enables embarrassingly parallel training of expert LLMs across multiple domains, eliminating multi-node synchronization and allowing scalable, efficient model updates.

Findings

1. BTM improves perplexity over GPT-style models at similar training costs.

2. Expert domain specialization is crucial for BTM's effectiveness.

3. Scaling BTM to 64 domains yields a model comparable to larger, more costly transformers.

Abstract

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and…
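
The branch, train, and merge steps described in the abstract can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: parameters are plain Python lists, `train_on_domain` is a hypothetical stand-in for real domain training, and the mixture and ensemble weights are assumed to be supplied by the caller (the abstract does not say how they are chosen).

```python
from typing import Dict, List

# Toy stand-in for model parameters: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def weighted_average(experts: List[Params], weights: List[float]) -> Params:
    """Parameter-wise weighted average of several experts (weights are normalized)."""
    total = sum(weights)
    if not experts or len(experts) != len(weights) or total <= 0:
        raise ValueError("need one positive-sum weight per expert")
    norm = [w / total for w in weights]
    return {
        name: [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(len(values))
        ]
        for name, values in experts[0].items()
    }


def branch(experts: List[Params], weights: List[float]) -> Params:
    """Branch: initialize a new expert from a mixture of existing experts."""
    return weighted_average(experts, weights)


def train_on_domain(params: Params, shift: float) -> Params:
    """Train (hypothetical placeholder): real training would run gradient
    updates on the new domain's data; here every value is simply shifted."""
    return {name: [v + shift for v in values] for name, values in params.items()}


def ensemble(expert_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Ensemble: weighted mixture of each expert's next-token distribution."""
    total = sum(weights)
    if len(expert_probs) != len(weights) or total <= 0:
        raise ValueError("need one positive-sum weight per expert")
    vocab_size = len(expert_probs[0])
    return [
        sum((w / total) * probs[i] for w, probs in zip(weights, expert_probs))
        for i in range(vocab_size)
    ]


# Two existing experts, each with a single two-element parameter "w".
experts = [{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}]

# Branch from an equal mixture, train independently, then merge into the set.
new_expert = train_on_domain(branch(experts, [1.0, 1.0]), shift=1.0)
experts.append(new_expert)

# Collapse the whole set into a single LM by uniform parameter averaging.
single_lm = weighted_average(experts, [1.0, 1.0, 1.0])
```

Because each call to `train_on_domain` touches only its own copy of the parameters, new experts can be trained without communicating with one another, which is the property the abstract calls embarrassingly parallel. Averaging yields one model for inference, while `ensemble` keeps all experts and mixes their output distributions instead.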

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Softmax · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization