Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria, Lin, Baptiste Rozi\`ere, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston,, Xian Li

TL;DR
The paper introduces Branch-Train-MiX (BTX), a scalable method for training multi-domain expert LLMs by asynchronously training experts and integrating them into a Mixture-of-Experts architecture, improving efficiency and accuracy.
Contribution
BTX offers a novel scalable training framework that combines expert specialization with MoE fine-tuning, outperforming previous methods in efficiency and accuracy.
Findings
BTX achieves superior accuracy-efficiency tradeoff.
Asynchronous expert training reduces communication costs.
BTX generalizes existing methods like Branch-Train-Merge and sparse upcycling.
Abstract
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
MethodsMixture of Experts
