Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
Jacob Morrison, Sanjay Adhikesaven, Akshita Bhagia, Matei Zaharia, Noah A. Smith, Sewon Min

TL;DR
This paper introduces BAR (Branch-Adapt-Route), a modular post-training approach that trains domain experts independently and composes them with a Mixture-of-Experts architecture, so individual experts can be updated at lower cost and without degrading existing capabilities.
Contribution
BAR enables scalable, independent domain expert training and composition, improving update efficiency and preventing catastrophic forgetting in language models.
Findings
BAR matches or exceeds retraining baselines in performance.
Modular training reduces the scaling of update cost from quadratic to linear.
Isolating domains prevents catastrophic forgetting during updates.
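One way to read the quadratic-versus-linear finding is with a toy cost model. The sketch below is illustrative only and is not taken from the paper: it assumes domains are added one at a time, that each domain costs the same hypothetical `unit_cost` to train, that a monolithic update reprocesses every domain seen so far, and that router training cost is negligible. The function names are invented for this example.

```python
def monolithic_cumulative_cost(num_domains: int, unit_cost: float = 1.0) -> float:
    """Total cost if adding the k-th domain reprocesses all k domains (assumed model)."""
    # 1 + 2 + ... + N units, which grows as N * (N + 1) / 2, i.e. quadratically.
    return sum(k * unit_cost for k in range(1, num_domains + 1))


def modular_cumulative_cost(num_domains: int, unit_cost: float = 1.0) -> float:
    """Total cost if each domain expert is trained once, independently (assumed model)."""
    # N units in total, i.e. linear; router training is ignored in this sketch.
    return num_domains * unit_cost


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        mono = monolithic_cumulative_cost(n)
        modular = modular_cumulative_cost(n)
        print(f"domains={n}  monolithic={mono}  modular={modular}")
```

Under these assumptions, four domains cost 10 units cumulatively with monolithic retraining versus 4 units with independent experts, and the gap widens as more domains are added.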
Abstract
Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight router training. Unlike retraining approaches that mix all domains and require full reprocessing for any update (with cost scaling quadratically), BAR enables updating individual experts independently with linear cost scaling and no degradation to existing domains. At the 7B scale, with experts for math, code, tool use, and safety, BAR achieves an overall score of 49.1 (averaged across 7…
