Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

Jacob Morrison; Sanjay Adhikesaven; Akshita Bhagia; Matei Zaharia; Noah A. Smith; Sewon Min

arXiv:2604.18473·cs.LG·April 21, 2026

TL;DR

This paper introduces BAR (Branch-Adapt-Route), a modular post-training approach that trains domain experts independently and composes them with a Mixture-of-Experts architecture, so individual experts can be updated at lower cost and without degrading the model's existing capabilities.

Contribution

BAR enables scalable, independent domain expert training and composition, improving update efficiency and preventing catastrophic forgetting in language models.

Findings

01. BAR matches or exceeds retraining baselines in performance.

02. Modular training reduces update cost scaling from quadratic to linear.

03. Isolating domains prevents catastrophic forgetting during updates.
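
The quadratic-versus-linear claim in finding 02 can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the paper's own accounting: it assumes domains are added one at a time, that each monolithic retraining run reprocesses every domain present so far, and that every domain costs the same hypothetical `unit_cost`. The function names are illustrative only.

```python
def monolithic_retraining_cost(num_domains: int, unit_cost: float = 1.0) -> float:
    """Total cost of adding domains one at a time when every update
    reprocesses all domains present so far (assumed cost model).

    The k-th update costs k * unit_cost, so the total is
    unit_cost * n * (n + 1) / 2, which grows quadratically in n.
    """
    return sum(unit_cost * k for k in range(1, num_domains + 1))


def modular_update_cost(num_domains: int, unit_cost: float = 1.0) -> float:
    """Total cost when each new domain trains only its own expert
    (assumed cost model; router training is ignored as lightweight).

    Each update costs unit_cost, so the total grows linearly in n.
    """
    return unit_cost * num_domains


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, monolithic_retraining_cost(n), modular_update_cost(n))
```

Under these assumptions, four domains cost 10 units with monolithic retraining and 4 units with modular updates; the gap widens as more domains are added.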

Abstract

Extending a fully post-trained language model with new domain capabilities is fundamentally limited by monolithic training paradigms: retraining from scratch is expensive and scales poorly, while continued training often degrades existing capabilities. We present BAR (Branch-Adapt-Route), which trains independent domain experts, each through its own mid-training, supervised finetuning, and reinforcement learning pipeline, and composes them via a Mixture-of-Experts architecture with lightweight router training. Unlike retraining approaches that mix all domains and require full reprocessing for any update (with cost scaling quadratically), BAR enables updating individual experts independently with linear cost scaling and no degradation to existing domains. At the 7B scale, with experts for math, code, tool use, and safety, BAR achieves an overall score of 49.1 (averaged across 7…
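
To make the composition step concrete, here is a minimal sketch of routing an input across independently trained experts. It is an illustration under assumptions, not the paper's implementation: the experts are stand-in functions on plain vectors, the router is a single linear layer, and the top-k softmax gating (`top_k`) is a common Mixture-of-Experts choice that the abstract does not confirm. All names (`route`, `router_weights`, `softmax`) are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    x: Sequence[float],
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    top_k: int = 2,
) -> Vector:
    """Mix the outputs of the top_k experts chosen by a linear router.

    router_weights has one row per expert. The experts are treated as
    fixed; in this sketch only router_weights would be trained.
    Assumes each expert returns a vector of the same length as x.
    """
    # One logit per expert: dot product of its router row with the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Normalize gate weights over the chosen experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for gate, index in zip(gates, chosen):
        for j, value in enumerate(experts[index](x)):
            output[j] += gate * value
    return output


if __name__ == "__main__":
    # Two toy "experts" standing in for separately trained domain experts.
    experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
    print(route([1.0, 2.0], [[1.0, 0.0], [0.0, 0.0]], experts, top_k=1))
```

In this sketch, adding or replacing an expert only changes one entry of `experts` and one row of `router_weights`, which mirrors the abstract's point that individual experts can be updated independently.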
