TL;DR
DUME is a training-free, scalable method that builds a multi-domain expert language model by dynamically reusing dense experts through a closed-form ridge regression solution.
Contribution
The paper introduces DUME, an approach for building multi-domain language models without extra training that outperforms baselines and supports adding experts dynamically.
Findings
DUME retains up to 97.6% of a dense expert model's performance in a specific domain.
On reasoning tasks, DUME surpasses the dense experts, reaching 102.1% of dense-expert performance.
The method is cost-efficient and scalable, and it can be fine-tuned for further improvements.
Abstract
Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to…
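The summary above mentions a closed-form ridge regression solution but does not say what is regressed. As a rough illustration only, the sketch below assumes the regression fits a linear router that maps hidden features to one-hot domain labels, using the standard ridge solution W = (XᵀX + λI)⁻¹XᵀY. The function names, the synthetic data, and the regularization strength `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_ridge_router(features, domain_ids, num_domains, lam=1.0):
    """Closed-form ridge regression from features to one-hot domain labels.

    Assumption: this is a generic ridge fit, not the paper's exact procedure.
    """
    X = np.asarray(features, dtype=np.float64)        # shape (n, d)
    Y = np.eye(num_domains)[np.asarray(domain_ids)]   # shape (n, k), one-hot
    d = X.shape[1]
    # Solve (X^T X + lam * I) W = X^T Y instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)  # shape (d, k)


def route(features, W):
    """Pick the domain (expert index) with the highest regression score."""
    return np.argmax(np.asarray(features) @ W, axis=1)


# Synthetic stand-in for per-domain hidden features (hypothetical data).
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16)) * 4.0
domain_ids = rng.integers(0, 3, size=600)
features = centers[domain_ids] + rng.normal(size=(600, 16))

W = fit_ridge_router(features, domain_ids, num_domains=3)
accuracy = float(np.mean(route(features, W) == domain_ids))
print(f"router weights: {W.shape}, training accuracy: {accuracy:.3f}")
```

Because the solution depends on the data only through XᵀX and XᵀY, no gradient steps are needed, which is consistent with the training-free framing above. How DUME actually applies the regression inside the MoE layers is not described in this summary.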
