Lifelong Language Pretraining with Distribution-Specialized Experts
Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui

TL;DR
This paper introduces Lifelong-MoE, a dynamic mixture-of-experts architecture that adapts to new data distributions in language modeling while preserving prior knowledge, improving few-shot performance across multiple NLP tasks.
Contribution
The paper proposes Lifelong-MoE, a novel, extensible MoE architecture with regularized pretraining that enables efficient lifelong learning in language models.
Findings
Lifelong-MoE improves adaptation to data shifts with minimal additional capacity.
The model maintains previous knowledge while adapting to new data.
Lifelong-MoE outperforms existing lifelong learning methods on NLP tasks.
Abstract
Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only…
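The idea of growing an MoE layer by adding experts can be illustrated with a toy sketch. This is not the authors' implementation. The class names (`Expert`, `ExtensibleMoE`), the top-1 routing, the choice to freeze older experts when new ones are added, and the squared-difference penalty in `output_regularizer` are all assumptions made for illustration; the excerpt above does not specify the routing scheme or the form of the regularizer.

```python
import math
import random


class Expert:
    """Toy expert: one linear map standing in for a feed-forward block (hypothetical)."""

    def __init__(self, dim, rng):
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.frozen = False  # assumption: older experts may be frozen to preserve knowledge

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


class ExtensibleMoE:
    """Toy MoE layer whose expert pool can grow when a new data distribution arrives."""

    def __init__(self, dim, num_experts, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.experts = [Expert(dim, self.rng) for _ in range(num_experts)]
        # One gating vector per expert; routing logits are dot products with the input.
        self.gates = [self._new_gate() for _ in range(num_experts)]

    def _new_gate(self):
        return [self.rng.gauss(0.0, 0.1) for _ in range(self.dim)]

    def add_experts(self, count, freeze_old=True):
        """Append `count` fresh experts; optionally freeze the existing ones (assumption)."""
        if freeze_old:
            for expert in self.experts:
                expert.frozen = True
        for _ in range(count):
            self.experts.append(Expert(self.dim, self.rng))
            self.gates.append(self._new_gate())

    def route(self, x):
        """Softmax over gating logits, computed in a numerically stable way."""
        logits = [sum(g * xi for g, xi in zip(gate, x)) for gate in self.gates]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def forward(self, x):
        """Top-1 routing: send the input to the highest-probability expert only."""
        probs = self.route(x)
        top = max(range(len(probs)), key=probs.__getitem__)
        out = self.experts[top].forward(x)
        return [probs[top] * o for o in out], top

    def trainable_experts(self):
        return [i for i, expert in enumerate(self.experts) if not expert.frozen]


def output_regularizer(old_output, new_output):
    """Placeholder penalty: mean squared difference between old and new model outputs."""
    return sum((a - b) ** 2 for a, b in zip(old_output, new_output)) / len(old_output)
```

With top-1 routing, each input is processed by a single expert, so adding experts enlarges the parameter count without enlarging the per-input compute of this toy layer. Calling `add_experts(1)` on a two-expert layer leaves only the new expert in `trainable_experts()`.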
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
