Lifelong Language Pretraining with Distribution-Specialized Experts

Wuyang Chen; Yanqi Zhou; Nan Du; Yanping Huang; James Laudon; Zhifeng Chen; Claire Cui

arXiv:2305.12281·cs.CL·May 23, 2023·6 cites

TL;DR

This paper introduces Lifelong-MoE, a dynamic mixture-of-experts architecture that adapts to new data distributions in language modeling while preserving prior knowledge, improving few-shot performance across multiple NLP tasks.

Contribution

The paper proposes Lifelong-MoE, a novel, extensible MoE architecture with regularized pretraining that enables efficient lifelong learning in language models.

Findings

01. Lifelong-MoE improves adaptation to data shifts with minimal additional capacity.

02. The model maintains previous knowledge while adapting to new data.

03. Lifelong-MoE outperforms existing lifelong learning methods on NLP tasks.

Abstract

Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only…
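The idea of growing a Mixture-of-Experts layer while regularizing training can be illustrated with a toy sketch. Everything here is hypothetical and is not the authors' implementation: the class `ExpandableMoE`, the helpers `snapshot` and `distill_penalty`, the choice to freeze older experts, and the use of a KL penalty against a snapshot of the previous model are assumptions made for illustration, since the abstract only states that experts are added and that pretraining is regularized.

```python
import copy
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ExpandableMoE:
    """Toy MoE layer with linear experts (hypothetical, for illustration only)."""

    def __init__(self, dim, num_experts, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.experts = []  # one dim x dim weight matrix per expert
        self.gates = []    # one router weight vector per expert
        self.frozen = []   # True for experts that training should not update
        self.add_experts(num_experts)

    def add_experts(self, n, freeze_existing=False):
        """Append n fresh experts; optionally mark all earlier ones as frozen."""
        if freeze_existing:
            self.frozen = [True] * len(self.experts)
        for _ in range(n):
            self.experts.append(
                [[self.rng.gauss(0.0, 0.1) for _ in range(self.dim)]
                 for _ in range(self.dim)]
            )
            self.gates.append([self.rng.gauss(0.0, 0.1) for _ in range(self.dim)])
            self.frozen.append(False)

    def forward(self, x, top_k=1):
        """Route x to the top_k experts and mix their outputs by gate weight."""
        logits = [sum(g * v for g, v in zip(gate, x)) for gate in self.gates]
        probs = softmax(logits)
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
        norm = sum(probs[i] for i in top)
        out = [0.0] * self.dim
        for i in top:
            weight = probs[i] / norm
            y = [sum(w * v for w, v in zip(row, x)) for row in self.experts[i]]
            out = [o + weight * v for o, v in zip(out, y)]
        return out


def snapshot(moe):
    """Copy the model before expansion so it can act as a fixed reference."""
    return copy.deepcopy(moe)


def distill_penalty(old_out, new_out):
    """KL(softmax(old) || softmax(new)): an assumed regularizer, not the paper's."""
    p = softmax(old_out)
    q = softmax(new_out)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Usage: take a snapshot, grow the layer for a new distribution, measure drift.
moe = ExpandableMoE(dim=4, num_experts=2)
reference = snapshot(moe)
moe.add_experts(1, freeze_existing=True)
x = [1.0, 0.5, -0.5, 2.0]
penalty = distill_penalty(reference.forward(x), moe.forward(x))
```

In a real training loop, a penalty of this kind would presumably be added to the language-modeling loss with some weight, and gradient updates would skip experts flagged in `frozen`; neither step is shown here.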

Videos

Lifelong Language Pretraining with Distribution-Specialized Experts · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
