TL;DR
This paper empirically compares known methods for mitigating catastrophic forgetting (CF) during staged multi-domain pre-training of large language models. Elastic weight consolidation (EWC) gives the best overall scores, with only a 0.33% drop across seven generic tasks while remaining competitive on bio-medical tasks.
Contribution
It systematically evaluates known techniques for reducing catastrophic forgetting in multi-domain language model pre-training, highlighting elastic weight consolidation's effectiveness.
Findings
Elastic weight consolidation (EWC) yields only a 0.33% performance drop across seven generic tasks.
EWC remains competitive on bio-medical tasks.
Gradient and latent clustering improve data coverage in mitigation methods.
Abstract
Pre-training large language models has become a standard in the natural language processing community. Such models are pre-trained on generic data (e.g. BookCorpus and English Wikipedia) and often fine-tuned on tasks in the same domain. However, in order to achieve state-of-the-art performance on out-of-domain tasks such as clinical named entity recognition and relation extraction, additional in-domain pre-training is required. In practice, staged multi-domain pre-training presents performance deterioration in the form of catastrophic forgetting (CF) when evaluated on a generic benchmark such as GLUE. In this paper we conduct an empirical investigation into known methods to mitigate CF. We find that elastic weight consolidation provides the best overall scores, yielding only a 0.33% drop in performance across seven generic tasks while remaining competitive in bio-medical tasks. Furthermore,…
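For readers unfamiliar with elastic weight consolidation, the sketch below illustrates the standard form of its penalty: a quadratic term that anchors each parameter to its value after the earlier (generic) training stage, weighted by a diagonal Fisher information estimate. This is a minimal pure-Python illustration and not the paper's implementation. The function names, the flat-list parameter representation, and the strength `lam` are assumptions made for this example.

```python
def fisher_diagonal(per_example_grads):
    """Estimate the diagonal Fisher information as the mean of squared
    per-example gradients (hypothetical helper; gradients are flat lists)."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    theta      -- current parameters (being trained on the new domain)
    theta_star -- parameters saved after the earlier training stage
    fisher     -- diagonal Fisher estimate from the earlier stage's data
    lam        -- penalty strength (assumed hyperparameter)
    """
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for f, t, s in zip(fisher, theta, theta_star)
    )


def ewc_loss(new_domain_loss, theta, theta_star, fisher, lam):
    """Total objective: new-domain loss plus the EWC penalty."""
    return new_domain_loss + ewc_penalty(theta, theta_star, fisher, lam)


# Toy usage with made-up numbers, for illustration only.
grads = [[1.0, 2.0], [3.0, 0.0]]          # two per-example gradients
fisher = fisher_diagonal(grads)            # [5.0, 2.0]
penalty = ewc_penalty([1.0, 1.0], [0.0, 1.0], fisher, lam=2.0)  # 5.0
```

Parameters with a large Fisher value are penalized more for moving, which is how the earlier domain's performance is protected while the model continues training on in-domain data.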
Taxonomy
Methods
Experience Replay
