Simple and Scalable Strategies to Continually Pre-train Large Language Models

Adam Ibrahim; Benjamin Thérien; Kshitij Gupta; Mats L. Richter; Quentin Anthony; Timothée Lesort; Eugene Belilovsky; Irina Rish

arXiv:2403.08763 · cs.LG · September 5, 2024 · 3 cites

TL;DR

This paper presents simple, scalable continual pre-training strategies for large language models that match re-training performance with significantly less compute, effectively handling distribution shifts during updates.

Contribution

The paper introduces a combination of learning rate re-warming, learning rate re-decaying, and replay of previous data that enables efficient continual pre-training of LLMs across different distribution shifts.
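The re-warming and re-decaying named above can be illustrated with a small schedule function. This is a minimal sketch, not the authors' implementation: it assumes a linear warmup followed by a cosine decay that restarts whenever training moves to a new dataset, and the function name `rewarmed_cosine_lr` and all parameter names and values are hypothetical.

```python
import math


def rewarmed_cosine_lr(step, steps_per_stage, warmup_steps, max_lr, min_lr):
    """Learning rate at a global step when every stage (one dataset per stage)
    restarts a linear warmup followed by a cosine decay.

    Assumption: all stages have the same length; real runs may differ.
    """
    local = step % steps_per_stage  # position inside the current stage
    if local < warmup_steps:
        # Re-warming: climb linearly from min_lr back up to max_lr.
        return min_lr + (max_lr - min_lr) * local / warmup_steps
    # Re-decaying: cosine decay from max_lr down towards min_lr.
    progress = (local - warmup_steps) / max(1, steps_per_stage - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical settings: two stages of 1000 steps each.
    for s in (0, 10, 999, 1000, 1010):
        print(s, rewarmed_cosine_lr(s, 1000, 10, 3e-4, 3e-5))
```

In this sketch the learning rate is back at its minimum at the first step of the second stage and reaches its maximum again once the warmup ends, which is the re-warm then re-decay pattern the summary describes.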

Findings

01 Continual strategies match full re-training performance on multiple benchmarks.

02 Effective for models up to 10B parameters with large datasets.

03 Proposes alternative learning rate schedules to reduce forgetting.

Abstract

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets…
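Replay of previous data, as described in the abstract, can be pictured as mixing a fixed share of old examples into every batch drawn from the new dataset. The following is a rough sketch under that assumption only; the helper `sample_with_replay`, the in-memory lists, and the replay fraction used in the example are hypothetical and are not taken from the paper.

```python
import random


def sample_with_replay(new_data, old_data, batch_size, replay_fraction, rng):
    """Build one training batch that mixes new examples with replayed old ones.

    replay_fraction is the share of the batch drawn from the previous dataset.
    """
    n_replay = round(batch_size * replay_fraction)
    # Replayed examples from the previously seen dataset.
    batch = [rng.choice(old_data) for _ in range(n_replay)]
    # The remainder of the batch comes from the new dataset.
    batch += [rng.choice(new_data) for _ in range(batch_size - n_replay)]
    rng.shuffle(batch)  # avoid a fixed old-then-new ordering inside the batch
    return batch


if __name__ == "__main__":
    old = [("old", i) for i in range(100)]
    new = [("new", i) for i in range(100)]
    batch = sample_with_replay(new, old, batch_size=8, replay_fraction=0.25,
                               rng=random.Random(0))
    print(sum(1 for tag, _ in batch if tag == "old"), "replayed of", len(batch))
```

Sampling with a seeded `random.Random` keeps the example reproducible; a real data pipeline would stream tokenized documents instead of choosing from lists.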

Code & Models

Repositories

eleutherai/gpt-neox (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques