Simple and Scalable Strategies to Continually Pre-train Large Language Models
Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, Mats L. Richter, Quentin Anthony, Timothée Lesort, Eugene Belilovsky, and Irina Rish

TL;DR
This paper presents simple, scalable continual pre-training strategies for large language models that match re-training performance with significantly less compute, effectively handling distribution shifts during updates.
Contribution
It introduces a combination of learning rate re-warming, re-decaying, and data replay that enables efficient continual pre-training of LLMs across different distribution shifts.
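As a rough illustration of LR re-warming and re-decaying, the sketch below restarts a linear-warmup plus cosine-decay schedule each time a new dataset begins. The function names, the warmup starting at zero, and every hyperparameter value are assumptions chosen for illustration, not the paper's configuration.

```python
import math


def cosine_with_warmup(step, total_steps, warmup_steps, max_lr, min_lr):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def continual_lr(global_step, steps_per_dataset, warmup_steps, max_lr, min_lr):
    """Re-warm and re-decay: restart the schedule at the start of each dataset.

    Assumes every dataset is trained for the same number of steps
    (a simplification made for this sketch).
    """
    local_step = global_step % steps_per_dataset
    return cosine_with_warmup(local_step, steps_per_dataset, warmup_steps, max_lr, min_lr)


if __name__ == "__main__":
    # Illustrative values only.
    for step in (0, 10, 999, 1000, 1010):
        print(step, continual_lr(step, 1000, 10, 3e-4, 3e-5))
```

At the boundary between datasets the learning rate drops back to the start of the warmup and then climbs to `max_lr` again, which is the re-warming step; the cosine segment that follows is the re-decay.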
Findings
Continual strategies match full re-training performance on multiple benchmarks.
Results hold at the 405M and 10B parameter scales with datasets of hundreds of billions of tokens.
Alternatives to the cosine learning rate schedule help circumvent the forgetting induced by LR re-warming.
Abstract
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets…
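Replay of previous data can be pictured as reserving a fixed share of every training batch for examples from the old dataset. The following is a minimal sketch under that assumption; `replay_batch`, the 5% replay fraction, and the toy datasets are hypothetical and not taken from the paper's code.

```python
import random


def replay_batch(new_data, old_data, batch_size, replay_fraction, rng):
    """Build one batch in which a fixed share of examples is replayed old data."""
    n_old = round(batch_size * replay_fraction)
    n_new = batch_size - n_old
    # Sample without replacement from each pool, then shuffle so that
    # replayed examples are not grouped at the front of the batch.
    batch = rng.sample(old_data, n_old) + rng.sample(new_data, n_new)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    # Toy stand-ins for tokenized documents from the two datasets.
    old = [("old", i) for i in range(100)]
    new = [("new", i) for i in range(100)]
    batch = replay_batch(new, old, batch_size=20, replay_fraction=0.05, rng=random.Random(0))
    print(sum(1 for source, _ in batch if source == "old"), "replayed of", len(batch))
```

The replay fraction trades adaptation to the new data against retention of the old data, so in practice it would be tuned to the strength of the distribution shift.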
Code & Models
- 🤗 H-D-T/Buzz-8b-Large-v0.5 · model · 16 dl · ♡ 29
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-GGUF · model · 14 dl · ♡ 1
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-3.0bpw-h6-exl2 · model · 4 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-4.0bpw-h6-exl2 · model · 2 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-5.0bpw-h6-exl2 · model · 3 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-6.0bpw-h6-exl2 · model · 3 dl
- 🤗 LoneStriker/Buzz-8b-Large-v0.5-8.0bpw-h8-exl2 · model · 3 dl
- 🤗 QuantFactory/Buzz-8b-Large-v0.5-GGUF · model · 73 dl
- 🤗 afrideva/Buzz-8b-Large-v0.5-GGUF · model · 26 dl
- 🤗 akswelh/NEOX · model
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
