Efficient Continual Pre-training by Mitigating the Stability Gap

Yiduo Guo; Jie Fu; Huishuai Zhang; Dongyan Zhao; Yikang Shen

arXiv:2406.14833·cs.CL·June 28, 2024·1 cites

TL;DR

This paper investigates the stability gap during continual pre-training of LLMs and proposes strategies to mitigate performance drops, leading to more efficient domain adaptation and improved medical task performance.

Contribution

It introduces three novel strategies to reduce the stability gap in continual pre-training, enhancing efficiency and performance of LLMs in new domains.

Findings

01 Strategies improve medical task performance from 36.2% to 40.7% with less training

02 Enhanced models outperform current open-source models on medical benchmarks

03 Proposed methods enable faster recovery and better domain adaptation

Abstract

Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a subset with a proper size for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single…
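As a rough illustration of strategy (1), the sketch below compares two ways of spending the same token budget: one pass over a large corpus versus several passes over a smaller subset. The names `Schedule`, `single_epoch_schedule`, and `multi_epoch_schedule`, and all token counts, are hypothetical and chosen for illustration only; they are not taken from the paper, and the sketch does not show how to pick a "proper" subset size.

```python
from dataclasses import dataclass


@dataclass
class Schedule:
    """A training plan: how many tokens are seen per epoch, and how many epochs."""
    tokens_per_epoch: int
    epochs: int

    @property
    def total_tokens(self) -> int:
        # Total tokens processed, used here as a proxy for compute.
        return self.tokens_per_epoch * self.epochs


def single_epoch_schedule(budget_tokens: int) -> Schedule:
    """Baseline: one pass over a corpus as large as the whole budget."""
    return Schedule(tokens_per_epoch=budget_tokens, epochs=1)


def multi_epoch_schedule(budget_tokens: int, subset_tokens: int) -> Schedule:
    """Alternative: repeat a smaller subset as many whole times as the budget allows."""
    if subset_tokens <= 0 or subset_tokens > budget_tokens:
        raise ValueError("subset_tokens must be in (0, budget_tokens]")
    return Schedule(tokens_per_epoch=subset_tokens,
                    epochs=budget_tokens // subset_tokens)


# Hypothetical numbers: a 50B-token budget and a 10B-token subset.
baseline = single_epoch_schedule(50_000_000_000)
repeated = multi_epoch_schedule(50_000_000_000, 10_000_000_000)
print(baseline.epochs, repeated.epochs)
print(baseline.total_tokens == repeated.total_tokens)
```

Both schedules consume the same number of tokens, so any difference in how quickly performance recovers would come from how the data is presented rather than from extra compute.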

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer