D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models

Haoran Que; Jiaheng Liu; Ge Zhang; Chenchen Zhang; Xingwei Qu; Yinghao Ma; Feiyu Duan; Zhiqi Bai; Jiakai Wang; Yuanxing Zhang; Xu Tan; Jie Fu; Wenbo Su; Jiamang Wang; Lin Qu; Bo Zheng

arXiv:2406.01375 · cs.CL · June 4, 2024 · 3 cites

TL;DR

This paper introduces the D-CPT Law, a scaling law that predicts optimal domain-specific continual pre-training configurations for large language models, reducing costly trial-and-error and enabling efficient domain adaptation.

Contribution

The paper proposes the D-CPT Law and Cross-Domain D-CPT Law, enabling accurate prediction of optimal mixture ratios and performance for domain-specific LLM pre-training with minimal training costs.

Findings

01. D-CPT Law accurately predicts performance across various domains.

02. Cross-Domain D-CPT Law requires only 1% of training costs for target domain prediction.

03. Experimental results validate the effectiveness and generalizability of the proposed laws.

Abstract

Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For the CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general-corpus (e.g., Dolma, Slim-pajama) and the downstream domain-corpus. Existing methods usually adopt laborious human efforts by grid-searching on a set of mixture ratios, which require high GPU training consumption costs. Besides, we cannot guarantee the selected ratio is optimal for the specific domain. To address the limitations of existing methods, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of the Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of…
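To make the contrast with grid search concrete, here is a minimal Python sketch of the general idea: fit a cheap parametric curve to a few pilot runs, then search over predicted losses instead of training one model per candidate ratio. Everything in it is an assumption for illustration. The power-law forms, the weighted objective, the helper names `fit_power_law` and `predicted_objective`, and the synthetic pilot numbers are hypothetical and are not the D-CPT Law itself, whose parameterization is not stated in the truncated abstract above.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope


# Synthetic pilot measurements, invented for this sketch (NOT results from the paper).
# r is the fraction of domain-corpus tokens; 1 - r is the general-corpus fraction.
pilot_ratios = [0.2, 0.5, 0.8]
domain_loss = [1.5 * r ** -0.3 for r in pilot_ratios]
general_loss = [2.0 * (1 - r) ** -0.1 for r in pilot_ratios]

# Hypothetical assumption: each loss follows a power law in its own data fraction.
a_d, b_d = fit_power_law(pilot_ratios, domain_loss)
a_g, b_g = fit_power_law([1 - r for r in pilot_ratios], general_loss)


def predicted_objective(r, weight=0.5):
    """Weighted sum of predicted domain and general losses at mixture ratio r."""
    return weight * a_d * r ** -b_d + (1 - weight) * a_g * (1 - r) ** -b_g


# Searching over predictions is cheap: no training run per candidate ratio.
candidates = [i / 100 for i in range(1, 100)]
best_ratio = min(candidates, key=predicted_objective)
print(f"predicted best domain ratio: {best_ratio:.2f}")
```

The design point is only that the expensive step (training) happens for a handful of pilot ratios, while the dense search happens over the fitted curve. A real fit would need noisy measurements, more pilot runs, and the functional form the paper actually proposes.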

Videos

D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training