Efficient Continual Pre-training for Building Domain Specific Large Language Models
Yong Xie, Karan Aggarwal, Aitzaz Ahmad

TL;DR
This paper presents a cost-effective method for developing domain-specific large language models through continual pre-training, demonstrated with a financial domain model that outperforms the base model on domain tasks.
Contribution
It introduces a continual pre-training approach for domain adaptation of LLMs, with effective data selection strategies that reduce training data and costs.
Findings
Continual pre-training improves domain-specific task performance.
Data selection strategies outperform vanilla pre-training with less data.
Costs are reduced without sacrificing open-domain task performance.
Abstract
Large language models (LLMs) have demonstrated remarkable open-domain capabilities. LLMs tailored for a domain are typically trained entirely on domain corpus to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs over an existing open-domain LLM. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. Continual pre-trained FinPythia showcases consistent improvements on financial tasks over the original foundational model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training's performance with just 10% of corpus size and cost, without any degradation on open-domain standard tasks. Our work proposes an alternative…
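The abstract does not spell out how the selected 10% of the corpus is chosen; one of the reviews below describes the selection as operating at the embedding level. As a rough illustration only, and not the authors' procedure, the sketch below ranks corpus documents by cosine similarity to the centroid of some task-example embeddings and keeps the top fraction. The function names, the centroid-based scoring, and the assumption that embeddings are already available as plain lists of floats are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_top_fraction(corpus_embeddings, task_embeddings, fraction=0.10):
    """Return indices of the corpus documents most similar to the task centroid.

    Hypothetical helper: assumes precomputed embeddings and a non-empty
    task_embeddings list. Keeps ceil(fraction * corpus size) documents,
    and at least one when the corpus is non-empty.
    """
    target = centroid(task_embeddings)
    ranked = sorted(
        range(len(corpus_embeddings)),
        key=lambda i: cosine(corpus_embeddings[i], target),
        reverse=True,
    )
    keep = max(1, math.ceil(fraction * len(corpus_embeddings)))
    return ranked[:keep]
```

In practice the returned indices would pick the subset of domain documents passed on to continual pre-training; how the embeddings are produced, and whether a centroid is the right similarity target, are left open here.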
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper has an appealing overall structure (DACP vs baseline, modifications of DACP vs DACP on domain-performance, mod vs DACP on general performance). The questions addressed are of immediate topical relevance to researchers and practitioners of LLMs. The fact that the proposed modifications to DACP yield better specific performance and better generalization performance is a nice contribution.
The fundamental questions of the paper are valuable to address, but the conclusions drawn from the answers provided by this paper are not as comprehensive or original as one might hope. The demonstration that training a pretrained LLM with continued pretraining on a domain-specific dataset is more efficient than training from scratch is not a surprise, nor a novelty. The results of the proposed efficient-DACP methods are generally positive (though not universally, and the deficiencies are not ex
1. The paper contributes FinPythia-6.9B, a foundation model for the financial domain built via continual pre-training. FinPythia-6.9B outperforms the original LLM on a series of financial-domain tasks, which showcases the feasibility of building domain-specific LLMs in a cost-effective manner. 2. Improving continual pre-training from the data selection aspect is interesting. This paper conducts extensive experiments on different data selection methods and the gained insights can be useful to the comm
1. The experimental tasks are mainly classification tasks, which is limiting: LLMs are powerful and should be evaluated on more complicated tasks, or at least on some generation tasks. I know the paper conducts a qualitative evaluation on some QA samples. Is there any generation task in the financial domain that you can use to systematically evaluate FinPythia-6.9B? 2. This is mainly an empirical paper and lacks solid theoretical support. 3. ETS gives better results than ETA, but I'm wonder
1. DAPT/DACP is an important and practical problem 2. The proposed data selection method is simple to use and easy to understand
Domain-adaptive pre-training (DAPT), or DACP, is not a novel concept, and the main innovation of this paper lies in the proposed data selection method. However, the paper lacks comparisons with other DAPT baselines, which is a significant drawback. For example, some prior works have explored modifications to the DAPT loss or gradient. Notably, [1] addresses continual pre-training but is not compared against in this work, nor are any of the other baseline systems mentioned in [1]. [1]: Adapting a Language
* The paper shows that continual pre-training can improve an LLM's performance on domain-specific tasks. * The authors use embedding-level selection to acquire the essential data samples and show that, with 10% of the data, pre-training can achieve performance comparable to using a much larger amount of data.
* This paper only demonstrates that the proposed pipeline with data selection is efficient on one type of LLM, Pythia, which is insufficient to support the claim of efficiency advantages for other types of LLMs, e.g., LLaMA, OPT. * The comparison with other baseline methods is not fair because more data is used for training in continual pre-training. To illustrate the effectiveness of continual pre-training, the authors should apply the proposed method to other types of LLMs.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
