Efficient Continual Pre-training for Building Domain Specific Large Language Models

Yong Xie; Karan Aggarwal; Aitzaz Ahmad

arXiv:2311.08545·cs.CL·January 13, 2026·5 cites

Open Access 4 Reviews

TL;DR

This paper presents a cost-effective method for developing domain-specific large language models through continual pre-training, demonstrated with a financial domain model that outperforms the base model on domain tasks.

Contribution

It introduces a continual pre-training approach for domain adaptation of LLMs, with effective data selection strategies that reduce training data and costs.

Findings

01 Continual pre-training improves performance on domain-specific tasks.

02 Data selection strategies outperform vanilla continual pre-training while using only 10% of the corpus.

03 The cost reduction comes without degrading performance on open-domain tasks.

Abstract

Large language models (LLMs) have demonstrated remarkable open-domain capabilities. LLMs tailored for a domain are typically trained entirely on domain corpus to excel at handling domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs over an existing open-domain LLM. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. Continual pre-trained FinPythia showcases consistent improvements on financial tasks over the original foundational model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training's performance with just 10% of corpus size and cost, without any degradation on open-domain standard tasks. Our work proposes an alternative…
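The data selection idea in the abstract can be illustrated with a small sketch. The abstract does not specify the selection criterion, so this example assumes one plausible variant: rank domain documents by cosine similarity between their embeddings and the centroid of a few task-example embeddings, then keep the top 10%. The function names (`cosine`, `mean_vector`, `select_top_fraction`) and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_top_fraction(
    doc_embeddings: Sequence[Sequence[float]],
    task_embeddings: Sequence[Sequence[float]],
    fraction: float = 0.10,
) -> List[int]:
    """Return sorted indices of the documents closest to the task centroid.

    Assumption (not from the paper): similarity to the mean task embedding
    is used as the relevance score, and at least one document is kept.
    """
    centroid = mean_vector(task_embeddings)
    scores = [cosine(doc, centroid) for doc in doc_embeddings]
    keep = max(1, int(len(doc_embeddings) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: four documents, two task examples, keep half of the corpus.
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
task = [[1.0, 0.0], [1.0, 0.2]]
selected = select_top_fraction(docs, task, fraction=0.5)
print(selected)
```

In practice the embeddings would come from a sentence or document encoder, and the selected subset would be the only data passed to the continual pre-training run, which is where a cost saving of the kind described in the abstract would come from.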

Peer Reviews

Decision: ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

The paper has an appealing overall structure (DACP vs baseline, modifications of DACP vs DACP on domain-performance, mod vs DACP on general performance). The questions addressed are of immediate topical relevance to researchers and practitioners of LLMs. The fact that the proposed modifications to DACP yield better specific performance and better generalization performance is a nice contribution.

Weaknesses

The fundamental questions of the paper are valuable to address, but the conclusions drawn from the answers provided by this paper are not as comprehensive or original as one might hope. The demonstration that training a pretrained LLM with continued pretraining on a domain-specific dataset is more efficient than training from scratch is not a surprise, nor a novelty. The results of the proposed efficient-DACP methods are generally positive (though not universally, and the deficiencies are not ex

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

1. The paper contributes FinPythia-6.9B, a foundation model for the financial domain built via continual pre-training. FinPythia-6.9B outperforms the original LLM on a series of tasks in the financial domain, which showcases the feasibility of building domain-specific LLMs in a cost-effective manner.
2. Improving continual pre-training from the data selection aspect is interesting. This paper conducts extensive experiments on different data selection methods and the gained insights can be useful to the comm

Weaknesses

1. The experimental tasks are mainly classification tasks, which is limiting: the LLM is powerful and should be evaluated on more complicated tasks, or at least on some generation tasks. I know the paper conducts a qualitative evaluation on some QA samples. Is there any generation task in the financial domain that you can use to systematically evaluate FinPythia-6.9B?
2. This is mainly an empirical paper and does not have solid theoretical support.
3. ETS gives better result than ETA, but I'm wonder

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. DAPT/DACP is an important and practical problem.
2. The proposed data selection method is simple to use and easy to understand.

Weaknesses

In this paper, domain-adaptive pre-training (DAPT), or DACP, is not a novel concept, and the main innovation lies in the proposed data selection method. However, the paper lacks comparisons with other DAPT baselines, which is a significant drawback. For example, some prior works have explored modifications to the DAPT loss or gradient. Notably, [1] addresses continual pre-training but is not compared against in this work, nor are any of the other baseline systems mentioned in [1].
[1]: Adapting a Language

Reviewer 04 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

* The paper proves that continual pre-training can improve the LLM's performance on domain-specific tasks.
* The authors use embedding-level selection to acquire the essential data samples and show that, with 10% of the data, pre-training can achieve comparable performance without using a large amount of data.

Weaknesses

* This paper only demonstrates that the proposed pipeline with data selection is efficient on one type of LLM, Pythia, which is insufficient to support the claim of efficiency advantages for other types of LLMs, e.g., LLaMA, OPT.
* The comparison with other baseline methods is not fair because more data is used for training in continual pre-training. To illustrate the effectiveness of continual pre-training, the authors should apply the proposed method to other types of LLMs.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning