Towards Effective and Efficient Continual Pre-training of Large Language Models

Jie Chen; Zhipeng Chen; Jiapeng Wang; Kun Zhou; Yutao Zhu; Jinhao Jiang; Yingqian Min; Wayne Xin Zhao; Zhicheng Dou; Jiaxin Mao; Yankai Lin; Ruihua Song; Jun Xu; Xu Chen; Rui Yan; Zhewei Wei; Di Hu; Wenbing Huang; Ji-Rong Wen

arXiv:2407.18743·cs.CL·July 29, 2024·2 cites

TL;DR

This paper presents a continual pre-training approach for Llama-3 (8B) that uses synthetic data to improve Chinese language and scientific reasoning abilities without degrading the model's original capabilities, as shown by gains on several benchmarks.

Contribution

The paper reports a continual pre-training recipe for Llama-3 (8B) built on data mixture and curriculum strategies. It combines existing datasets with synthesized scientific QA pairs to improve Chinese language and scientific reasoning abilities while preserving the model's original abilities.

Findings

01. Improved performance on general abilities (+8.81 on C-Eval, +6.31 on CMMLU)

02. Enhanced scientific reasoning (+12.00 on MATH, +4.13 on SciEval)

03. Effective data synthesis and curriculum strategies

Abstract

Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original abilities, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets. Specifically, we synthesize multidisciplinary scientific question and answer (QA) pairs based on related web pages, and subsequently incorporate these synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present the tuning experiments…
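The abstract mentions data mixture and curriculum strategies without giving details in this excerpt. As a rough illustration of the general idea only, the sketch below samples training batches from several data sources with per-stage weights, introducing synthetic QA data in a later stage. The source names, stage names, and ratios are all hypothetical placeholders, not the values used in the paper.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Hypothetical curriculum: each stage has its own sampling weights over sources.
# These names and ratios are illustrative assumptions, not the paper's settings.
CURRICULUM: List[dict] = [
    {"name": "stage_1_bilingual",
     "weights": {"english_web": 0.5, "chinese_web": 0.5}},
    {"name": "stage_2_synthetic",
     "weights": {"english_web": 0.4, "chinese_web": 0.4,
                 "synthetic_science_qa": 0.2}},
]


def sample_batch(sources: Dict[str, List[str]], weights: Dict[str, float],
                 batch_size: int, rng: random.Random) -> List[Tuple[str, str]]:
    """Draw a batch of (source_name, document) pairs according to the weights."""
    names = list(weights)
    probs = [weights[name] for name in names]
    picked = rng.choices(names, weights=probs, k=batch_size)
    return [(name, rng.choice(sources[name])) for name in picked]


def run_curriculum(sources: Dict[str, List[str]], curriculum: List[dict],
                   batches_per_stage: int, batch_size: int,
                   seed: int = 0) -> Iterator[Tuple[str, List[Tuple[str, str]]]]:
    """Yield (stage_name, batch) pairs, visiting the stages in order."""
    rng = random.Random(seed)
    for stage in curriculum:
        for _ in range(batches_per_stage):
            yield stage["name"], sample_batch(
                sources, stage["weights"], batch_size, rng)


if __name__ == "__main__":
    # Toy in-memory corpora standing in for real datasets.
    toy_sources = {
        "english_web": ["en doc 1", "en doc 2"],
        "chinese_web": ["zh doc 1", "zh doc 2"],
        "synthetic_science_qa": ["Q: ... A: ..."],
    }
    for stage_name, batch in run_curriculum(toy_sources, CURRICULUM, 2, 4):
        print(stage_name, [src for src, _ in batch])
```

Keeping the original-domain sources in every stage is one common way to limit forgetting while new data is introduced; whether the paper's schedule follows this exact shape is not stated in the text above.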

Code & Models

Datasets

RUC-AIBOX/Llama-3-SynE-Dataset
dataset · 1.0k downloads

Videos

Towards Effective and Efficient Continual Pre-training of Large Language Models· underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis