Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale
Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou

TL;DR
This paper investigates cross-lingual continual pre-training (CPT) of large language models across 40 model sizes, showing that CPT is resource-efficient and follows an extended scaling law, and analyzing transferability and catastrophic forgetting.
Contribution
It studies CPT from an existing pretrained LLM as a scalable route to a model for a new language, extends the scaling law of Hoffmann et al. (2022) with a joint data-parameter term, and analyzes transfer effects and the mitigation of catastrophic forgetting.
Findings
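CPT converges faster than training from randomly initialized parameters and saves significant resources in a scalable manner.
CPT follows an extended scaling law, derived from Hoffmann et al. (2022), with a joint data-parameter scaling term.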
Replaying data mitigates catastrophic forgetting effectively.
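As a rough illustration of the data-replay idea in the last finding, the sketch below mixes a fraction of source-language documents back into the new-language training stream. The function name `mix_with_replay`, the `replay_ratio` parameter, and the document-level sampling are assumptions made for this example; the summary does not describe the paper's actual replay setup.

```python
import random


def mix_with_replay(new_lang_docs, source_lang_docs, replay_ratio, seed=0):
    """Return a shuffled training stream in which about `replay_ratio` of the
    documents are replayed from the source-language corpus.

    Hypothetical helper for illustration; not the paper's procedure.
    """
    if not 0.0 <= replay_ratio < 1.0:
        raise ValueError("replay_ratio must be in [0, 1)")
    rng = random.Random(seed)
    # Choose the replay count so that replayed / (new + replayed) ~= replay_ratio.
    n_replay = round(len(new_lang_docs) * replay_ratio / (1.0 - replay_ratio))
    # Never ask for more replay documents than the source corpus holds.
    n_replay = min(n_replay, len(source_lang_docs))
    replayed = rng.sample(list(source_lang_docs), n_replay)
    mixed = list(new_lang_docs) + replayed
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    new_docs = [f"new_{i}" for i in range(80)]
    old_docs = [f"src_{i}" for i in range(100)]
    stream = mix_with_replay(new_docs, old_docs, replay_ratio=0.2)
    print(len(stream), sum(d.startswith("src_") for d in stream))
```

Sampling without replacement keeps the replayed set free of duplicates. A token-level budget instead of a document count would be a natural variation.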
Abstract
In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language by continually pretraining (CPT) from existing pretrained LLMs, instead of using randomly initialized parameters. Based on parallel experiments on 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) The compute-optimal data-parameter allocation for CPT markedly differs based on our estimated scaling factors; 4) The effectiveness of transfer at scale is…
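For orientation, the parametric loss of Hoffmann et al. (2022) that the abstract builds on is written out below, with N the number of parameters, D the number of training tokens, and E, A, B, α, β fitted constants. The second line is only an illustrative assumption of what a joint data-parameter term could look like. The exponent γ and that exact form are not given in this summary and should not be read as the paper's equation.

```latex
% Hoffmann et al. (2022): loss as a function of parameters N and tokens D.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Illustrative assumption only: a data term that also depends on N.
% \gamma is a hypothetical exponent; \gamma = 0 recovers the form above.
L_{\mathrm{joint}}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta} N^{\gamma}}
```

Under a coupling of this kind, the data needed to reach a given loss depends on model size, which is one way the compute-optimal data-parameter allocation could shift.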
Taxonomy
Topics: Second Language Learning and Teaching · Educational and Psychological Assessments · EFL/ESL Teaching and Learning
