Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

Wenzhen Zheng; Wenbo Pan; Xu Xu; Libo Qin; Li Yue; Ming Zhou

arXiv:2407.02118·cs.CL·October 3, 2024

TL;DR

This paper studies cross-lingual continual pre-training (CPT) of large language models across 40 model sizes, from 40M to 5B parameters. It shows that CPT saves resources and follows an extended scaling law, and it analyzes transferability and catastrophic forgetting.

Contribution

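The paper builds an LLM for a new language by continually pre-training an existing pretrained LLM instead of starting from randomly initialized parameters. It extends the Hoffmann et al. (2022) scaling law with a joint data-parameter term and analyzes transfer effects and the mitigation of catastrophic forgetting.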

Findings

01

CPT converges faster than training from randomly initialized parameters and saves significant resources in a scalable manner.

02

CPT follows an extended scaling law with joint data-parameter scaling.

03

Replaying data mitigates catastrophic forgetting effectively.
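
To make finding 03 concrete, here is a minimal sketch of data replay: while training on the new (target) language, a fraction of samples is drawn from the original (source) language corpus. The function name `mix_with_replay`, the `replay_ratio` parameter, and the per-sample mixing scheme are all hypothetical; this page does not state how the authors implement replay or which ratio they use.

```python
import random
from typing import List, Sequence


def mix_with_replay(
    target_docs: Sequence[str],
    source_docs: Sequence[str],
    replay_ratio: float,
    num_samples: int,
    seed: int = 0,
) -> List[str]:
    """Build a training stream that replays source-language documents.

    Each sample comes from `source_docs` with probability `replay_ratio`
    and from `target_docs` otherwise. Illustrative scheme only, not the
    paper's implementation.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must be in [0, 1]")
    rng = random.Random(seed)  # local RNG keeps the stream reproducible
    stream = []
    for _ in range(num_samples):
        pool = source_docs if rng.random() < replay_ratio else target_docs
        stream.append(rng.choice(pool))
    return stream


if __name__ == "__main__":
    target = ["zh_doc_1", "zh_doc_2", "zh_doc_3"]  # placeholder documents
    source = ["en_doc_1", "en_doc_2"]              # placeholder documents
    print(mix_with_replay(target, source, replay_ratio=0.2, num_samples=10))
```

Setting `replay_ratio=0.0` gives plain CPT on the target language only. Larger values trade target-language exposure for retention of the source language.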

Abstract

In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language by continually pretraining (CPT) from existing pretrained LLMs, instead of using randomly initialized parameters. Based on parallel experiments on 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) The compute-optimal data-parameter allocation for CPT markedly differs based on our estimated scaling factors; 4) The effectiveness of transfer at scale is…
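
As a rough illustration of finding 2), the sketch below compares the parametric loss from Hoffmann et al. (2022), L(N, D) = E + A / N^alpha + B / D^beta, with one possible way of adding a joint data-parameter term. The exact form of the joint term is not given in the text above, so the version here (dividing the data term by N^gamma) is an assumption. The name `joint_term_loss`, the exponent `gamma`, and all coefficient values are placeholders, not results from the paper.

```python
def chinchilla_loss(n_params: float, n_tokens: float, E: float, A: float,
                    alpha: float, B: float, beta: float) -> float:
    """Hoffmann et al. (2022) form: L = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


def joint_term_loss(n_params: float, n_tokens: float, E: float, A: float,
                    alpha: float, B: float, beta: float, gamma: float) -> float:
    """ASSUMED extension: the data term also scales with model size.

    L = E + A / N^alpha + B / (D^beta * N^gamma).
    With gamma = 0 this reduces to the Hoffmann et al. form.
    """
    return E + A / n_params**alpha + B / (n_tokens**beta * n_params**gamma)


# Placeholder coefficients for illustration only; NOT values fitted in the paper.
COEFFS = dict(E=1.5, A=400.0, alpha=0.3, B=400.0, beta=0.3)

if __name__ == "__main__":
    n_tokens = 1e10
    for n_params in (4e7, 5e9):  # the smallest and largest sizes in the abstract
        base = chinchilla_loss(n_params, n_tokens, **COEFFS)
        joint = joint_term_loss(n_params, n_tokens, gamma=0.1, **COEFFS)
        print(f"N={n_params:.0e}: base={base:.3f}, joint={joint:.3f}")
```

Under this assumed form, a positive `gamma` means each token reduces loss more for larger models, which is one way a joint term could shift the compute-optimal split between data and parameters.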

Taxonomy

Topics: Second Language Learning and Teaching · Educational and Psychological Assessments · EFL/ESL Teaching and Learning