CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models

Jiawei Gu; Zacc Yang; Chuanghao Ding; Rui Zhao; Fei Tan

arXiv:2407.17467·cs.CL·October 8, 2024

TL;DR

This paper introduces the CMR scaling law, a power-law relationship that predicts the critical mixture ratio of general and domain-specific data for continual pre-training of LLMs, replacing the heuristic choice of that ratio.

Contribution

The paper formalizes the trade-off between general and domain-specific capabilities as the Critical Mixture Ratio (CMR) and shows that it can guide the choice of data mixture in LLM continual pre-training.

Findings

01 Discovered a power-law relationship between loss, mixture ratio, and training scale.

02 Defined the Critical Mixture Ratio (CMR) for balancing general and domain-specific capabilities.

03 Validated the CMR scaling law through extensive experiments.

Abstract

Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary corpus. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while replaying general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however, has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we attempt to re-visit the scaling behavior of LLMs under the hood of CPT, and discover a power-law relationship between loss, mixture ratio, and training tokens scale. We formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general and domain data. By striking the balance, CMR maintains the model's general…
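
To make the two ideas in the abstract concrete, here is a minimal Python sketch under stated assumptions. It is not the paper's formulation: the functional form `loss = a * tokens**(-b)`, the tolerance-based selection rule, the function names `fit_power_law` and `critical_ratio`, and all numbers are hypothetical and chosen only for illustration. The first function fits a plain power law by least squares in log-log space; the second picks the largest domain-data ratio whose general-corpus loss stays within a tolerance of a baseline.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on (log x, log y).

    Assumed functional form for illustration; not taken from the paper.
    Returns the pair (a, b).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


def critical_ratio(domain_ratios, general_losses, baseline, tolerance):
    """Return the largest domain-data ratio whose general loss stays
    within `tolerance` of `baseline`, or None if no ratio qualifies.

    Hypothetical selection rule, used here only to illustrate a trade-off.
    """
    feasible = [
        ratio
        for ratio, loss in zip(domain_ratios, general_losses)
        if loss <= baseline + tolerance
    ]
    return max(feasible) if feasible else None


# Synthetic data generated from a known power law (a = 5.0, b = 0.1).
tokens = [1e8, 2e8, 4e8, 8e8, 1.6e9]
losses = [5.0 * t ** -0.1 for t in tokens]
a, b = fit_power_law(tokens, losses)

# Made-up general-corpus losses measured at four domain-data ratios.
ratios = [0.1, 0.3, 0.5, 0.7]
general_losses = [2.00, 2.01, 2.04, 2.10]
chosen = critical_ratio(ratios, general_losses, baseline=2.00, tolerance=0.05)
```

With these made-up numbers the fit recovers the generating parameters, and `chosen` is the largest ratio that keeps the general loss within the tolerance. In practice the loss values would come from small-scale training runs rather than a synthetic formula.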
