TL;DR
This paper shows that when large language models are trained on data mixtures, knowledge acquisition from the knowledge-dense portion can undergo abrupt phase transitions as model size or mixing ratio changes, an effect the authors attribute to how the model allocates its capacity.
Contribution
The paper identifies phase transitions in knowledge acquisition when LLMs are trained on data mixtures, and supports the observation with a theoretical framework and controlled experiments.
Findings
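As model size increases to a critical value, the model abruptly transitions from memorizing very few of the biographies to memorizing most of them.
Below a critical mixing ratio, the model memorizes almost nothing even with extensive training; above it, memorization increases rapidly.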
Critical mixing ratios follow a power-law relationship with model size.
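The power-law finding above can be illustrated with a log-log linear fit. The following is a minimal sketch, not the authors' procedure: the function name `fit_power_law`, the model sizes, and the prefactor and exponent (including the exponent's sign) are hypothetical placeholders, and the data points are synthetic rather than results from the paper.

```python
import math


def fit_power_law(sizes, ratios):
    """Fit ratio = a * size**b by least squares on log(ratio) vs log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(r) for r in ratios]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the ordinary least-squares line in log-log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic placeholder data generated from a known power law.
# These are NOT measurements from the paper; the values are arbitrary.
true_a, true_b = 2.0, -0.5
sizes = [1e7, 1e8, 1e9, 1e10]          # hypothetical parameter counts
ratios = [true_a * s ** true_b for s in sizes]  # hypothetical critical ratios

a, b = fit_power_law(sizes, ratios)
print(f"fitted prefactor a = {a:.4f}, fitted exponent b = {b:.4f}")
```

With real measurements, a power law would show up as an approximately straight line in log-log space, and the fitted slope `b` would estimate the exponent.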
Abstract
Large Language Models (LLMs) are typically trained on data mixtures: most data come from web scrapes, while a small portion is curated from high-quality sources with dense domain-specific knowledge. In this paper, we show that when training LLMs on such data mixtures, knowledge acquisition from knowledge-dense datasets, unlike training exclusively on knowledge-dense data (arXiv:2404.05405), does not always follow a smooth scaling law but can exhibit phase transitions with respect to the mixing ratio and model size. Through controlled experiments on a synthetic biography dataset mixed with web-scraped data, we demonstrate that: (1) as we increase the model size to a critical value, the model suddenly transitions from memorizing very few to most of the biographies; (2) below a critical mixing ratio, the model memorizes almost nothing even with extensive training, but beyond this…