Scaling Parameter-Constrained Language Models with Quality Data
Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, Vikas Chandra

TL;DR
This paper extends language model scaling laws by incorporating data quality metrics, specifically effective training tokens based on diversity and syntheticity, to better predict model performance.
Contribution
It introduces a new formulation of effective training tokens that accounts for data quality, enhancing the understanding of scaling laws in parameter-constrained language models.
Findings
Effective training tokens correlate with model accuracy (Pearson +0.83).
Data quality metrics improve predictions of model performance.
The paper analyzes how data sampling and synthesis techniques affect data quality.
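To make the correlation figure above concrete, here is a minimal sketch of how a Pearson coefficient between predicted and observed accuracies could be computed. The function name `pearson_r` and the sample numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Centered values: deviations from each mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return covariance / norm

# Hypothetical accuracies (illustrative only, not taken from the paper).
predicted = [0.41, 0.45, 0.52, 0.58, 0.63]
observed = [0.40, 0.47, 0.50, 0.60, 0.61]
print(f"Pearson r = {pearson_r(predicted, observed):.2f}")
```

A value near +1 means the predictions rank and space the models much as the measured accuracies do; the paper reports +0.83 for its own setting.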
Abstract
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling law by offering a microscopic view of data quality within the original formulation -- effective training tokens -- which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size,…
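As a rough illustration of the idea in the abstract, the sketch below takes a conventional loss-versus-size scaling form and swaps the raw token count for an effective token count. Everything specific here is an assumption made for illustration: the function name `predicted_loss`, the single `quality` multiplier (standing in for whatever combination of diversity and syntheticity the paper actually uses), and all constants, which are placeholders rather than the paper's fitted values.

```python
def predicted_loss(n_params, n_tokens, quality=1.0,
                   E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Power-law loss estimate with a hypothetical data-quality multiplier.

    n_params : number of model parameters
    n_tokens : number of raw training tokens
    quality  : assumed scalar multiplier turning raw tokens into
               "effective" tokens (1.0 recovers the conventional form)
    E, A, B, alpha, beta : placeholder constants, not fitted values
    """
    if n_params <= 0 or n_tokens <= 0 or quality <= 0:
        raise ValueError("n_params, n_tokens and quality must be positive")
    effective_tokens = quality * n_tokens
    return E + A / n_params ** alpha + B / effective_tokens ** beta

# Same 25M-parameter model and raw token budget, two assumed quality levels.
baseline = predicted_loss(25e6, 1e9, quality=1.0)
higher_quality = predicted_loss(25e6, 1e9, quality=2.0)
print(baseline > higher_quality)
```

Under this assumed form, doubling the quality multiplier has the same effect on the estimate as doubling the raw token count, which is one simple way to read "effective training tokens".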
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
