Metadata Conditioning Accelerates Language Model Pre-training

Tianyu Gao; Alexander Wettig; Luxi He; Yihe Dong; Sadhika Malladi; Danqi Chen

arXiv:2501.01956 · cs.CL · June 30, 2025

TL;DR

The paper introduces Metadata Conditioning then Cooldown (MeCo), a simple method that accelerates language model pre-training by showing the model metadata (such as source URLs) alongside the training text, then finishing with a cooldown phase on plain text. The result is faster training and more controllable outputs without extra computation.

Contribution

MeCo is a novel, straightforward approach that incorporates metadata during pre-training and allows for model steering, improving efficiency and controllability of language models.

Findings

01

MeCo reduces pre-training data requirements by 33% at matched downstream performance (reported for a 1.6B model).

02

Models trained with MeCo match the downstream-task performance of standard pre-training.

03

MeCo enables steering outputs via metadata cues.

Abstract

The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like www.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the…
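
The two phases described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name build_example, the "URL: <host>" prefix template, and the 10% cooldown fraction are all assumptions chosen for illustration. It only shows the idea of prepending metadata early in training and dropping it during cooldown.

```python
from urllib.parse import urlparse


def build_example(text: str, url: str, step: int, total_steps: int,
                  cooldown_frac: float = 0.1) -> tuple[str, int]:
    """Return (training_string, prefix_chars) for one document.

    Hypothetical helper, not taken from the MeCo repository.
    prefix_chars is the length of the metadata prefix, so a caller
    could exclude those characters from the loss if desired.
    """
    # Assumed schedule: the last `cooldown_frac` of steps use plain text.
    cooldown_steps = round(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps

    if step >= cooldown_start:
        # Cooldown phase: standard text only, no metadata.
        return text, 0

    # Conditioning phase: prepend the host name as metadata.
    # Fall back to the raw string when the URL has no scheme.
    host = urlparse(url).netloc or url
    prefix = f"URL: {host}\n\n"  # assumed template
    return prefix + text, len(prefix)


if __name__ == "__main__":
    doc = "Paris is the capital of France."
    url = "https://www.wikipedia.org/wiki/Paris"
    print(build_example(doc, url, step=0, total_steps=1000))
    print(build_example(doc, url, step=950, total_steps=1000))
```

Under the same assumed template, placing a similar metadata prefix in front of a prompt at inference time would be one way to exercise the steering behavior listed under Findings.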

Code & Models

Repositories

princeton-pli/meco
PyTorch · Official

Videos

Metadata Conditioning Accelerates Language Model Pre-training · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies