DoGE: Domain Reweighting with Generalization Estimation
Simin Fan, Matteo Pagliardini, Martin Jaggi

TL;DR
This paper introduces DoGE, a principled method for reweighting data domains during pretraining of large language models, leading to improved generalization across various tasks and out-of-domain data.
Contribution
DoGE presents a bi-level optimization approach to learn domain sampling weights, enhancing model generalization without heuristics.
Findings
Improves perplexity and few-shot reasoning accuracy on the SlimPajama dataset
Effectively identifies inter-domain dependencies for out-of-domain generalization
Achieves better test perplexity on unseen target domains
Abstract
The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite their importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm; (ii) training a larger base model by sampling training domains according to the learned domain weights. In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model gets better perplexity and few-shot reasoning…
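The two-stage process described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' algorithm: the multiplicative (exponentiated) update, the `step_size` parameter, the per-domain `scores`, and both function names are hypothetical stand-ins for whatever the bi-level optimization in stage (i) actually computes. Only the final step, sampling training domains in proportion to the learned weights, follows directly from the description of stage (ii).

```python
import math
import random


def update_domain_weights(weights, scores, step_size=1.0):
    """Assumed update rule: scale each weight by exp(step_size * score),
    then renormalize so the weights remain a probability distribution."""
    unnormalized = [w * math.exp(step_size * s) for w, s in zip(weights, scores)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def sample_domains(weights, batch_size, rng):
    """Stage (ii): pick a domain index for each example in a batch,
    with probability proportional to the domain weights."""
    return rng.choices(range(len(weights)), weights=weights, k=batch_size)


# Hypothetical setup: three domains starting from uniform weights.
weights = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical per-domain scores, standing in for a proxy model's estimate
# of how much each domain helps generalization (higher is better).
scores = [0.5, 0.0, -0.5]

# Stage (i), reduced to a single illustrative update step.
weights = update_domain_weights(weights, scores)

# Stage (ii): draw the source domain for each example in a batch of 8.
rng = random.Random(0)
batch_domains = sample_domains(weights, batch_size=8, rng=rng)
```

In this sketch a domain with a higher score ends up with a larger sampling probability, so it contributes more examples to each training batch of the base model. Renormalizing after every update keeps the weights summing to one, which is what lets them be used directly as sampling probabilities.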
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Balanced Selection
