DoGE: Domain Reweighting with Generalization Estimation

Simin Fan; Matteo Pagliardini; Martin Jaggi

arXiv:2310.15393 · cs.LG · February 6, 2024 · 2 cites

TL;DR

This paper introduces DoGE, a principled method for reweighting data domains during pretraining of large language models, leading to improved generalization across various tasks and out-of-domain data.

Contribution

DoGE uses a bi-level optimization approach to learn domain sampling weights, improving model generalization without relying on heuristics.
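To make "learning domain sampling weights" concrete, here is a minimal sketch of one generic way to update a probability vector over domains: an exponentiated-gradient (multiplicative-weights) step followed by renormalization. This is an illustrative assumption, not the update rule from the paper, which this summary does not spell out. The function name `update_domain_weights`, the `scores` input, and the `step_size` value are all hypothetical.

```python
import math

def update_domain_weights(weights, scores, step_size=0.1):
    """One exponentiated-gradient step on the probability simplex.

    weights: current sampling probabilities, one per domain (sum to 1).
    scores:  hypothetical per-domain usefulness estimates; a higher score
             means the domain should be sampled more often.
    NOTE: a generic illustration, not the DoGE update from the paper.
    """
    # Scale each weight multiplicatively by exp(step_size * score).
    unnormalized = [w * math.exp(step_size * s) for w, s in zip(weights, scores)]
    # Renormalize so the result is again a probability distribution.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

weights = [0.25, 0.25, 0.25, 0.25]   # start from uniform sampling
scores = [1.0, 0.0, -1.0, 0.0]       # made-up values for illustration
new_weights = update_domain_weights(weights, scores)
```

The multiplicative form keeps every weight strictly positive and the renormalization keeps the vector on the simplex, so the result can be used directly as sampling probabilities.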

Findings

01. Improves perplexity and reasoning accuracy on the SlimPajama dataset

02. Effectively identifies inter-domain dependencies for out-of-domain generalization

03. Achieves better test perplexity on unseen target domains

Abstract

The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite their importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm, and (ii) training a larger base model by sampling training domains according to the learned domain weights. In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model gets better perplexity and few-shot reasoning…
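Stage (ii) of the abstract, sampling training domains according to learned weights, can be sketched as follows. This is a minimal illustration using only the Python standard library. The domain names, the weight values, and the helper `sample_domains` are hypothetical and are not taken from the paper.

```python
import random

def sample_domains(domain_weights, batch_size, seed=0):
    """Draw one domain label per training example, proportional to its weight.

    domain_weights: mapping from domain name to sampling probability.
    Returns a list of domain names of length batch_size.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = list(domain_weights)
    probs = [domain_weights[name] for name in names]
    return rng.choices(names, weights=probs, k=batch_size)

# Hypothetical learned weights; real values would come from stage (i).
domain_weights = {"web": 0.5, "code": 0.3, "books": 0.2}
batch = sample_domains(domain_weights, batch_size=1000)
counts = {name: batch.count(name) for name in domain_weights}
```

In a real pipeline each sampled label would select which domain's data loader supplies the next example, so over many batches the share of each domain approaches its weight.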

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Balanced Selection