Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Mozhi Zhang; Howe Tissue; Lu Wang; Xipeng Qiu

arXiv:2506.10952 · cs.CL · June 13, 2025

Open Access

TL;DR

Domain2Vec decomposes a dataset into a linear combination of meta-domains, producing a domain vector that can be used to identify the optimal data mixture for language model pretraining without additional training runs, reducing compute and improving downstream performance.

Contribution

It introduces the concept of meta-domains and a domain vector decomposition approach for training-free data mixture selection in language model pretraining.

Findings

1. Achieves the same validation loss on Pile-CC with 48.5% less computation.

2. Improves downstream task performance by 2.83% on average under the same compute budget.

3. Integrates with previous data mixture optimization methods, improving their scalability.

Abstract

We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain…
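The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a meta-domain classifier has already produced per-document probabilities, takes a dataset's domain vector to be the average of those probabilities, and uses squared Euclidean distance with a coarse grid search over mixture weights as a stand-in for whatever alignment measure and search procedure the paper actually uses. All function names (`domain_vector`, `mix`, `best_mixture`, and so on) are hypothetical.

```python
from itertools import product


def domain_vector(doc_probs):
    """Average per-document meta-domain probabilities into one domain vector.

    doc_probs: list of rows, one per document, each a distribution over
    meta-domains (assumed to come from some classifier, not shown here).
    """
    n = len(doc_probs)
    m = len(doc_probs[0])
    return [sum(row[j] for row in doc_probs) / n for j in range(m)]


def mix(weights, vectors):
    """Domain vector of a mixture: weighted sum of the datasets' vectors."""
    m = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of alignment measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simplex_grid(k, steps):
    """Yield weight vectors of length k with entries in multiples of 1/steps
    that sum to 1."""
    for counts in product(range(steps + 1), repeat=k):
        if sum(counts) == steps:
            yield [c / steps for c in counts]


def best_mixture(train_vectors, val_vector, steps=10):
    """Pick the candidate mixture whose domain vector is closest to the
    validation set's domain vector. No model is trained at any point."""
    return min(
        simplex_grid(len(train_vectors), steps),
        key=lambda w: sq_dist(mix(w, train_vectors), val_vector),
    )


if __name__ == "__main__":
    # Toy numbers, invented for illustration: 2 training sets, 3 meta-domains.
    train_vectors = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
    val_vector = [0.5, 0.15, 0.35]
    print(best_mixture(train_vectors, val_vector))
```

In this toy case an equal mix of the two training sets reproduces the validation vector, so the search returns `[0.5, 0.5]`. The grid search grows quickly with the number of datasets; it is used here only because it is easy to read.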

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training