Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu

TL;DR
Domain2Vec is a new method that decomposes datasets into meta-domains to identify optimal data mixtures for language model pretraining without additional training, improving efficiency and performance.
Contribution
It introduces the concept of meta-domains and a domain vector decomposition approach for training-free data mixture selection in language model pretraining.
Findings
Achieves same validation loss with 48.5% less computation on Pile-CC.
Improves downstream task performance by 2.83% on average under same compute budget.
Enhances scalability of data mixture optimization methods.
Abstract
We introduce~\textsc{Domain2Vec}, a novel approach that decomposes any dataset into a linear combination of several \emph{meta-domains}, a new concept designed to capture the key underlying features of datasets. \textsc{Domain2Vec} maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the \emph{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA), which suggests that when the data distributions of the training set and the validation set are better aligned, a lower validation loss is achieved. Moreover, \textsc{Domain2vec} can be seamlessly integrated into previous works to model the relationship between domain…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
MethodsSparse Evolutionary Training
