Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Shengrui Li; Fei Zhao; Kaiyan Zhao; Jieying Ye; Haifeng Liu; Fangcheng Shi; Zheyong Xie; Yao Hu; Shaosheng Cao

arXiv:2602.00747 · cs.CL · May 18, 2026

TL;DR

DeMix introduces a model merging-based framework that decouples data mixture search from training, enabling efficient discovery of optimal data ratios for large language model pre-training.

Contribution

The paper proposes a model-merging approach that predicts optimal data mixtures, reducing search cost and improving performance in LLM pre-training.

Findings

01. DeMix achieves higher benchmark performance with lower search cost.

02. It enables evaluation of unlimited data mixtures without additional training.

03. The release of DeMix Corpora supports open research in data mixture discovery.

Abstract

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better…
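
The idea of scoring mixtures through merged models rather than freshly trained ones can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "weighted model merging" means a linear average of component parameters with the normalized mixture ratios as weights, and the names `merge_models`, `search_mixture`, and the dict-of-lists parameter format are hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: each "model" is a dict mapping a parameter
# name to a flat list of floats. Real checkpoints would hold tensors.
Params = Dict[str, List[float]]


def merge_models(components: Sequence[Params], ratios: Sequence[float]) -> Params:
    """Weighted average of component parameters (an assumed merging rule)."""
    if len(components) != len(ratios):
        raise ValueError("need one ratio per component model")
    total = sum(ratios)
    if total <= 0 or any(r < 0 for r in ratios):
        raise ValueError("ratios must be non-negative with a positive sum")
    weights = [r / total for r in ratios]  # normalize so the weights sum to 1
    merged: Params = {}
    for name in components[0]:
        size = len(components[0][name])
        merged[name] = [
            sum(w * comp[name][i] for w, comp in zip(weights, components))
            for i in range(size)
        ]
    return merged


def search_mixture(
    components: Sequence[Params],
    candidate_ratios: Sequence[Sequence[float]],
    score: Callable[[Params], float],
) -> Sequence[float]:
    """Return the candidate whose merged proxy scores highest.

    No training happens here: each candidate costs one merge and one
    evaluation, which is the decoupling the abstract describes.
    """
    return max(candidate_ratios, key=lambda r: score(merge_models(components, r)))


# Toy component models, standing in for models trained on two candidate datasets.
general_model = {"w": [0.0, 2.0]}
math_model = {"w": [4.0, 0.0]}

proxy = merge_models([general_model, math_model], [3, 1])
print(proxy)  # {'w': [1.0, 1.5]}
```

In this sketch the component models are trained once, and every additional candidate ratio adds only a merge and an evaluation pass. The `score` callable stands in for whatever benchmark evaluation is used to rank the merged proxies.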

Code & Models

Repositories

Lucius-lsr/DeMix
github

Datasets

lucius1022/DeMix_Corpora
dataset · 24k downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification