TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Yifan Wang; Binbin Liu; Fengze Liu; Yuanfan Guo; Jiyao Deng; Xuecheng Wu; Weidong Zhou; Xiaohuan Zhou; Taifeng Wang

arXiv:2508.17677·cs.LG·August 26, 2025

TL;DR

TiKMiX introduces a dynamic data mixing method for language model pre-training that adapts to the model's evolving preferences, improving performance and efficiency over static strategies.

Contribution

The paper proposes TiKMiX, a novel approach that dynamically adjusts data mixtures using Group Influence, optimizing training efficiency and model performance.

Findings

01. TiKMiX-D outperforms state-of-the-art methods like REGMIX with less computation.

02. TiKMiX-M achieves an average 2% performance gain across benchmarks.

03. Model data preferences evolve during training, and dynamic adjustment improves results.

Abstract

The data mixture used in the pre-training of a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, as the model's learning preferences for various data domains shift dynamically throughout training. Crucially, observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric enables the formulation of the data mixing problem as a search for an optimal, influence-maximizing distribution. We solve this via two approaches: TiKMiX-D for direct optimization, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained…
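The abstract frames data mixing as a search for an influence-maximizing distribution over domains. The following is a minimal sketch of that idea only, not the paper's formulation. It assumes that each domain already has a scalar influence score (the `influence` list, with made-up values in the usage note), that the objective is linear in the mixture weights, and that an entropy term with a hypothetical `temperature` parameter keeps the mixture diverse. Under those assumptions the maximizer of sum_i w_i * s_i + temperature * H(w) over the probability simplex is the softmax of the scaled scores. The function name is illustrative.

```python
import math


def influence_maximizing_mixture(influence, temperature=1.0):
    """Return mixture weights maximizing sum_i w_i*s_i + temperature*H(w).

    Assumption: `influence` holds one precomputed score per data domain.
    The entropy-regularized linear objective is a stand-in chosen for
    illustration; it is not the optimization problem from the paper.
    The closed-form maximizer over the simplex is softmax(s / temperature).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not influence:
        raise ValueError("need at least one domain score")
    scaled = [s / temperature for s in influence]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A dynamic schedule would recompute the scores at each training stage and call the function again, for example `influence_maximizing_mixture([0.2, 0.5, 0.1], temperature=0.1)` with hypothetical scores for three domains. A lower temperature concentrates weight on the highest-scoring domain, while a higher one pulls the mixture toward uniform.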

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 5

Strengths

Quality:
- The proposed TiKMiX method outperforms existing mixes and requires less compute.
- Empirical analysis shows that influence scores change throughout training, which motivates the use of dynamic mixing.
- Method is theoretically motivated, and the optimization problem in Equation 10 is well justified and incorporates various desirable aspects of data (e.g., optimizing the mix while preserving data diversity).

Significance:
- While mixing algorithms that use iterative updates have been

Weaknesses

Originality:
- It would be interesting to compare more to existing online approaches; you mention ODM, but here are a few others:
  - Aioli (https://arxiv.org/abs/2411.05735)
  - ADO (https://arxiv.org/abs/2410.11820)
- These approaches use iterative updates, but that is more of an implementation choice (e.g., using a mirror descent / multiplicative weights style solver versus an exact solver). It would be interesting to see how Aioli---namely, doing some online "re-exploration" of th

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. Strong motivation and practical relevance: The observation that model data preferences evolve during training addresses a real limitation of static mixing strategies used in production LLMs, making this work highly relevant to practitioners.
2. Computational efficiency: Achieving performance gains with only 20% of REGMIX's computational cost by avoiding proxy model training is a significant practical advantage.
3. Comprehensive evaluation: Testing on both in-domain and out-of-domain benchma

Weaknesses

1. Poor writing quality:
   * Mathematical notation is inconsistent (e.g., θ* used for both original and perturbed optima in Equations 1-2).
   * Critical Figure 3 showing evolving domain influences is never referenced or explained in the main text.
   * Dense formula presentation (Section 3.1-3.2) lacks intuitive explanations of why these particular formulations are principled.
   * Minor errors: "To more" (line 308), symbol confusion between Equations 2 and 9, missing period at L426.
2. Un

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Motivations and related literature are clearly presented.
2. Clean measure of end performance with a clear split into training tasks and test tasks; not all papers do this.

Weaknesses

1. Equation (8) can be written as the sum of the influence over the elements of the group. This contradicts your motivation at L.204, “a linear sum of influence score is insufficient”, for introducing group influence in the first place.
2. Why do you need to fit a boosted tree? Once individual group influences are computed a single time, you can then evaluate the influence of any mixture without additional gradient computations, no?

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The proposed method demonstrates improvements in data efficiency and downstream performance across various benchmarks, compared with multiple static data mixing methods.

Weaknesses

1. **The proposed Group Influence is not novel.** [1] already gave a comprehensive analysis of the group effect of data influence back in 2019.
2. **Lack of comparison to existing dynamic data mixing strategies**, e.g., DGA [2], which applies a similar strategy but uses a first-order gradient alignment as the data influence assessment. Can you provide a comparison of both performance and efficiency between the existing dynamic data mixing methods and your proposed ones?
3. **Lack of efficiency a
