TL;DR
TiKMiX introduces a dynamic data mixing method for language model pre-training that adapts to the model's evolving preferences, improving performance and efficiency over static strategies.
Contribution
The paper proposes TiKMiX, a novel approach that dynamically adjusts data mixtures using Group Influence, optimizing training efficiency and model performance.
Findings
TiKMiX-D outperforms state-of-the-art methods like REGMIX with less computation.
TiKMiX-M achieves an average 2% performance gain across benchmarks.
Model data preferences evolve during training, and dynamic adjustment improves results.
Abstract
The data mixture used in the pre-training of a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, as the model's learning preferences for various data domains shift dynamically throughout training. Crucially, observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric enables the formulation of the data mixing problem as a search for an optimal, influence-maximizing distribution. We solve this via two approaches: TiKMiX-D for direct optimization, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained…
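As a rough illustration of what a search for an influence-maximizing distribution can look like, the sketch below maximizes a linear influence objective plus an entropy bonus over the probability simplex, a problem whose solution is a softmax. This is not the paper's formulation: the linear objective, the entropy regularizer, the `temperature` parameter, the function name, and the influence values are all assumptions made for illustration.

```python
import math

def entropy_regularized_mixture(influence, temperature=1.0):
    """Return the mixture w that maximizes sum(w_i * influence_i) + temperature * H(w)
    over the probability simplex, where H is the Shannon entropy.
    The maximizer is softmax(influence / temperature).
    Hypothetical helper, not taken from the paper."""
    peak = max(influence)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in influence]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up per-domain influence scores for four data domains.
domain_influence = [0.8, 0.1, -0.3, 0.4]

# A lower temperature concentrates weight on high-influence domains;
# a higher temperature keeps the mixture closer to uniform (more diverse).
sharp_mixture = entropy_regularized_mixture(domain_influence, temperature=0.2)
smooth_mixture = entropy_regularized_mixture(domain_influence, temperature=5.0)
```

In a dynamic setting, one would recompute the influence scores at intervals during training and re-solve for the mixture each time; how often to do so is left open here.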
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
Quality: - The proposed TiKMiX method outperforms existing mixes and requires less compute. - Empirical analysis shows that influence scores change throughout training, which motivates the use of dynamic mixing. - Method is theoretically motivated, and the optimization problem in equation 10 is well-justified and incorporates various desirable aspects of data (e.g., optimizing the mix while preserving data diversity). Significance: - While mixing algorithms that use iterative updates have been
Originality: - It would be interesting to compare more to existing online approaches; you mention ODM, but here are a few others: - Aioli (https://arxiv.org/abs/2411.05735) - ADO (https://arxiv.org/abs/2410.11820) - These approaches use iterative updates, but that is more of an implementation choice (e.g., using a mirror descent / multiplicative weights style solver versus an exact solver). It would be interesting to see how Aioli---namely, doing some online "re-exploration" of th
1. Strong motivation and practical relevance: The observation that model data preferences evolve during training addresses a real limitation of static mixing strategies used in production LLMs, making this work highly relevant to practitioners. 2. Computational efficiency: Achieving performance gains with only 20% of REGMIX's computational cost by avoiding proxy model training is a significant practical advantage. 3. Comprehensive evaluation: Testing on both in-domain and out-of-domain benchma
1. Poor writing quality: * Mathematical notation is inconsistent (e.g., θ* used for both original and perturbed optima in Equations 1-2) * Critical Figure 3 showing evolving domain influences is never referenced or explained in the main text. * Dense formula presentation (Section 3.1-3.2) lacks intuitive explanations of why these particular formulations are principled. * Minor errors: "To more" (line 308), symbol confusion between Equations 2 and 9. L426: no period symbol. 2. Un
1. Motivations and related literature are clearly presented. 2. Clean measure of end performance with a clear split into training tasks and the test tasks, not all papers do this.
1. Equation (8) can be written as the sum of the influences over the elements of the group. This contradicts your motivation at L.204, “a linear sum of influence score is insufficient”, for introducing group influence in the first place. 2. Why do you need to fit a boosted tree? Once the individual group influences have been computed a single time, you can evaluate the influence of any mixture without additional gradient computations, no?
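The reviewer's second point can be made concrete with a small sketch. It assumes, as the reviewer argues, that the influence of a mixture is a linear combination of per-domain influences; whether that holds for the paper's Group Influence is exactly what is in dispute. The function name, scores, and candidate mixtures below are hypothetical.

```python
def mixture_influence(weights, per_domain_influence):
    """Influence of a mixture under the linearity assumption:
    a weighted sum of per-domain influences (a dot product)."""
    return sum(w * s for w, s in zip(weights, per_domain_influence))

# Per-domain influences, computed once (made-up values).
per_domain_influence = [0.8, 0.1, -0.3, 0.4]

# Candidate mixtures can then be scored without any further gradient computation.
candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "skewed": [0.55, 0.10, 0.05, 0.30],
}
scores = {
    name: mixture_influence(w, per_domain_influence)
    for name, w in candidates.items()
}
best_name = max(scores, key=scores.get)
```

If the linearity assumption holds, a learned regressor over mixtures adds nothing beyond this dot product; a regressor is only useful when mixture influence departs from the linear sum.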
1. The paper is well-written and easy to follow; 2. The proposed method demonstrates improvements in data efficiency and downstream performance across various benchmarks, compared with multiple static data mixing methods.
1. **The proposed Group Influence is not novel.** [1] already gave a comprehensive analysis of the group effect of data influence back in 2019. 2. **Lack of comparison to existing dynamic data mixing strategies**, e.g., DGA [2], which applies a similar strategy but uses first-order gradient alignment as the data influence assessment. Can you provide a comparison of both performance and efficiency between the existing dynamic data mixing methods and your proposed ones? 3. **Lack of efficiency a
