Estimating the Effects of Sample Training Orders for Large Language Models without Retraining
Hao Yang, Haoxuan Li, Mengyue Yang, Xu Chen, Mingming Gong

TL;DR
This paper introduces a retraining-free framework to estimate the effects of training sample order on large language models, enabling efficient analysis of training dynamics, curriculum design, and memorization without costly retraining.
Contribution
We develop a novel method that approximates model updates using Taylor expansions and random projections, allowing for efficient estimation of model performance under different training orders without retraining.
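To make the Taylor-expansion idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a hypothetical least-squares toy loss and a plain gradient step rather than Adam, and all names (`grad`, `hessian`, `theta_ref`, `theta_new`) are illustrative. The gradient of a batch at a new parameter point is estimated from quantities stored at a reference point, so the batch is not re-evaluated there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy batch: least-squares loss L(theta) = 0.5 * mean((X @ theta - y)**2).
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)

def grad(theta):
    """Exact gradient of the toy loss at theta."""
    return X.T @ (X @ theta - y) / len(y)

def hessian():
    """Exact Hessian of the toy loss (constant, because the loss is quadratic)."""
    return X.T @ X / len(y)

# Parameters where this batch was seen in the reference run (assumed stored).
theta_ref = rng.normal(size=5)
# Parameters where the same batch would be seen under a different sample order.
theta_new = theta_ref + 0.1 * rng.normal(size=5)

g_ref = grad(theta_ref)
H_ref = hessian()

# First-order Taylor expansion of the gradient around theta_ref:
#   g(theta_new) ~= g(theta_ref) + H(theta_ref) @ (theta_new - theta_ref)
g_est = g_ref + H_ref @ (theta_new - theta_ref)
g_true = grad(theta_new)

# Error of the Taylor estimate versus simply reusing the reference gradient.
taylor_error = np.linalg.norm(g_est - g_true)
naive_error = np.linalg.norm(g_ref - g_true)
print(f"taylor error: {taylor_error:.2e}, naive reuse error: {naive_error:.2e}")
```

On this quadratic toy loss the expansion is exact up to floating-point error. For a real LLM loss it is only an approximation, and its error grows as the new trajectory moves away from the reference trajectory.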
Findings
The framework accurately reproduces true model performances.
It enables effective curriculum optimization for LLMs.
It provides insights into memorization and generalization effects.
Abstract
The order of training samples plays a crucial role in large language models (LLMs), significantly impacting both their external performance and internal learning dynamics. Traditional methods for investigating this effect generally require retraining the model with various sample orders, which is computationally infeasible for LLMs. In this work, we improve on traditional methods by designing a retraining-free framework. By approximating Adam optimizer updates with first- and second-order Taylor expansions and utilizing random projection methods to store intermediate checkpoints, our framework can efficiently estimate model parameters for arbitrary training sample orders. Next, we apply our framework to two downstream research problems: (1) Training curriculum design for LLMs -- we build on our retraining-free framework to propose a novel curriculum learning strategy that augments curriculum…
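The random-projection step mentioned in the abstract can be illustrated with a generic Gaussian projection. This is a sketch under stated assumptions, not the paper's storage scheme: the dimensions `d` and `k`, the vectors `update` and `probe`, and the choice of a dense Gaussian matrix are all hypothetical. It shows that a vector compressed from `d` to `k` dimensions approximately keeps its norm and its inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4096, 512  # hypothetical full and compressed dimensions

# Gaussian random projection with entries N(0, 1/k), so that
# E[||R v||^2] = ||v||^2 for any fixed vector v.
R = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))

update = rng.normal(size=d)          # stand-in for a stored update vector
probe = update + rng.normal(size=d)  # a second vector, correlated with the first

# Only the k-dimensional sketches would need to be stored.
update_small = R @ update
probe_small = R @ probe

# How well are norms and inner products preserved after compression?
norm_ratio = np.linalg.norm(update_small) / np.linalg.norm(update)
dot_error = abs(update_small @ probe_small - update @ probe) / (
    np.linalg.norm(update) * np.linalg.norm(probe)
)
compression = d / k
print(f"norm ratio: {norm_ratio:.3f}, relative dot-product error: {dot_error:.3f}")
print(f"storage reduced by a factor of {compression:.0f}")
```

The distortion shrinks roughly as one over the square root of `k`, so the compressed dimension trades storage against estimation accuracy.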
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper's primary strength is its novel attempt to tackle the intractable problem of data ordering effects in LLM training. The proposed FUT framework is theoretically principled, grounded in Taylor expansions, and offers a new approach to estimate the effect of data curriculum. The empirical results, while limited in scale, demonstrate that the method can outperform a naive baseline in its estimation task and can be used to design curricula that are superior to common heuristics.
1. **Questionable Practicality and Scalability:** The paper's claim of "computational efficiency" is based on an amortized analysis that masks an extremely impractical workflow for large-scale LLM pre-training. The method requires a full, costly reference run and a subsequent pre-computation stage with $\mathcal{O}(T^2 \cdot C)$ complexity. The experiments are conducted on a maximum of $T=256$ batches, a scale that is trivial compared to the millions of steps in a typical pre-training run. At th
- **Problem Significance:** The paper tackles a well-defined, important, and challenging problem. Developing a retraining-free method to analyze sample order would be a major contribution to understanding LLM training dynamics, curriculum learning, and interpretability.
- **Novel Problem Formulation:** The paper provides a clear problem formulation in Section 2, formally defining the goal of estimating the parameter trajectory for a new sample order based on a reference trajectory.
- **Scalability Issue:** The framework's core premise relies on a precomputation and storage step that is $\mathcal{O}(T^2)$ in $T$, the number of training batches. This complexity is fundamentally non-scalable and practically infeasible for any realistic LLM training scenario, where $T$ is often in the hundreds of thousands or millions. The paper's own data in Appendix D.2 (Table 4) confirms this: storing the *uncompressed* update terms for a small 200M model with only $T=256$ batches already r
1. The intuition behind this paper is easy to understand.
2. The illustration is clear and helpful.
1. The improvement is limited. Table 2 shows that FUT and FUT++ improve only slightly over the baselines. Considering the time cost of FUT, I am not sure whether this algorithm should be used in practice.
2. The method is tested in limited settings. I doubt whether the method is robust across different model sizes, model architectures, and dataset distributions. The effectiveness of the algorithm should be tested in a wider range of scenarios.
3. In Table 2, the author is encouraged to li
* The paper addresses a critical and practical problem in LLM training. Evaluating the impact of data ordering without incurring the prohibitive cost of full retraining is a highly valuable research direction.
* The proposed FUT framework is technically well-grounded. Deriving estimations from a mathematical approximation of the optimization dynamics via Taylor expansions provides a clear and interpretable foundation, moving beyond purely heuristic methods.
* The framework's methodology, p
* The empirical evaluation is solely based on perplexity (PPL). It remains unsubstantiated whether a PPL-optimal training order, as identified by FUT, translates to improved performance on established downstream benchmarks (e.g., MMLU, GSM8K).
* The experimental comparison is weak. FUT is primarily compared against full retraining and a simple random baseline. The paper omits a crucial comparison with more relevant and advanced methods for retraining-free analysis, such as modern, Hessian-f
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
