Capturing the Temporal Dependence of Training Data Influence
Jiachen T. Wang, Dawn Song, James Zou, Prateek Mittal, Ruoxi Jia

TL;DR
This paper introduces a trajectory-specific influence measure for training data that accounts for data order and training dynamics, providing new insights into data influence during model training.
Contribution
It formalizes trajectory-specific leave-one-out influence and proposes data value embedding for efficient approximation, capturing training data ordering effects.
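A hedged way to write down the quantity described above, using notation assumed here for illustration rather than quoted from the paper: let $\theta_T$ be the final parameters of the original run over mini-batches $\mathcal{B}_0, \dots, \mathcal{B}_{T-1}$, and let $\theta'_T$ be the final parameters of a run that is identical except that the point $z^*$ is removed from the batch at iteration $t_s$. The trajectory-specific LOO influence on a validation point $z^{(\mathrm{val})}$ can then be sketched as

\[
\mathrm{TSLOO}^{(t_s)}\bigl(z^*; z^{(\mathrm{val})}\bigr) := \ell\bigl(\theta'_T, z^{(\mathrm{val})}\bigr) - \ell\bigl(\theta_T, z^{(\mathrm{val})}\bigr).
\]

Under this sign convention (an assumption), a positive value means that removing $z^*$ at step $t_s$ raises the validation loss, i.e. the point was helpful at that step. The same point can receive different values at different $t_s$, which is the ordering effect the contribution refers to.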
Findings
Data influence varies across training phases.
Early and late training data have greater impact.
The data value embedding captures training dynamics and data influence.
Abstract
Traditional data influence estimation methods, such as the influence function, assume that learning algorithms are permutation-invariant with respect to training data. However, modern training paradigms, especially for foundation models using stochastic algorithms and multi-stage curricula, are sensitive to data ordering, thus violating this assumption. This mismatch renders influence functions inadequate for answering a critical question in machine learning: How can we capture the dependence of data influence on the optimization trajectory during training? To address this gap, we formalize the concept of trajectory-specific leave-one-out (LOO) influence, which quantifies the impact of removing a data point from a specific iteration during training, accounting for the exact sequence of data encountered and the model's optimization trajectory. However, exactly evaluating the trajectory-specific…
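The idea of removing a data point from one specific iteration can be illustrated by brute-force retraining on a toy problem. The sketch below is not the paper's data value embedding method. It assumes a 1-D linear model, plain SGD with a sum-of-gradients update, a fixed batch order, and a squared-error loss, and all function and variable names are hypothetical.

```python
def sgd_final_params(data, order, lr=0.1, skip=None):
    """Run SGD over a fixed sequence of mini-batches and return the final (w, b).

    data  : list of (x, y) pairs for the toy model y ~ w * x + b
    order : list of mini-batches, each a list of indices into data
    skip  : optional (step, index); that point is dropped from that step only
    """
    w, b = 0.0, 0.0
    for step, batch in enumerate(order):
        if skip is not None and step == skip[0]:
            batch = [i for i in batch if i != skip[1]]
        grad_w = grad_b = 0.0
        for i in batch:
            x, y = data[i]
            err = w * x + b - y  # residual of the squared-error loss
            grad_w += err * x
            grad_b += err
        # Assumed update rule: learning rate times the SUM of per-example gradients.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def val_loss(params, val_point):
    """Squared-error loss of the linear model on a single validation point."""
    w, b = params
    x, y = val_point
    return 0.5 * (w * x + b - y) ** 2


def tsloo(data, order, step, index, val_point, lr=0.1):
    """Brute-force trajectory-specific LOO: retrain with one point removed at one step."""
    base = sgd_final_params(data, order, lr)
    removed = sgd_final_params(data, order, lr, skip=(step, index))
    return val_loss(removed, val_point) - val_loss(base, val_point)


# Toy data drawn from y = 2x + 1, and two epochs with the same batch order.
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
order = [[0, 1], [2, 3], [4, 5], [0, 1], [2, 3], [4, 5]]
val_point = (2.0, 5.0)

# Point 4 is seen at step 2 (first epoch) and step 5 (second epoch).
for step in (2, 5):
    print(f"step {step}: TSLOO of point 4 = {tsloo(data, order, step, 4, val_point):+.6f}")
```

Comparing the two printed values shows how the same training point can carry a different influence depending on when it is encountered. Retraining once per (point, step) pair is only feasible at toy scale, which is why an efficient approximation is needed in practice.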
Peer Reviews
Decision·ICLR 2025 Oral
The method introduces a novel concept by capturing data influence in a trajectory-specific manner rather than assuming permutation invariance, which is a common limitation in conventional influence estimation methods. It outlines assumptions and derives an approximation error bound, lending theoretical credibility to the approach. The approach is designed for computational efficiency and includes several techniques to reduce memory and computational cost. It enables identification of high-val
The method is explicitly tailored for SGD and is not readily applicable to other popular optimizers like Adam. Although using SGD as a proxy is discussed, this limitation restricts the method's applicability to a broader range of models. The evaluation with ground truth focuses on specific datasets and model types (e.g., MNIST, MLP) due to the computational cost, which may limit the generalizability of the findings. Several assumptions are made in this paper, such as model layer independency, l
Quantifying data influence is an important task and is crucial for active sample selection. However, the existing method, the influence function, ignores the order in which samples arrive, making it unsuitable for the vast majority of stochastic algorithms. This paper addresses an important research gap. The paper is mostly well written. The mathematical ideas and their intuitions are clearly presented and easy to understand. The discovery of stages of sample influence is potentially a significant c
Probably due to the lack of a dedicated related work section, it is not clear where the authors' contribution begins. For example, has TSLOO been studied before, or is it a novel concept proposed by the authors? In Section 2, the first paragraph seems to suggest that this is the authors' proposal. However, a later sentence says, "while the technique of unrolled differentiation Hara et al., 2019 explicitly aims to approximate TSLOO ..." It seems the idea of TSLOO already exists in earlier work.
1. The motivation of this paper is clear: modern training paradigms, especially for foundation models using stochastic algorithms and non-convergent, multi-stage curricula, are sensitive to data ordering.
2. It introduces a computationally efficient embedding method, making it feasible to apply influence estimation to large-scale models without retraining.
3. Empirical results demonstrate high fidelity in data influence estimation and reveal nuanced phases in training that inform efficient data se
1. Limited Real-World Validation: While the paper’s experiments demonstrate high fidelity on small datasets like MNIST and reduced subsets of larger datasets (e.g., 1% of the Pile), the method may not have been fully validated on more challenging, real-world datasets. This leaves questions about its robustness and scalability when applied to diverse, large-scale data used in production.
2. Potential Overhead in Implementation: While the method reduces some computational costs, it still requires
Taxonomy
Topics: Scheduling and Timetabling Solutions
