Accumulative SGD Influence Estimation for Data Attribution
Yunxiao Shi, Shuo Yang, Yixin Su, Rui Zhang, Min Xu

TL;DR
This paper introduces ACC-SGD-IE, a trajectory-aware influence estimator for data attribution. It propagates the leave-one-out perturbation across training steps instead of summing independent per-epoch surrogates, which reduces estimation error relative to standard SGD-IE, especially over long training runs.
Contribution
The paper presents ACC-SGD-IE, an influence estimation method that accounts for the cross-epoch compounding that standard SGD-IE ignores, giving tighter error bounds and more accurate per-sample influence estimates.
Findings
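In smooth, strongly convex settings, ACC-SGD-IE achieves geometric error contraction.
In smooth non-convex regimes, it tightens the error bounds, and larger mini-batches further reduce the constants.
Empirically, on Adult, 20 Newsgroups, and MNIST, it flags noisy samples more reliably, and models retrained on the cleaned data perform better.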
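As a rough illustration of where a geometric contraction can come from in the smooth, strongly convex case, consider a generic linearized error recursion. This is a standard textbook argument, not the paper's theorem; the symbols $e_t$ (estimation error at step $t$), $H_t$ (Hessian of the step-$t$ objective), $r_t$ (per-step residual), $\eta$ (step size), $\mu$ (strong-convexity constant), and $L$ (smoothness constant) are assumptions introduced here.

```latex
% Assumed setting: \mu I \preceq H_t \preceq L I and 0 < \eta \le 1/L.
\begin{aligned}
e_{t+1} &= (I - \eta H_t)\, e_t + \eta\, r_t, \\
\lVert I - \eta H_t \rVert_2 &\le 1 - \eta\mu < 1, \\
\lVert e_{t+1} \rVert_2 &\le (1 - \eta\mu)\, \lVert e_t \rVert_2 + \eta\, \lVert r_t \rVert_2 .
\end{aligned}
```

Under these assumptions, earlier errors are damped by a factor of at most $1 - \eta\mu$ at every step, which is what "geometric contraction" refers to. Without strong convexity, $I - \eta H_t$ need not be a contraction, so only a bound on the accumulated error can be expected.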
Abstract
Modern data-centric AI needs precise per-sample influence. Standard SGD-IE approximates leave-one-out effects by summing per-epoch surrogates and ignores cross-epoch compounding, which misranks critical examples. We propose ACC-SGD-IE, a trajectory-aware estimator that propagates the leave-one-out perturbation across training and updates an accumulative influence state at each step. In smooth strongly convex settings it achieves geometric error contraction and, in smooth non-convex regimes, it tightens error bounds; larger mini-batches further reduce constants. Empirically, on Adult, 20 Newsgroups, and MNIST under clean and corrupted data and both convex and non-convex training, ACC-SGD-IE yields more accurate influence estimates, especially over long epochs. For downstream data cleansing it more reliably flags noisy samples, producing models trained on ACC-SGD-IE cleaned data that…
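The idea of propagating a leave-one-out perturbation along the training trajectory can be sketched on a toy problem. The following is a minimal sketch, not the authors' algorithm: it assumes a one-parameter least-squares model, a fixed mini-batch schedule shared by the factual and counterfactual runs, and a counterfactual step that keeps the original batch size as its normaliser. All function names (`grad`, `make_batches`, `sgd`, `accumulated_influence`) are hypothetical. For a quadratic loss the linearized recursion is exact, so the accumulated state can be checked against actual retraining without the sample.

```python
import random

def grad(w, x, y):
    # Gradient of the squared loss 0.5 * (w*x - y)**2 with respect to w.
    return (w * x - y) * x

def make_batches(n, batch_size, epochs, seed=0):
    # Fixed mini-batch schedule, reused by every run so they are comparable.
    rng = random.Random(seed)
    schedule = []
    for _ in range(epochs):
        idx = list(range(n))
        rng.shuffle(idx)
        for s in range(0, n, batch_size):
            schedule.append(idx[s:s + batch_size])
    return schedule

def sgd(xs, ys, schedule, lr, skip=None):
    # Plain SGD from w = 0. If `skip` is set, that sample's gradient is dropped
    # (leave-one-out) while the step is still divided by the original batch size.
    w = 0.0
    for batch in schedule:
        g = sum(grad(w, xs[i], ys[i]) for i in batch if i != skip)
        w -= lr * g / len(batch)
    return w

def accumulated_influence(xs, ys, schedule, lr, j):
    # Carry delta = (w trained without j) - (w trained with j) along the trajectory.
    w, delta = 0.0, 0.0
    for batch in schedule:
        b = len(batch)
        # Curvature of the samples that remain in the batch once j is removed.
        h = sum(xs[i] ** 2 for i in batch if i != j)
        # New perturbation injected only at steps where j is actually used.
        inject = grad(w, xs[j], ys[j]) if j in batch else 0.0
        # Propagate the old perturbation, then add the new one.
        delta = (1.0 - lr * h / b) * delta + lr * inject / b
        # Advance the factual trajectory.
        w -= lr * sum(grad(w, xs[i], ys[i]) for i in batch) / b
    return delta

# Toy data: y is roughly 2*x plus noise (synthetic, for illustration only).
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(20)]
ys = [2.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]
schedule = make_batches(len(xs), batch_size=4, epochs=5)

estimate = accumulated_influence(xs, ys, schedule, lr=0.1, j=3)
retrained = sgd(xs, ys, schedule, lr=0.1, skip=3) - sgd(xs, ys, schedule, lr=0.1)
print(estimate, retrained)
```

The multiplicative term `(1.0 - lr * h / b) * delta` is what carries earlier perturbations forward into later steps and epochs; an estimator that drops it would treat each occurrence of the sample in isolation. In non-quadratic models the same recursion is only a first-order approximation, and the Hessian-vector product would replace the scalar `h`.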
