How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

Junwei Deng; Pingbang Hu; Suliang Jin; Hao Lu; Jiachen T. Wang; Shichang Zhang; Jiaqi W. Ma

arXiv:2605.18814 · cs.LG · May 20, 2026

TL;DR

This paper systematically analyzes error sources in trajectory-based data attribution, proposes remedies like AdamW-influence, and offers practical guidelines for reliable data influence estimation and selection.

Contribution

The paper provides the first comprehensive error analysis of trajectory-based data attribution, introduces an AdamW-specific influence estimate, and develops a unified, actionable framework for data selection.

Findings

01. AdamW-influence improves correlation by 10% to 300% across models.

02. Learning rate and trajectory length are identified as key factors affecting approximation error.

03. A K-step look-ahead framework is proposed for effective data selection.

Abstract

Trajectory-based data attribution methods estimate the influence of training samples on model predictions by unrolling the training trajectory. They are widely used in applications such as data selection, data valuation, and model diagnosis, but there is a lack of comprehensive error analysis of these methods, raising concerns about method faithfulness and hindering reliable deployment. In this work, we provide the first systematic analysis of error sources in trajectory-based data attribution, together with concrete remedies to mitigate them and practical guidelines for downstream use. We organize the total error into three categories: config-level, algorithm-level, and system-level. We make three contributions. First, we identify optimizer mismatch as the dominant config-level error: existing methods derive their attribution under the assumption of SGD, even for models trained with…
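To make the idea of attributing along a training trajectory concrete, here is a minimal sketch. It is not the paper's estimator: it assumes a scalar linear model with squared loss, trained with plain per-sample SGD, and it accumulates a TracIn-style first-order score (learning rate times the product of the training and test gradients) at every step. The names `grad` and `train_and_attribute` are hypothetical and chosen only for this illustration.

```python
def grad(w, x, y):
    """Gradient of the squared loss 0.5 * (w * x - y) ** 2 with respect to scalar w."""
    return (w * x - y) * x


def train_and_attribute(train, test, lr=0.1, epochs=20):
    """Run per-sample SGD and accumulate a first-order influence score per sample.

    Assumption for this sketch: the optimizer is plain SGD, so each step's
    effect on the test loss is approximated by lr * g_train * g_test.
    """
    w = 0.0
    scores = [0.0] * len(train)
    x_test, y_test = test
    for _ in range(epochs):
        for i, (x, y) in enumerate(train):
            g_train = grad(w, x, y)
            g_test = grad(w, x_test, y_test)
            # Positive contribution: this step is estimated to reduce the test loss.
            scores[i] += lr * g_train * g_test
            w -= lr * g_train  # the SGD update the score is derived from
    return w, scores


if __name__ == "__main__":
    # Two samples agree with the test point (y = 2x); one contradicts it.
    train = [(1.0, 2.0), (2.0, 4.0), (1.0, -2.0)]
    test = (1.0, 2.0)
    w, scores = train_and_attribute(train, test)
    print("final w:", w)
    print("influence scores:", scores)
```

In this toy setup the two samples consistent with the test point receive positive scores and the contradicting sample receives a negative one. Because the score formula is derived from the SGD update, it would not describe the actual parameter change if the model were trained with a different optimizer, which is the kind of optimizer mismatch the abstract refers to.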
