Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration
Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng

TL;DR
This paper introduces DDA (Debias and Denoise Attribution), a training data attribution (TDA) method for large language models that improves the accuracy of influence functions by accounting for fitting errors, yielding better data attribution and model interpretability.
Contribution
The paper proposes DDA, a new TDA method that enhances influence functions by removing bias and smoothing influence scores, improving attribution accuracy for large language models.
Findings
DDA achieves an average AUC of 91.64%.
DDA outperforms existing TDA methods.
DDA is effective across various models and data sources.
Abstract
The black-box nature of large language models (LLMs) poses challenges in interpreting results, impacting issues such as data intellectual property protection and hallucination tracing. Training data attribution (TDA) methods are considered effective solutions to address these challenges. Most recent TDA methods rely on influence functions, assuming the model achieves minimized empirical risk. However, achieving this criterion is difficult, and sourcing accuracy can be compromised by fitting errors during model training. In this paper, we introduce a novel TDA method called Debias and Denoise Attribution (DDA), which enhances influence functions by addressing fitting errors. Specifically, the debias strategy seeks to improve the performance of influence functions by eliminating the knowledge bias present in the base model before fine-tuning, while the denoise strategy aims to reduce…
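As a rough illustration of the debias idea described above, the sketch below scores a training example against a test example under both the base model and the fine-tuned model, then subtracts the base-model score. This is a minimal sketch under stated assumptions, not the paper's formulation: the model is a toy logistic regression, the inverse-Hessian term of a classical influence function is replaced by the identity (a plain gradient dot product), and the names `grad_similarity` and `debiased_score` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad(theta, x, y):
    """Gradient of the logistic loss for one example (x, y) w.r.t. theta."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(p - y) * xi for xi in x]


def grad_similarity(theta, train, test):
    """Dot product of the test and train loss gradients at theta.

    Assumption: the inverse Hessian used by influence functions is
    approximated by the identity matrix.
    """
    g_train = loss_grad(theta, *train)
    g_test = loss_grad(theta, *test)
    return sum(a * b for a, b in zip(g_test, g_train))


def debiased_score(theta_base, theta_ft, train, test):
    """Fine-tuned score minus base-model score (illustrative debias step).

    Assumption: subtracting the base-model score stands in for removing
    knowledge the model already had before fine-tuning.
    """
    return grad_similarity(theta_ft, train, test) - grad_similarity(
        theta_base, train, test
    )


if __name__ == "__main__":
    # Hypothetical parameters and examples, chosen only for illustration.
    theta_base = [0.0, 0.0]
    theta_ft = [1.5, -0.5]
    train_example = ([1.0, 0.0], 1.0)
    test_example = ([1.0, 0.2], 1.0)
    print(debiased_score(theta_base, theta_ft, train_example, test_example))
```

When the base and fine-tuned parameters are identical, the debiased score is zero, which matches the intuition that an example seen during fine-tuning should receive no credit if fine-tuning changed nothing.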
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
Methods: Balanced Selection
