Distilled Datamodel with Reverse Gradient Matching
Jingwen Ye, Ruonan Yu, Songhua Liu, Xinchao Wang

TL;DR
This paper presents a novel, efficient framework for assessing the impact of training data on large models by using a distilled synset and reverse gradient matching, significantly reducing computational costs.
Contribution
The authors introduce a new method combining offline data influence approximation with online evaluation to speed up leave-one-out data impact analysis.
Findings
Achieves data impact assessment accuracy comparable to that of retraining-based methods.
Significantly reduces computational time for data attribution tasks.
Effective in evaluating data quality and influence in large-scale models.
Abstract
The proliferation of large-scale AI models trained on extensive datasets has revolutionized machine learning. With these models taking on increasingly central roles in various applications, the need to understand their behavior and enhance interpretability has become paramount. To investigate the impact of changes in training data on a pre-trained model, a common approach is leave-one-out retraining. This entails systematically altering the training dataset by removing specific samples to observe resulting changes within the model. However, retraining the model for each altered dataset presents a significant computational challenge, given the need to perform this operation for every dataset variation. In this paper, we introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages. During the offline training phase, we approximate…
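To make the cost described in the abstract concrete, here is a minimal sketch of the leave-one-out retraining baseline only, not the paper's distilled-synset method. It assumes a toy one-dimensional least-squares model, and the names `fit_line`, `leave_one_out_impact`, and the sample data are hypothetical, chosen purely for illustration. Each removed sample triggers one full refit, which is the per-variation cost the framework aims to avoid.

```python
def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def leave_one_out_impact(train, x_test):
    """Change in the prediction at x_test when each training point is removed."""
    a, b = fit_line(train)
    base = a * x_test + b
    impacts = []
    for i in range(len(train)):
        reduced = train[:i] + train[i + 1:]
        # One full retraining per removed sample: N samples cost N refits.
        a_i, b_i = fit_line(reduced)
        impacts.append((a_i * x_test + b_i) - base)
    return impacts


# Illustrative data (assumption): three points on y = x plus one outlier.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 9.0)]
impacts = leave_one_out_impact(train, x_test=4.0)

# Index of the training sample whose removal changes the prediction most.
most_influential = max(range(len(train)), key=lambda i: abs(impacts[i]))
print(most_influential, impacts)
```

In this toy setting a refit is cheap, so the loop is harmless. For a large model, each iteration would be a complete training run, which is why approximations to this loop are attractive.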
Taxonomy
Topics: Graph Theory and Algorithms · Data Management and Algorithms · 3D Shape Modeling and Analysis
