DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

Yongchan Kwon; Eric Wu; Kevin Wu; James Zou

arXiv:2310.00902·cs.LG·March 14, 2024·5 cites

TL;DR

DataInf is a novel, efficient influence approximation method designed for large-scale generative models, significantly reducing computational costs while accurately identifying influential training data points and mislabeled data.

Contribution

It introduces a closed-form influence approximation tailored for large models and parameter-efficient fine-tuning, outperforming existing methods in speed and accuracy.

Findings

01 DataInf accurately approximates influence scores in large models.

02 It is orders of magnitude faster than existing influence computation algorithms.

03 DataInf effectively identifies influential and mislabeled data points in various models.

Abstract

Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that…
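The abstract mentions an easy-to-compute closed-form expression without stating it. As a rough illustration of how such a closed form can arise, the sketch below assumes the damped Hessian is approximated by averaging per-sample inverses, so that each term (g gᵀ + λI)⁻¹ can be written with the Sherman-Morrison identity as (I − g gᵀ / (λ + gᵀg)) / λ. The function name `datainf_style_influence`, the single-layer setup, and the symbols `train_grads`, `val_grad`, and `lam` are hypothetical choices made for this example; this is not the authors' implementation, which lives in the ykwon0407/datainf repository.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def datainf_style_influence(train_grads, val_grad, lam):
    """Hypothetical single-layer sketch of a closed-form influence score.

    train_grads: list of per-sample training gradients (lists of floats)
    val_grad:    gradient of the validation loss (list of floats)
    lam:         damping constant, assumed > 0

    Assumption: H^{-1} is approximated by
        (1 / (n * lam)) * sum_i (I - g_i g_i^T / (lam + g_i . g_i)),
    i.e. each rank-one term is inverted separately via Sherman-Morrison.
    """
    n = len(train_grads)
    d = len(val_grad)

    # r = (approximate inverse Hessian) applied to val_grad.
    # No d x d matrix is ever built: each term needs only inner products.
    r = [0.0] * d
    for g in train_grads:
        coef = dot(g, val_grad) / (lam + dot(g, g))
        for j in range(d):
            r[j] += (val_grad[j] - coef * g[j]) / (n * lam)

    # Influence of training point k is minus the inner product of r with g_k.
    return [-dot(r, g_k) for g_k in train_grads]


if __name__ == "__main__":
    grads = [[1.0, 0.0], [0.0, 1.0]]
    print(datainf_style_influence(grads, val_grad=[1.0, 0.0], lam=1.0))
```

Under these assumptions the cost is linear in the number of parameters per training point, which is the kind of saving a closed form buys over explicitly inverting a Hessian. With the sign convention used here, a more negative score marks a training point whose gradient aligns with the validation gradient.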

Code & Models

Repositories

ykwon0407/datainf
PyTorch · Official

Videos

DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques