On the Feasibility of In-Context Probing for Data Attribution

Cathy Jiao; Gary Gao; Aditi Raghunathan; Chenyan Xiong

arXiv:2407.12259 · cs.CL · February 12, 2025

TL;DR

This paper explores in-context probing as a fast alternative to gradient-based data attribution for identifying influential training data. It shows that the two approaches correlate with each other and lead to similar downstream results on NLP and synthetic tasks.

Contribution

It introduces in-context probing as a computationally efficient proxy for gradient-based data attribution, validated through empirical NLP experiments and synthetic data analysis.

Findings

01 ICP correlates well with gradient-based attribution on NLP tasks

02 Fine-tuning on influential data from both methods yields similar performance

03 Synthetic data experiments support the connection between ICP and gradient methods

Abstract

Data attribution methods measure the contribution of training data to model outputs, and have several important applications in areas such as dataset curation and model interpretability. However, many standard data attribution methods, such as influence functions, rely on model gradients and are computationally expensive. In our paper, we show that in-context probing (ICP) -- prompting an LLM -- can serve as a fast proxy for gradient-based data attribution for data selection under conditions contingent on data similarity. We study this connection empirically on standard NLP tasks, and show that ICP and gradient-based data attribution are well-correlated in identifying influential training data for tasks that share a similar task type and content with the training data. Additionally, fine-tuning models on influential data selected by both methods achieves comparable downstream…
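To make the idea of two attribution methods being "well-correlated" concrete, here is a minimal sketch of how agreement between two sets of per-example influence scores could be measured. The abstract does not say which agreement measure the authors use; Spearman rank correlation and top-k overlap are assumptions chosen for illustration, and the score lists and the names `spearman` and `top_k_overlap` are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def top_k_overlap(a, b, k):
    """Fraction of the k highest-scored examples shared by both methods."""
    top_a = set(sorted(range(len(a)), key=lambda i: a[i], reverse=True)[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: b[i], reverse=True)[:k])
    return len(top_a & top_b) / k


# Hypothetical scores for five training examples (not from the paper).
icp_scores = [0.9, 0.1, 0.5, 0.7, 0.3]    # assumed in-context probing scores
grad_scores = [2.0, -1.0, 0.2, 1.5, 0.4]  # assumed gradient-based scores

print("Spearman:", spearman(icp_scores, grad_scores))
print("Top-2 overlap:", top_k_overlap(icp_scores, grad_scores, 2))
```

Rank-based measures fit a data-selection setting because only the ordering of training examples matters when picking a subset, not the raw scale of either method's scores.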

Code & Models

Repositories

cxcscmu/InContextDataValuation (official)

Videos

On the Feasibility of In-Context Probing for Data Attribution · Underline

Taxonomy

Topics: Advanced Statistical Process Monitoring · Healthcare Operations and Scheduling Optimization