Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
Yide Ran, Jianwen Xie, Minghui Wang, Wenjin Zheng, Denghui Zhang, Chuan Li, Zhaozhuo Xu

TL;DR
RISE is a scalable influence estimator for large language models that sketches gradients at the output (readout) layer, reducing index storage while still supporting effective data attribution and valuation.
Contribution
Introduces RISE, an influence sketching estimator that focuses on influence hotspots at the output layer, enabling scalable data attribution for large language models.
Findings
RISE reduces index storage by up to 112× compared to RapidIn.
Scales to 32B-parameter LLMs, where other methods become infeasible.
Effectively detects backdoor data and improves downstream tasks.
Abstract
Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia…
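The outer-product structure and the CountSketch step mentioned in the abstract can be illustrated with a small, generic sketch. It is not the RISE implementation: it assumes a softmax cross-entropy loss over a linear output layer (logits = W h), for which the gradient with respect to W is the outer product of the residual (p - y) and the hidden state h. The function names, dimensions, and sketch width are hypothetical, and the excerpt does not define how the RH and GH channels are actually built.

```python
import numpy as np

def make_countsketch(dim, width, rng):
    """Draw a random bucket and a random sign for each input coordinate."""
    buckets = rng.integers(0, width, size=dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=dim)
    return buckets, signs

def countsketch(x, buckets, signs, width):
    """Compress x to `width` numbers by signed accumulation into buckets."""
    out = np.zeros(width)
    np.add.at(out, buckets, signs * x)  # handles repeated bucket indices
    return out

def readout_factors(logits, target, hidden):
    """Return (residual, hidden); the output-layer gradient is their outer product.

    Assumes softmax cross-entropy with logits = W @ hidden (illustrative only).
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    residual = p.copy()
    residual[target] -= 1.0  # p - one_hot(target)
    return residual, hidden

# Hypothetical sizes: vocabulary 5000, hidden width 64, sketch width 256.
rng = np.random.default_rng(0)
vocab, d_model, width = 5000, 64, 256
buckets, signs = make_countsketch(vocab, width, rng)

r_a, h_a = readout_factors(rng.normal(size=vocab), 7, rng.normal(size=d_model))
r_b, h_b = readout_factors(rng.normal(size=vocab), 7, rng.normal(size=d_model))

# Exact gradient inner product factorizes: <r_a h_a^T, r_b h_b^T> = (r_a.r_b)(h_a.h_b),
# so the vocab x d_model gradient matrix never has to be materialized.
exact = (r_a @ r_b) * (h_a @ h_b)

# Approximate the vocabulary-sized factor from its sketches only.
s_a = countsketch(r_a, buckets, signs, width)
s_b = countsketch(r_b, buckets, signs, width)
approx = (s_a @ s_b) * (h_a @ h_b)
print(exact, approx)
```

Because the sketch is linear and its inner-product estimate is unbiased, only the short sketched vectors need to be stored per training example, which is the general reason sketching can shrink an attribution index. The actual compression ratios reported for RISE depend on design choices not visible in this excerpt.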
