Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

Yide Ran; Jianwen Xie; Minghui Wang; Wenjin Zheng; Denghui Zhang; Chuan Li; Zhaozhuo Xu

arXiv:2604.16197 · cs.LG · April 20, 2026

TL;DR

RISE is a scalable influence estimator that sketches the output-layer (readout) gradients of large language models, cutting index storage while still supporting effective data attribution and valuation.

Contribution

Introduces RISE, a novel influence sketching estimator that focuses on output layer hotspots, enabling scalable data attribution for large language models.

Findings

1. RISE reduces index storage by up to 112× compared to RapidIn.

2. RISE scales to 32B-parameter LLMs, where other methods become infeasible.

3. RISE effectively detects backdoor data and improves downstream tasks.

Abstract

Data attribution and valuation are critical for understanding data-model synergy for Large Language Models (LLMs), yet existing gradient-based methods suffer from scalability challenges on LLMs. Inspired by human cognition, where decision making relies on a focused readout of relevant memories rather than replaying all pathways, we introduce RISE (Readout Influence Sketching Estimator). Instead of computing and indexing gradients across the entire LLM, RISE focuses on influence hotspots at the output layer, where influence signals concentrate, and the gradient admits a decomposed outer-product form. This enables a dual-channel representation combining a lexical residual channel (RH) and a semantic projected-error channel (GH). Applying CountSketch projections to these channels achieves strong compression while maintaining accurate attribution. Across the OLMo (1B-32B) and Pythia…
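
The outer-product structure mentioned in the abstract can be illustrated with a small NumPy sketch. It assumes a standard softmax cross-entropy readout, for which the gradient with respect to the output matrix `W` is `outer(p - y, h)`, so the inner product of two such gradients factors into `(r1 · r2) * (h1 · h2)`. The code applies a generic CountSketch to the vocabulary-sized residual `p - y` only. This is not the paper's implementation: the excerpt does not define the RH and GH channels precisely, and every name here (`make_countsketch`, `countsketch`, `readout_residual`, the sizes, the random data) is hypothetical.

```python
import numpy as np

def make_countsketch(dim, width, rng):
    """Draw one random bucket and one random sign per input coordinate."""
    buckets = rng.integers(0, width, size=dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=dim)
    return buckets, signs

def countsketch(x, buckets, signs, width):
    """Compress x to `width` dims by summing its signed entries per bucket."""
    out = np.zeros(width)
    np.add.at(out, buckets, signs * x)  # unbuffered add handles bucket collisions
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def readout_residual(W, h, target):
    """Residual p - y of a softmax readout with logits W @ h (assumed setup)."""
    r = softmax(W @ h)
    r[target] -= 1.0
    return r

rng = np.random.default_rng(0)
vocab, hidden, width = 2000, 64, 256  # illustrative sizes, not from the paper
W = rng.normal(scale=0.1, size=(vocab, hidden))
h_train, h_test = rng.normal(size=hidden), rng.normal(size=hidden)
r_train = readout_residual(W, h_train, target=3)
r_test = readout_residual(W, h_test, target=3)

# Exact gradient inner product, computed from the full vocab x hidden gradients.
exact = float(np.sum(np.outer(r_train, h_train) * np.outer(r_test, h_test)))

# The same quantity from the two factors, without forming either gradient.
factored = float((r_train @ r_test) * (h_train @ h_test))

# Approximation that stores only a width-dim sketch of each residual.
buckets, signs = make_countsketch(vocab, width, rng)
s_train = countsketch(r_train, buckets, signs, width)
s_test = countsketch(r_test, buckets, signs, width)
approx = float((s_train @ s_test) * (h_train @ h_test))

print(f"exact={exact:.6f} factored={factored:.6f} sketched={approx:.6f}")
```

In this toy setup, each example stores `width + hidden` numbers instead of the `vocab * hidden` entries of the full readout gradient. CountSketch is linear and gives an unbiased estimate of inner products, so the sketched score fluctuates around the exact one, with variance that shrinks as `width` grows.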
