Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs
Minghui Xu, Qi Luo, Kun Li

TL;DR
This paper introduces a utility-aware data valuation framework for LLMs that combines token-level information metrics, empirical training-gain measurement, and cryptographic verifiability to support fair and transparent data pricing.
Contribution
It proposes a novel multi-layered approach combining information metrics, influence-based gain measurement, and cryptographic commitments for data valuation in LLMs.
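As a rough illustration of the information-metric layer, the sketch below computes the empirical Shannon entropy of a token sequence in bits per token. It is not the paper's metric: the function name `token_entropy` and the whitespace tokenization in the usage lines are assumptions made for illustration only, and the paper's Data Quality Score is not reproduced here.

```python
import math
from collections import Counter
from typing import Sequence


def token_entropy(tokens: Sequence[str]) -> float:
    """Empirical Shannon entropy (bits per token) of a token sequence.

    Hypothetical helper for illustration; the paper's actual
    information-density metric may be defined differently.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    n = len(tokens)
    # H = -sum_i p_i * log2(p_i), with p_i the relative frequency of token i
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Usage: whitespace splitting stands in for a real LLM tokenizer (assumption).
repetitive = "the the the the".split()
varied = "sort the list in place".split()
print(token_entropy(repetitive), token_entropy(varied))
```

A repetitive sample scores lower than a varied one of similar length, which is the intuition behind pricing by information density rather than by raw token count.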
Findings
Proxy-based empirical gain aligns closely with actual utility.
Outperforms traditional row-count and token-count baselines.
Validated on instruction, reasoning, and code tasks.
Abstract
Traditional data valuation methods based on "row-count quality coefficient" paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves…
