Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

Minghui Xu; Qi Luo; Kun Li

arXiv:2604.22893·cs.LG·April 28, 2026


TL;DR

This paper introduces a utility-aware data valuation framework for LLMs that combines token-level information metrics, empirical training gain measurement, and cryptographic verifiability, with the aim of fair and transparent data pricing.

Contribution

The paper proposes a three-layer approach to valuing LLM training data that combines token-level information metrics, influence-based gain measurement, and cryptographic commitments.
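As a rough illustration of the information-metric layer, the sketch below computes the Shannon entropy of a sample's empirical token distribution. It is not the paper's implementation: the function name `token_entropy`, the whitespace tokenization, and the unigram estimate are all assumptions made for this example, and the paper's Data Quality Score is not modeled.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy, in bits per token, of the empirical unigram distribution.

    Hypothetical helper for illustration; a real pipeline would use the
    model's own tokenizer rather than whitespace splitting.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Assumed toy inputs: a fully repetitive sample and a sample with no repeats.
repetitive = "the the the the".split()
varied = "merkle trees commit training data".split()

print(token_entropy(repetitive))  # a single repeated token carries zero entropy
print(token_entropy(varied))      # five distinct tokens give log2(5) bits
```

Under this estimate, repetitive text scores lower than varied text of the same length, which is the kind of signal a plain row count or token count cannot express.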

Findings

01 Proxy-based empirical gain aligns closely with actual utility.

02 Outperforms traditional row-count and token-count baselines.

03 Validated on instruction, reasoning, and code tasks.

Abstract

Traditional data valuation methods based on "row-count × quality coefficient" paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves…
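To make the third layer more concrete, here is a minimal sketch of a hash-based commitment to a dataset via a Merkle root. It is an assumption-laden illustration, not the paper's construction: the helper names `sha256` and `merkle_root` are hypothetical, the leaf and node prefixes follow the style of RFC 6962, and duplicating the last node on odd-sized levels is a simplification chosen for brevity.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def merkle_root(records):
    """Return the hex Merkle root committing to a list of byte-string records.

    Hypothetical helper for illustration only.
    """
    if not records:
        raise ValueError("need at least one record")
    # Prefix leaves with 0x00 and internal nodes with 0x01 so that a leaf
    # hash can never be confused with an internal node hash.
    level = [sha256(b"\x00" + r) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad odd levels by repeating the last node.
            level.append(level[-1])
        level = [
            sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Assumed toy dataset: three training records as raw bytes.
dataset = [b"example one", b"example two", b"example three"]
commitment = merkle_root(dataset)

# Altering any record changes the root, so the published root binds the
# seller to the exact data that was priced.
tampered = [b"example one", b"example 2", b"example three"]
assert merkle_root(tampered) != commitment
```

A seller could publish `commitment` before training, and a ledger entry could later reference it; how the paper's tamper-evident training ledger actually records such roots is not specified in the text above.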
