Proper Dataset Valuation by Pointwise Mutual Information
Shuran Zheng, Xuan Qi, Rui Ray Chen, Yongchan Kwon, and James Zou

TL;DR
This paper introduces an information-theoretic framework using mutual information to evaluate dataset quality, aiming to improve data curation by focusing on informativeness about true model parameters rather than test set resemblance.
Contribution
It proposes a novel mutual information-based method for dataset valuation, addressing limitations of heuristic and test-score-based evaluation methods.
Findings
Mutual information effectively measures dataset informativeness about true model parameters.
Traditional evaluation methods can overvalue data that overfits test sets.
The proposed method aligns dataset scores with true informativeness, improving data curation.
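As a rough illustration of the first finding, here is a minimal sketch of how mutual information ranks a clean data source above a garbled version of it. This is not the paper's estimator. It assumes a toy setup with a binary "true parameter", a uniform prior, and hand-picked noise levels. The function names (`mutual_information`, `joint_from_channel`, `compose`) and all numbers are hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a 2-D list."""
    row = [sum(r) for r in joint]          # marginal of the parameter
    col = [sum(c) for c in zip(*joint)]    # marginal of the observation
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0:
                # p * log of the pointwise ratio p(x, y) / (p(x) p(y))
                mi += p * math.log(p / (row[i] * col[j]))
    return mi

def joint_from_channel(prior, channel):
    """Joint p(theta, x) = prior(theta) * channel[theta][x]."""
    return [[prior[i] * channel[i][j] for j in range(len(channel[i]))]
            for i in range(len(prior))]

def compose(a, b):
    """Channel a followed by channel b (row-stochastic matrix product)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy assumption: binary parameter with a uniform prior.
prior = [0.5, 0.5]
# Hypothetical "clean" data source: reports the parameter correctly 90% of the time.
clean = [[0.9, 0.1], [0.1, 0.9]]
# Hypothetical garbling: extra noise applied to the clean observation.
garble = [[0.8, 0.2], [0.2, 0.8]]
noisy = compose(clean, garble)

mi_clean = mutual_information(joint_from_channel(prior, clean))
mi_noisy = mutual_information(joint_from_channel(prior, noisy))

print(f"I(theta; clean data)   = {mi_clean:.4f} nats")
print(f"I(theta; garbled data) = {mi_noisy:.4f} nats")
```

Because the garbled source is obtained by post-processing the clean one, the data processing inequality guarantees that its mutual information with the parameter cannot be higher. That is the sense in which a mutual-information score respects an informativeness ordering in this toy case.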
Abstract
Data plays a central role in advancements in modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of principled and effective data curation methods in recent years. However, existing methods largely rely on heuristics, and whether they are truly effective remains unclear. For instance, standard evaluation methods that assess a trained model's performance on specific benchmarks may incentivize assigning high scores to data that merely resembles the test set. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this issue, we propose an information-theoretic framework for evaluating data curation methods. We define dataset quality in terms of its informativeness about the true model parameters, formalized using the Blackwell ordering of…
Taxonomy
Topics: Big Data and Business Intelligence · Forecasting Techniques and Applications · Customer churn and segmentation
Methods: Sparse Evolutionary Training
