Geometric Data Valuation via Leverage Scores
Rodrigo Mendoza-Smith

TL;DR
This paper introduces a geometric data valuation method based on statistical leverage scores. It offers a computationally efficient alternative to Shapley values for measuring data importance, and it improves model training and active learning.
Contribution
It proposes leverage scores as a scalable, geometric alternative to Shapley valuation, connects data importance to the classical A- and D-optimal design criteria, and demonstrates practical benefits in active learning.
Findings
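Leverage scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation.
Ridge leverage scores yield strictly positive marginal gains and connect to the classical A- and D-optimal design criteria.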
Leverage sampling improves active learning performance.
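The first finding can be checked numerically on a toy dataset. The sketch below is not the paper's code: it assumes the textbook definition of leverage scores (the diagonal of the hat matrix of a feature matrix X), and the names `leverage_scores`, `X`, `scores`, and `rank` are illustrative. Under that assumption, a zero row receives score zero (dummy), identical rows receive identical scores (symmetry), and the scores sum to the rank of X (efficiency, up to whatever normalization the paper uses).

```python
import numpy as np

def leverage_scores(X):
    """Textbook leverage scores: diagonal of the hat matrix X (X^T X)^+ X^T."""
    # Thin SVD of X; the hat matrix equals U U^T restricted to the
    # directions with non-negligible singular values.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    tol = s.max() * max(X.shape) * np.finfo(X.dtype).eps
    U = U[:, s > tol]
    # Row i of U has squared norm equal to the i-th leverage score.
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # 50 hypothetical datapoints, 5 features
X[10] = 0.0                        # a point that adds nothing to the span
X[21] = X[20]                      # two identical points

scores = leverage_scores(X)
rank = np.linalg.matrix_rank(X)

print("sum of scores:", scores.sum(), "rank:", rank)
print("score of zero row:", scores[10])
print("scores of identical rows:", scores[20], scores[21])
```

Dividing the scores by the rank gives values that sum to one, which is one possible way to state efficiency; the normalization the paper adopts is not given in this summary.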
Abstract
Shapley data valuation provides a principled, axiomatic framework for assigning importance to individual datapoints, and has gained traction in dataset curation, pruning, and pricing. However, it is a combinatorial measure that requires evaluating marginal utility across all subsets of the data, making it computationally infeasible at scale. We propose a geometric alternative based on statistical leverage scores, which quantify each datapoint's structural influence in the representation space by measuring how much it extends the span of the dataset and contributes to the effective dimensionality of the training problem. We show that our scores satisfy the dummy, efficiency, and symmetry axioms of Shapley valuation and that extending them to ridge leverage scores yields strictly positive marginal gains that connect naturally to classical A- and D-optimal design criteria. We…
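The link between ridge leverage scores and D-optimal design mentioned in the abstract can be illustrated with a short sketch. It assumes the standard definition of a ridge leverage score, x^T (X^T X + lam I)^{-1} x with lam > 0, which may differ in detail from the paper's; the function names and the value of `lam` are hypothetical. By the matrix determinant lemma, adding a point x to the regularized Gram matrix A raises log det(A) by log(1 + x^T A^{-1} x), which is strictly positive for any nonzero x.

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """x_i^T (X^T X + lam I)^{-1} x_i for every row x_i of X (assumes lam > 0)."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    # Solve A Z = X^T rather than forming the inverse explicitly.
    Z = np.linalg.solve(A, X.T)            # shape (d, n)
    return np.einsum("ij,ji->i", X, Z)     # diagonal of X A^{-1} X^T

def d_optimal_gain(A, x):
    """log det(A + x x^T) - log det(A), computed directly from both matrices."""
    _, logdet_new = np.linalg.slogdet(A + np.outer(x, x))
    _, logdet_old = np.linalg.slogdet(A)
    return logdet_new - logdet_old

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))           # hypothetical feature matrix
lam = 0.1                                  # hypothetical ridge parameter
A = X.T @ X + lam * np.eye(X.shape[1])

# Ridge leverage of a candidate point and the log-det gain from adding it.
x_new = rng.standard_normal(6)
score_new = x_new @ np.linalg.solve(A, x_new)
gain = d_optimal_gain(A, x_new)
print("log-det gain:", gain, "log(1 + score):", np.log1p(score_new))

# In-sample ridge scores sum to the effective dimension sum_j s_j^2 / (s_j^2 + lam).
scores = ridge_leverage_scores(X, lam)
s = np.linalg.svd(X, compute_uv=False)
effective_dim = np.sum(s**2 / (s**2 + lam))
print("sum of ridge scores:", scores.sum(), "effective dimension:", effective_dim)
```

The sketch covers only the D-optimal (log-determinant) criterion; the A-optimal criterion concerns the trace of the inverse and is not shown here.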
Taxonomy
Topics: Data Quality and Management · Machine Learning and Data Classification · Ethics and Social Impacts of AI
