On the Impact of the Utility in Semivalue-based Data Valuation
Mélissa Tamine, Benjamin Heymann, Maxime Vono, Patrick Loiseau

TL;DR
This paper introduces a geometric approach to assess the robustness of semivalue-based data valuation under different utility choices, providing a practical metric to predict how valuations shift with utility variations.
Contribution
It proposes a dataset's spatial signature and a robustness metric, enabling practitioners to evaluate the stability of data valuations across utility functions.
Findings
The methodology accurately predicts valuation shifts with utility changes.
Robustness varies significantly with the choice of semivalue.
The approach is validated across multiple datasets and semivalues.
Abstract
Semivalue-based data valuation uses cooperative-game theory intuitions to assign each data point a value reflecting its contribution to a downstream task. Still, those values depend on the practitioner's choice of utility, raising the question: How robust is semivalue-based data valuation to changes in the utility? This issue is critical when the utility is set as a trade-off between several criteria and when practitioners must select among multiple equally valid utilities. We address this by introducing the notion of a dataset's spatial signature: given a semivalue, we embed each data point into a lower-dimensional space in which any utility becomes a linear functional, making the data valuation framework amenable to a simpler geometric picture. Building on this, we propose a practical methodology centered on an explicit robustness metric that informs practitioners whether and by how…
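The geometric picture described in the abstract can be illustrated on a toy game. The following is a minimal sketch, not the authors' implementation: it assumes the Shapley value as the semivalue, four data points, and two hypothetical base utilities `u1` and `u2` built from made-up scores `A` and `B`. Because semivalues are linear in the utility, each point can be embedded as the pair (value under `u1`, value under `u2`), and its value under any combination `cos(alpha) * u1 + sin(alpha) * u2` is the projection of that pair onto the direction `(cos(alpha), sin(alpha))`.

```python
from itertools import combinations
from math import comb, cos, sin, sqrt, pi

# Hypothetical toy setup: 4 data points with made-up per-point scores.
N = 4
A = [3.0, 1.0, 2.0, 0.5]
B = [0.5, 2.0, 1.0, 3.0]

def u1(s):
    """Illustrative base utility 1 (assumption, not from the paper)."""
    return sqrt(sum(A[j] for j in s))

def u2(s):
    """Illustrative base utility 2 (assumption, not from the paper)."""
    return sqrt(sum(B[j] for j in s))

def shapley(n, utility):
    """Exact Shapley values by enumerating every subset of the other points."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {i}) - utility(s))
        values.append(total)
    return values

# 2D embedding of each data point: its values under the two base utilities.
phi1 = shapley(N, u1)
phi2 = shapley(N, u2)
signature = list(zip(phi1, phi2))

def projected_values(alpha):
    """Values under cos(alpha)*u1 + sin(alpha)*u2, read off the embedding."""
    return [cos(alpha) * x + sin(alpha) * y for x, y in signature]

def direct_values(alpha):
    """Same values, recomputed from scratch on the combined utility."""
    return shapley(N, lambda s: cos(alpha) * u1(s) + sin(alpha) * u2(s))

def ranking(alpha):
    """Indices of the data points, most valuable first."""
    vals = projected_values(alpha)
    return sorted(range(N), key=lambda i: -vals[i])

if __name__ == "__main__":
    # Sweep the utility direction and watch the ranking change.
    for step in range(5):
        alpha = step * pi / 8
        print(f"alpha={alpha:.3f} ranking={ranking(alpha)}")
```

Sweeping `alpha` from 0 to pi/2 moves the utility from `u1` to `u2`. How often the ranking changes along the way gives a crude feel for robustness, although the paper's actual metric is not reproduced here.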
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper introduces a novel geometric perspective on semivalue-based data valuation, together with a formal robustness metric, offering a way to benchmark data valuation frameworks and test their stability. In real-world datasets, the utility function can change at any time, reinforcing the importance of this metric.
2. The authors provide a rigorous evaluation of their metric on existing data valuation methods and utility functions.
3. The paper also offers insights into why some valuation methods ten
1. The paper focuses on utility functions specific to binary classification, which raises the question of whether the findings will translate to other utility functions and learning tasks. Similarly, the evaluations focus on two-utility systems, and multi-utility systems are yet to be explored.
2. Benchmarking the stability of data valuation methods is not new. The key contribution of the paper is the geometric perspective, and this paper would be strengthened by additional experiments that show the
The paper is clearly written and easy to follow. It presents an elegant and intuitive geometric representation that translates variations in utility functions into simple linear projections. The proposed method is validated across a wide range of datasets and semivalues (Shapley, Beta Shapley, and Banzhaf). The authors have thoroughly addressed the major limitations of the previous version by adding regression and multiclass tasks, introducing a formal robustness metric, and conducting experi
I have reviewed this paper before, and the authors have resolved most of my confusion, but I still have a remaining question. The experiments on multiple-valid-utility scenarios remain somewhat limited and would benefit from further expansion.
Authors present another perspective on how the choice of the utility affects the resulting data values and rankings during data valuation. To mitigate the noise and randomness introduced by Monte Carlo sampling, they propose the use of aligned sampling. The theory and empirical setups are well-written and easy to follow. Authors conduct multiple experiments to assess the efficacy of their metric, and also make their code available.
Restricting the utility to a 2D space spanned by two fixed base utilities u1 and u2 (and the extension to class-wise utilities) makes the robustness metric hard to scale. Some information may be lost when embedding the data into a low-dimensional space, and it is unclear to me how this affects the results. The data values from Monte Carlo approximations in regression problems tend to be unreliable due to greater variability in utility estimates when sample sizes are small. So, is there a spec
Taxonomy
Topics: Forecasting Techniques and Applications · Advanced Statistical Process Monitoring · Explainable Artificial Intelligence (XAI)
Methods: Sparse Evolutionary Training
