A case for data valuation transparency via DValCards
Keziah Naggita, Julienne LaChance

TL;DR
This paper highlights the biases and instability of current data valuation methods in ML, demonstrating their ethical and technical issues, and proposes DValCards to promote transparency and responsible use.
Contribution
It introduces DValCards, a framework for transparent data valuation, addressing biases and promoting ethical practices in data markets and ML systems.
Findings
Data pre-processing can significantly alter data value estimates.
Data valuation can increase class imbalance when subsampling.
Underrepresented groups may be undervalued by current metrics.
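The first finding can be illustrated with a minimal sketch. It assumes leave-one-out valuation with a 1-nearest-neighbour classifier on a hand-made one-feature dataset; these choices, the toy data, and the helper names (`impute`, `accuracy_1nn`, `loo_values`) are illustrative assumptions, not the datasets, models, or valuation methods analyzed in the paper.

```python
from statistics import mean, median

def impute(rows, strategy):
    """Fill missing entries (None) column-wise with the column mean or median."""
    fill = mean if strategy == "mean" else median
    columns = list(zip(*rows))
    fills = [fill([v for v in col if v is not None]) for col in columns]
    return [[f if v is None else v for v, f in zip(row, fills)] for row in rows]

def accuracy_1nn(train_X, train_y, val_X, val_y):
    """Validation accuracy of a 1-nearest-neighbour classifier (squared Euclidean distance)."""
    correct = 0
    for x, y in zip(val_X, val_y):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
        correct += train_y[dists.index(min(dists))] == y
    return correct / len(val_y)

def loo_values(train_X, train_y, val_X, val_y):
    """Leave-one-out value of point i: accuracy with all points minus accuracy without point i."""
    full = accuracy_1nn(train_X, train_y, val_X, val_y)
    return [
        full - accuracy_1nn(train_X[:i] + train_X[i + 1:],
                            train_y[:i] + train_y[i + 1:],
                            val_X, val_y)
        for i in range(len(train_X))
    ]

# Hypothetical toy data: the last training point has a missing feature value.
train_X = [[0.0], [1.0], [2.0], [3.0], [9.0], [10.0], [None]]
train_y = [0, 0, 0, 0, 1, 1, 1]
val_X = [[3.5], [5.0], [8.0]]
val_y = [0, 1, 1]

# Same data, same valuation method; only the imputation strategy differs.
mean_values = loo_values(impute(train_X, "mean"), train_y, val_X, val_y)
median_values = loo_values(impute(train_X, "median"), train_y, val_X, val_y)

print("mean imputation:  ", [round(v, 3) for v in mean_values])
print("median imputation:", [round(v, 3) for v in median_values])
```

On this toy data, mean imputation gives two training points a value of 1/3, while median imputation gives every point a value of zero. The only difference between the two runs is where the single imputed point lands, which is the kind of sensitivity the finding describes.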
Abstract
Following the rise in popularity of data-centric machine learning (ML), various data valuation methods have been proposed to quantify the contribution of each datapoint to desired ML model performance metrics (e.g., accuracy). Beyond the technical applications of data valuation methods (e.g., data cleaning, data acquisition, etc.), it has been suggested that within the context of data markets, data buyers might utilize such methods to fairly compensate data owners. Here we demonstrate that data valuation metrics are inherently biased and unstable under simple algorithmic design choices, resulting in both technical and ethical implications. By analyzing 9 tabular classification datasets and 6 data valuation methods, we illustrate how (1) common and inexpensive data pre-processing techniques can drastically alter estimated data values; (2) subsampling via data valuation metrics may…
Peer Reviews
Decision: Submitted to ICLR 2025
- The paper is easy to follow.
- The paper's technical contribution is a bit limited, mainly focusing on evaluating existing methods.
- The findings from the paper are not novel. (1) Regarding the sensitivity to data imputation methods: data valuation fundamentally determines the contribution of a given data point based on the other data used together for training; hence, it is straightforward to see that the value of a data point would change depending on the choice of the imputation method because different imputation metho
The paper is well-motivated and conveys an essential message: existing data valuation methods, primarily designed for machine learning, may be unsuitable for data compensation in data markets. It highlights various practical challenges that emerge when these methods are repurposed for economic applications. Backed by comprehensive experimental analysis, the paper’s findings offer valuable insights and serve as practical guidelines for the effective design and implementation of data valuation met
The paper raises an important issue, though its main limitation appears to be the lack of a fundamental solution. While DValCards help mitigate the issues of instability and fairness, they primarily serve as a more detailed documentation tool for data valuation methods. The paper makes a valuable contribution by highlighting the challenges of existing data valuation approaches through extensive empirical evaluations, including issues related to instability, class imbalance, and fairness. Howeve
- The authors provide an elaborate and comprehensive analysis of the impact of preprocessing techniques and class imbalance on data valuation metrics, especially imputation methods and their effects on class balance and rank stability. 12 Open-ML datasets are considered and 4 Data Valuation frameworks are chosen for comparison.
- The introduction of DValCards is a valuable contribution to the field, providing a standardized framework for reporting critical information about data valuation me
- The effectiveness of imputation preprocessing methods in standard data valuation tasks (e.g., weighted training, noisy label detection) is not thoroughly evaluated, and the authors could provide more evidence. Instability of values is known in Data Valuation literature, but specifics with respect to imputation methods are not widely studied.
- Since this paper is trying to unify a setting for all Data Valuation methods, it could benefit from expanding its scope to include runtime analysis (
Taxonomy
Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Ethics and Social Impacts of AI
