On the Inflation of KNN-Shapley Value
Ziao Yang, Han Yue, Jian Chen, Hongfu Liu

TL;DR
This paper introduces Calibrated KNN-Shapley, a method to address value inflation in data valuation by calibrating the threshold for beneficial samples, improving detection of detrimental data across various learning scenarios.
Contribution
It proposes CKNN-Shapley, a calibration technique that mitigates value inflation in KNN-Shapley, enhancing data quality assessment in multiple practical learning settings.
Findings
CKNN-Shapley effectively reduces data valuation inflation.
It improves detection of detrimental samples in diverse scenarios.
The method extends to learning with noisy, streaming, and actively labeled data.
Abstract
Shapley value-based data valuation methods, originating from cooperative game theory, quantify the usefulness of each individual sample by considering its contribution to all possible training subsets. Despite their extensive applications, these methods encounter the challenge of value inflation: while samples with negative Shapley values are detrimental, some with positive values can also be harmful. This challenge prompts two fundamental questions: whether zero is a suitable threshold for distinguishing detrimental from beneficial samples, and how to determine an appropriate threshold. To address these questions, we focus on KNN-Shapley and propose Calibrated KNN-Shapley (CKNN-Shapley), which calibrates zero as the threshold to distinguish detrimental samples from beneficial ones by mitigating the negative effects of small-sized training subsets. Through extensive experiments,…
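For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the widely used closed-form recursion for unweighted KNN-Shapley, in which each training point's value for one test point is computed from its distance rank. It does not reproduce the paper's calibration step, whose details are not given in this summary. The function names `knn_shapley_single` and `knn_shapley`, the Euclidean distance, and the toy data are assumptions made for illustration.

```python
import numpy as np


def knn_shapley_single(x_train, y_train, x_test, y_test, k=5):
    """Baseline (uncalibrated) KNN-Shapley values for ONE test point.

    Assumes the standard unweighted recursion over training points sorted
    by ascending Euclidean distance to the test point. Hypothetical helper,
    not the paper's CKNN-Shapley.
    """
    n = len(y_train)
    dists = np.linalg.norm(x_train - x_test, axis=1)
    order = np.argsort(dists, kind="stable")  # nearest first
    match = (y_train[order] == y_test).astype(float)

    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n  # farthest point
    for i in range(n - 2, -1, -1):  # walk from far to near
        rank = i + 1  # 1-based distance rank
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    values = np.empty(n)
    values[order] = s  # map back to original training order
    return values


def knn_shapley(x_train, y_train, x_test, y_test, k=5):
    """Average the per-test-point values over a test set."""
    per_point = [
        knn_shapley_single(x_train, y_train, xt, yt, k)
        for xt, yt in zip(x_test, y_test)
    ]
    return np.mean(per_point, axis=0)


if __name__ == "__main__":
    # Toy 1-D data (assumed for illustration only).
    x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([1, 0, 1, 0])
    print(knn_shapley_single(x_train, y_train, np.array([0.1]), 1, k=1))
```

On this toy input the four values are 5/6, -1/6, 1/3 and 0, and they sum to the full-set utility of 1. Thresholding at zero flags only the second point as detrimental; whether zero is the right cut-off is exactly the question the abstract raises.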
Peer Reviews
Decision·Submitted to ICLR 2025
1) The paper is very well written, very well explained, and self-contained. I was not familiar with the KNN-Shapley technique and was able to catch up quickly by reading the paper and looking at the provided references. The plots are very helpful as well. 2) The proposed solution is very simple and arguably incremental, but I think the impact of the small change makes the KNN-Shapley algorithm significantly better when used in real-life applications. This is one of these cases where Occam's
1) Since KNN is not a state-of-the-art algorithm in the modern practitioner's toolbelt, my understanding of how this is used in real life is that you would use CKNN-Shapley to clean the data and improve your training set quality, and then use a higher-performing algorithm, e.g. GBM, XGBoost, SVM. Is this the case? If so, can you add experiments where this two-step procedure is applied and show how it can improve the performance of these widely used algorithms? 2) Based on t
(1) The paper introduces Calibrated KNN-Shapley (CKNN-Shapley) as a novel solution to address value inflation in data valuation using KNN-Shapley. This approach is significant as it recalibrates the threshold, effectively distinguishing between beneficial and detrimental samples, which is critical for robust data valuation. (2) The paper conducts extensive experiments across various benchmark datasets, demonstrating CKNN-Shapley’s ability to outperform traditional KNN-Shapley and its variants. B
(1) Despite improvements over traditional Shapley calculations, CKNN-Shapley still incurs notable computational costs, particularly on large datasets or complex deep learning tasks. While more efficient than the original KNN-Shapley, the method may not yet be scalable for very large or high-dimensional datasets without further optimization. (2) CKNN-Shapley, like KNN-Shapley, relies on K-Nearest Neighbors as a surrogate model, which may limit its applicability to contexts where KNN is less effec
- The paper considers a problem setting (data evaluation) and a representative method (kNN Shapley value) that have high practical importance. - The paper is well motivated and the proposed approach is simple and effective. - The paper also discusses the use of the proposed method in various applications.
- The motivation seems to exist only for the kNN-based Shapley value, not for other Shapley value approaches. The sentences in the abstract and introduction seem exaggerated in suggesting that the problem exists for all Shapley value methods. - The link between the motivation and the proposed method is vague. Specifically, is the proposed $T$ value designed only to fix the misclassified values? Is there a possibility that misclassified values still exist even with the proposed $T$? This also re
Taxonomy
Topics: Computability, Logic, AI Algorithms
