Data Valuation for Medical Imaging Using Shapley Value: Application on A Large-scale Chest X-ray Dataset
Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A. Dunnmon, James Zou, Daniel L. Rubin

TL;DR
This paper demonstrates that data Shapley can identify low-quality and mislabeled images in a large-scale chest X-ray dataset, and that filtering the training data by valuation score improves a pneumonia detection model.
Contribution
The study applies data Shapley, a data valuation metric, to medical imaging and shows that it can identify low-quality and mislabeled training data, which in turn can be removed to improve model performance.
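For reference, the Shapley value that data Shapley builds on can be written as follows. This is the standard game-theoretic definition, not a formula quoted from this page; the symbols are assumed notation: $D$ is the training set of $n$ points, $V(S)$ is the performance of a model trained on subset $S$, and $\phi_i$ is the value assigned to training point $i$.

```latex
\phi_i \;=\; \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}}
  \frac{V\bigl(S \cup \{i\}\bigr) - V(S)}{\binom{n-1}{|S|}}
```

Read this way, a point's value is its average marginal contribution to performance over all subsets of the remaining data, so a point that tends to lower performance when added receives a low or negative value.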
Findings
Removing training data with high Shapley values decreases pneumonia detection performance.
Removing training data with low Shapley values improves pneumonia detection performance.
Data with high Shapley values correlates with true pneumonia cases.
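The idea behind these findings can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the toy 2-D "features", the nearest-centroid stand-in model, the function names `utility` and `monte_carlo_shapley`, and the plain permutation-sampling estimator, which is not necessarily the estimator or model the authors used on chest X-rays.

```python
import numpy as np

def utility(X_tr, y_tr, X_val, y_val):
    """Validation accuracy of a nearest-centroid classifier (hypothetical stand-in model)."""
    classes = np.unique(y_tr)
    if len(classes) == 0:
        return 0.5  # empty training set: chance level on a balanced binary validation set
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_val[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == y_val).mean())

def monte_carlo_shapley(X_tr, y_tr, X_val, y_val, n_perms=100, seed=0):
    """Estimate each training point's Shapley value by averaging its marginal
    contribution to validation accuracy over random permutations."""
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = utility(X_tr[:0], y_tr[:0], X_val, y_val)  # score of the empty set
        for k, i in enumerate(perm):
            idx = perm[: k + 1]
            cur = utility(X_tr[idx], y_tr[idx], X_val, y_val)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return values / n_perms

# Toy data: two Gaussian blobs standing in for image features (an assumption).
data_rng = np.random.default_rng(42)

def make_blobs(n_per_class):
    X = np.vstack([data_rng.normal(-2.0, 1.0, (n_per_class, 2)),
                   data_rng.normal(2.0, 1.0, (n_per_class, 2))])
    y = np.repeat([0, 1], n_per_class)
    return X, y

X_tr, y_clean = make_blobs(15)
X_val, y_val = make_blobs(100)

# Simulate mislabeled training data by flipping four labels.
flipped = np.array([0, 1, 15, 16])
y_tr = y_clean.copy()
y_tr[flipped] = 1 - y_tr[flipped]

values = monte_carlo_shapley(X_tr, y_tr, X_val, y_val)

# Filter: drop the lowest-valued points and re-score the stand-in model.
keep = np.argsort(values)[len(flipped):]
print("mean value, flipped points:", values[flipped].mean())
print("mean value, clean points:  ", np.delete(values, flipped).mean())
print("accuracy, all data:        ", utility(X_tr, y_tr, X_val, y_val))
print("accuracy, after filtering: ", utility(X_tr[keep], y_tr[keep], X_val, y_val))
```

In this toy setting the flipped points tend to receive lower values than the clean ones, which is the property the filtering step relies on. Exact permutation sampling retrains the model once per point per permutation, so a large dataset would need a cheaper approximation.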
Abstract
The reliability of machine learning models can be compromised when trained on low quality data. Many large-scale medical imaging datasets contain low quality labels extracted from sources such as medical reports. Moreover, images within a dataset may have heterogeneous quality due to artifacts and biases arising from equipment or measurement errors. Therefore, algorithms that can automatically identify low quality data are highly desired. In this study, we used data Shapley, a data valuation metric, to quantify the value of training data to the performance of a pneumonia detection algorithm in a large chest X-ray dataset. We characterized the effectiveness of data Shapley in identifying low quality versus valuable data for pneumonia detection. We found that removing training data with high Shapley values decreased the pneumonia detection performance, whereas removing data with low…
