Reliability-based cleaning of noisy training labels with inductive conformal prediction in multi-modal biomedical data mining
Xianghao Zhan, Qinmei Xu, Yuanning Zheng, Guangming Lu, Olivier Gevaert

TL;DR
This paper introduces a reliability-based data cleaning method using inductive conformal prediction to improve classification accuracy in noisy, multi-modal biomedical datasets without requiring extensive manual labeling.
Contribution
The proposed method leverages ICP-derived reliability metrics to identify and correct mislabeled data, enhancing classification performance across diverse biomedical modalities.
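As a rough illustration of how ICP-derived reliability metrics could drive label cleaning, the sketch below computes conformal p-values from a small, trusted calibration set and then relabels or discards noisy samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the nonconformity score (one minus the predicted class probability), the function names icp_p_values and clean_labels, and the thresholds min_credibility and min_confidence are all hypothetical choices made for this example. Credibility (the largest p-value) and confidence (one minus the second-largest p-value) follow the standard conformal prediction definitions.

```python
import numpy as np

def icp_p_values(cal_probs, cal_labels, probs):
    """Conformal p-values for every (sample, candidate label) pair.

    cal_probs : (n_cal, n_classes) class probabilities on the trusted calibration set
    cal_labels: (n_cal,) integer labels of the calibration set
    probs     : (n, n_classes) class probabilities on the noisy samples
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    probs = np.asarray(probs, dtype=float)
    # Assumed nonconformity score: 1 - probability assigned to the label.
    cal_alpha = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    alpha = 1.0 - probs  # score of each noisy sample under each candidate label
    # Count calibration scores at least as nonconforming as the candidate's score.
    n_ge = (cal_alpha[None, None, :] >= alpha[:, :, None]).sum(axis=2)
    return (n_ge + 1) / (len(cal_alpha) + 1)

def clean_labels(p_values, noisy_labels, min_credibility=0.25, min_confidence=0.75):
    """Relabel or drop samples using credibility and confidence (hypothetical rule).

    Returns (cleaned_labels, keep_mask).
    """
    noisy_labels = np.asarray(noisy_labels)
    best = p_values.argmax(axis=1)                    # most plausible label
    credibility = p_values.max(axis=1)                # largest p-value
    second = np.sort(p_values, axis=1)[:, -2]         # second-largest p-value
    confidence = 1.0 - second
    # No label fits the calibration data well: treat the sample as an outlier.
    outlier = credibility < min_credibility
    # Given label disagrees with a confidently predicted one: rectify it.
    relabel = ~outlier & (best != noisy_labels) & (confidence >= min_confidence)
    cleaned = noisy_labels.copy()
    cleaned[relabel] = best[relabel]
    return cleaned, ~outlier
```

The class probabilities can come from any classifier trained on the noisy data. The two thresholds trade off how aggressively labels are changed against how many samples are discarded, and would need tuning for each dataset.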
Findings
Significant accuracy improvements in drug-induced liver injury classification
Enhanced AUROC and AUPRC in COVID-19 patient prediction
High accuracy and F1 score gains in breast cancer subtyping
Abstract
Accurately labeling biomedical data presents a challenge. Traditional semi-supervised learning methods often under-utilize available unlabeled data. To address this, we propose a novel reliability-based training data cleaning method employing inductive conformal prediction (ICP). This method capitalizes on a small set of accurately labeled training data and leverages ICP-calculated reliability metrics to rectify mislabeled data and outliers within vast quantities of noisy training data. The efficacy of the method is validated across three classification tasks within distinct modalities: filtering drug-induced-liver-injury (DILI) literature with title and abstract, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise to the training labels were introduced through…
Taxonomy
Topics: COVID-19 diagnosis using AI · Machine Learning and Data Classification
