Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism
Aude Sportisse (CRISAM, 3iA Côte d'Azur, MAASAI, UCA), Hugo Schmutz (CRISAM, TIRO-MATOs, JAD, 3iA Côte d'Azur, MAASAI, UCA), Olivier Humbert (UNICANCER/CAL, TIRO-MATOs, UCA), Charles Bouveyron (MAASAI, CRISAM, 3iA Côte d'Azur, UCA), Pierre-Alexandre Mattei (MAASAI, CRISAM

TL;DR
This paper investigates the impact of informative labels in semi-supervised learning (SSL). It proposes methods to estimate the missing-data mechanism, debias SSL algorithms, and test whether labels are informative, with applications to medical datasets.
Contribution
It introduces a novel approach to estimate and leverage the missing-data mechanism in SSL, including a likelihood ratio test for label informativeness.
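As a rough illustration of what a likelihood ratio test for label informativeness can look like, the sketch below compares a single labeling probability shared by all classes (labels not informative) against one labeling probability per class (labels informative). It is a simplified sketch, not the authors' procedure: the class probabilities `probs` are assumed to come from some already-fitted classifier and are held fixed, the EM routine is a stand-in optimizer, and all function names are hypothetical.

```python
import math


def mechanism_loglik(probs, labels, phi):
    """Observed-data log-likelihood of the labeling mechanism (terms in phi only).

    probs:  per-example lists of class probabilities (assumed given, held fixed)
    labels: int class index, or None where the label is missing
    phi:    phi[k] = P(label observed | class k)
    """
    ll = 0.0
    for p, y in zip(probs, labels):
        if y is not None:
            ll += math.log(phi[y])  # label observed, class known
        else:
            # label missing: marginalize over the unknown class
            ll += math.log(sum((1.0 - f) * q for f, q in zip(phi, p)))
    return ll


def fit_phi_em(probs, labels, n_iter=500):
    """EM estimate of class-wise labeling probabilities.

    Assumes every class has at least one labeled example and that
    some examples are unlabeled.
    """
    k = len(probs[0])
    n_lab = [0] * k
    for y in labels:
        if y is not None:
            n_lab[y] += 1
    unlabeled = [p for p, y in zip(probs, labels) if y is None]
    phi = [sum(n_lab) / len(labels)] * k  # start from the shared-rate estimate
    for _ in range(n_iter):
        # E-step: expected number of unlabeled examples in each class
        exp_unlab = [0.0] * k
        for p in unlabeled:
            w = [(1.0 - f) * q for f, q in zip(phi, p)]
            s = sum(w)
            for j in range(k):
                exp_unlab[j] += w[j] / s
        # M-step: labeled count over (labeled + expected unlabeled) count
        phi = [n_lab[j] / (n_lab[j] + exp_unlab[j]) for j in range(k)]
    return phi


def lrt_statistic(probs, labels):
    """Return (statistic, degrees of freedom, class-wise phi estimate)."""
    k = len(probs[0])
    n_labeled = sum(y is not None for y in labels)
    phi0 = n_labeled / len(labels)  # MLE when all classes share one rate
    ll0 = mechanism_loglik(probs, labels, [phi0] * k)
    phi1 = fit_phi_em(probs, labels)
    ll1 = mechanism_loglik(probs, labels, phi1)
    return 2.0 * (ll1 - ll0), k - 1, phi1


# Toy data with a perfectly confident classifier: class 0 is labeled
# 3 times out of 4, class 1 only once out of 4.
probs = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
labels = [0, 0, 0, None, 1, None, None, None]
stat, df, phi_hat = lrt_statistic(probs, labels)
print(stat, df, phi_hat)
```

In a sketch like this, the statistic would typically be compared with a chi-square distribution with `df` degrees of freedom. Whether that calibration holds when the class probabilities are themselves estimated is an assumption made here, not a claim taken from the paper.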
Findings
The missing-data mechanism can be estimated effectively.
Debiasing SSL algorithms with inverse propensity weighting improves performance.
The method is applied successfully to two medical datasets under pseudo-realistic missing-data scenarios.
Abstract
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing data scenarios.
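To make the inverse propensity weighting idea in the abstract concrete, here is a minimal sketch of a reweighted supervised loss. It assumes the labeling probability depends only on the class, written `phi[k]`, and that these probabilities are already known or estimated. The function name, the use of cross-entropy, and the normalization by the total sample size are illustrative choices, not details confirmed by the source.

```python
import math


def ipw_labeled_loss(probs, labels, phi):
    """Inverse-propensity-weighted cross-entropy on the labeled points.

    probs:  per-example lists of predicted class probabilities
    labels: int class index, or None where the label is missing
    phi:    phi[k] = assumed P(label observed | class k), each in (0, 1]
    """
    n = len(probs)
    total = 0.0
    for p, y in zip(probs, labels):
        if y is None:
            continue  # unlabeled points do not enter the supervised term
        # A class that is rarely labeled (small phi[y]) gets a large weight.
        total += -math.log(p[y]) / phi[y]
    return total / n  # normalize by the full sample size, labeled or not


# Four examples, two labeled; the model predicts 50/50 everywhere.
probs = [[0.5, 0.5]] * 4
labels = [0, 1, None, None]

# Both classes labeled half the time: each labeled point counts twice.
loss_equal = ipw_labeled_loss(probs, labels, phi=[0.5, 0.5])

# Class 1 labeled less often than class 0: its labeled point is upweighted.
loss_skewed = ipw_labeled_loss(probs, labels, phi=[1.0, 0.5])
print(loss_equal, loss_skewed)
```

The weights `1 / phi[y]` compensate for classes that are underrepresented among the labeled points, so that the labeled term targets the loss one would compute if every label were observed. A full SSL objective would add an unlabeled term, which is omitted here.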
Taxonomy
Topics: Statistical Methods and Inference · Machine Learning and Data Classification · Statistical Methods and Bayesian Inference
Methods: Test
