An Apparent Paradox: A Classifier Trained from a Partially Classified Sample May Have Smaller Expected Error Rate Than That If the Sample Were Completely Classified
Daniel Ahfock, Geoffrey J. McLachlan

TL;DR
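This paper studies semi-supervised learning and proposes a framework for class labels that are not missing at random. Under such a missingness mechanism, a classifier trained on a partially classified sample can have a smaller expected error rate than one trained on the same sample completely classified, contrary to the conventional expectation.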
Contribution
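It introduces a missing-data framework in which the probability that a label is missing follows an entropy-dependent logistic model, and shows that under this framework a partially classified sample can sometimes yield a classifier with a smaller expected error rate than the completely classified sample.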
Findings
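When labels are not missing at random, a classifier trained on a partially classified sample can have a smaller expected error rate than one trained on the completely classified sample.
Missing labels tend to be concentrated in high-entropy regions of the feature space, and this pattern of missingness affects classifier performance.
Modeling label missingness with an entropy-dependent logistic model accounts for the apparent paradox.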
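As a rough illustration of what an entropy-dependent logistic model for missingness could look like, the sketch below computes the Shannon entropy of the posterior class probabilities at a feature value and passes its logarithm through a logistic function. The two-class univariate normal mixture, the linear-in-log-entropy form, the function names (`posterior_probs`, `entropy`, `missing_prob`), and the coefficients `xi0` and `xi1` are all assumptions made for this example and are not taken from the summary above.

```python
import numpy as np

def posterior_probs(y, mu0, mu1, sigma, pi1=0.5):
    """Posterior class probabilities for an assumed two-class
    univariate normal mixture with a common standard deviation."""
    d0 = (1.0 - pi1) * np.exp(-0.5 * ((y - mu0) / sigma) ** 2)
    d1 = pi1 * np.exp(-0.5 * ((y - mu1) / sigma) ** 2)
    tau1 = d1 / (d0 + d1)
    return np.stack([1.0 - tau1, tau1], axis=-1)

def entropy(tau, eps=1e-12):
    """Shannon entropy of the posterior probabilities (last axis)."""
    tau = np.clip(tau, eps, 1.0)  # avoid log(0) far from the boundary
    return -np.sum(tau * np.log(tau), axis=-1)

def missing_prob(y, xi0, xi1, mu0, mu1, sigma, pi1=0.5):
    """Hypothetical missingness model: logistic in log-entropy.
    With xi1 > 0, labels are more likely to be missing where the
    class membership is most uncertain."""
    e = entropy(posterior_probs(y, mu0, mu1, sigma, pi1))
    z = xi0 + xi1 * np.log(e)
    return 1.0 / (1.0 + np.exp(-z))

# Compare a point on the decision boundary with one far from it.
params = dict(xi0=1.0, xi1=2.0, mu0=-1.0, mu1=1.0, sigma=1.0)
p_boundary = missing_prob(0.0, **params)
p_far = missing_prob(4.0, **params)
print(p_boundary, p_far)
```

With a positive `xi1`, the feature on the decision boundary has a much higher probability of a missing label than the feature far from it, which is the kind of concentration in high-entropy regions described in the findings.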
Abstract
There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with unknown class label is less (considerably less for weakly separated classes) than that of a classified feature which has known class label. Hence assuming that the labels of the unclassified features are randomly missing or their missing-label mechanism is simply ignored, the expected error rate of a classifier formed from a partially classified sample is greater than that if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness in situations where these labels are not randomly missing. An examination of several partially classified data sets in the literature suggests that the unclassified features are not…
Taxonomy
Topics: Face and Expression Recognition · Imbalanced Data Classification Techniques · Advanced Statistical Methods and Models
