An Apparent Paradox: A Classifier Trained from a Partially Classified Sample May Have Smaller Expected Error Rate Than That If the Sample Were Completely Classified

Daniel Ahfock; Geoffrey J. McLachlan

arXiv:1910.09189·stat.ME·November 11, 2019·1 cites

TL;DR

This paper examines semi-supervised learning and proposes a framework for labels that are not missing at random. Under this framework, a classifier trained on a partially labeled sample can have a smaller expected error rate than one trained on the same sample with every label known, contrary to the conventional expectation.

Contribution

The paper introduces a missing-data framework that models label missingness with entropy-dependent logistic models, and shows that a classifier trained on partially labeled data can sometimes have a smaller expected error rate than one trained on the fully labeled sample.
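To make the idea of an entropy-dependent missingness model concrete, here is a minimal sketch under stated assumptions. It is not the authors' specification: the two-class univariate normal model, the function names, and the coefficients `xi0` and `xi1` are all hypothetical choices for illustration. The sketch computes the Shannon entropy of the posterior class probabilities at a feature value and passes it through a logistic function to obtain the probability that the label is missing.

```python
import math
import random


def posterior_class1(y, mu0=-1.0, mu1=1.0, sigma=1.0, pi1=0.5):
    """Posterior probability of class 1 under an assumed two-class normal model."""
    d0 = math.exp(-0.5 * ((y - mu0) / sigma) ** 2)
    d1 = math.exp(-0.5 * ((y - mu1) / sigma) ** 2)
    return pi1 * d1 / ((1.0 - pi1) * d0 + pi1 * d1)


def entropy(tau):
    """Shannon entropy (in nats) of the two-class posterior (tau, 1 - tau)."""
    return -sum(p * math.log(p) for p in (tau, 1.0 - tau) if p > 0.0)


def missing_prob(y, xi0=-3.0, xi1=8.0):
    """Assumed logistic model: P(label missing | y) rises with posterior entropy.

    xi0 and xi1 are hypothetical coefficients, not values from the paper.
    """
    e = entropy(posterior_class1(y))
    return 1.0 / (1.0 + math.exp(-(xi0 + xi1 * e)))


def simulate(n=1000, seed=0):
    """Draw (feature, label) pairs and hide each label with probability missing_prob."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        label = rng.randint(0, 1)
        y = rng.gauss(1.0 if label == 1 else -1.0, 1.0)
        observed = rng.random() >= missing_prob(y)
        sample.append((y, label if observed else None))  # None marks a missing label
    return sample
```

With these assumed settings, features near the midpoint between the class means (where the posterior entropy is highest) are the most likely to lose their labels, while features far from the boundary are usually labeled.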

Findings

01 Classifiers trained on partially labeled data can outperform those trained on fully labeled data.

02 Missing labels tend to be concentrated in high-entropy regions, where class membership is most uncertain, and this pattern affects classifier performance.

03 Modeling label missingness with entropy-dependent logistic models explains the paradox.

Abstract

There has been increasing interest in using semi-supervised learning to form a classifier. As is well known, the (Fisher) information in an unclassified feature with an unknown class label is less (considerably less for weakly separated classes) than that of a classified feature which has a known class label. Hence, assuming that the labels of the unclassified features are randomly missing or their missing-label mechanism is simply ignored, the expected error rate of a classifier formed from a partially classified sample is greater than it would be if the sample were completely classified. We propose to treat the labels of the unclassified features as missing data and to introduce a framework for their missingness in situations where these labels are not randomly missing. An examination of several partially classified data sets in the literature suggests that the unclassified features are not…

Taxonomy

Topics: Face and Expression Recognition · Imbalanced Data Classification Techniques · Advanced Statistical Methods and Models