The Use of Unlabeled Data in Predictive Modeling
Feng Liang, Sayan Mukherjee, Mike West

TL;DR
This paper reviews the statistical foundations of using unlabeled data in predictive modeling, emphasizing when and why unlabeled data can improve accuracy, supported by examples and real data analyses.
Contribution
It clarifies the theoretical basis for semi-supervised learning, connecting classical sampling concepts with modern predictive modeling techniques.
Findings
Unlabeled data can enhance predictive accuracy under specific conditions.
The paper links traditional sampling theory to semi-supervised learning.
Real data examples demonstrate practical benefits of unlabeled data.
Abstract
The incorporation of unlabeled data in regression and classification analysis is an increasing focus of the applied statistics and machine learning literatures, with a number of recent examples demonstrating the potential for unlabeled data to contribute to improved predictive accuracy. The statistical basis for this semisupervised analysis does not appear to have been well delineated; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to become more fully engaged in the vigorous research in this important area of intersection of the statistical and computer sciences. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabeled data in the context of particular algorithms, rather than probabilistic and statistical questions. This…
