The Role of Pseudo-labels in Self-training Linear Classifiers on High-dimensional Gaussian Mixture Data
Takashi Takahashi

TL;DR
This paper provides a theoretical analysis of self-training for linear classifiers on high-dimensional Gaussian mixture data, revealing how iteration count and label imbalance affect generalization and proposing heuristics to improve performance.
Contribution
The paper gives a sharp asymptotic characterization of self-training dynamics, explains when the method is effective and where it falls short, and introduces heuristics to handle label imbalance.
Findings
Self-training improves generalization by fitting reliable pseudo-labels early and refining the decision boundary with soft labels later.
With many iterations, small incremental updates gradually improve the direction of the classifier, extracting information from the data in a nearly noiseless way.
Label imbalance degrades self-training, but the proposed heuristics mitigate the drop and bring its performance close to that of supervised learning.
Abstract
Self-training (ST) is a simple yet effective semi-supervised learning method. However, why and how ST improves generalization performance by using potentially erroneous pseudo-labels is still not well understood. To deepen the understanding of ST, we derive and analyze a sharp characterization of the behavior of iterative ST when training a linear classifier by minimizing the ridge-regularized convex loss on binary Gaussian mixtures, in the asymptotic limit where input dimension and data size diverge proportionally. The results show that ST improves generalization in different ways depending on the number of iterations. When the number of iterations is small, ST improves generalization performance by fitting the model to relatively reliable pseudo-labels and updating the model parameters by a large amount at each iteration. This suggests that ST works intuitively. On the other hand,…
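The setup described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's method: squared loss with a closed-form ridge solution stands in for the general convex loss, a fresh unlabeled batch is drawn at every iteration, and the pseudo-labels are either the sign (hard) or the tanh (soft) of the previous model's scores. All function names, parameter values, and the choice of mean vector are hypothetical.

```python
import numpy as np


def sample_gmm(n, mu, rng):
    """Draw n points from a balanced binary Gaussian mixture x = y * mu + noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu[None, :] + rng.standard_normal((n, mu.size))
    return x, y


def ridge_fit(x, targets, lam):
    """Closed-form minimizer of squared loss plus (lam / 2) * ||w||^2 (up to scaling)."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ targets)


def self_train(n_iters=5, d=50, n_labeled=20, n_unlabeled=2000,
               lam=1.0, soft=False, seed=0):
    """Run iterative self-training and return test accuracy after each step.

    Assumption: squared loss replaces the paper's generic convex loss, and
    every iteration refits on a new unlabeled batch labeled by the previous model.
    """
    rng = np.random.default_rng(seed)
    mu = np.full(d, 2.0 / np.sqrt(d))  # hypothetical cluster mean with norm 2
    x_test, y_test = sample_gmm(5000, mu, rng)

    def accuracy(w):
        return float(np.mean(np.sign(x_test @ w) == y_test))

    # Step 0: supervised fit on the small labeled set.
    x_l, y_l = sample_gmm(n_labeled, mu, rng)
    w = ridge_fit(x_l, y_l, lam)
    history = [accuracy(w)]

    # Steps 1..n_iters: pseudo-label a fresh unlabeled batch, then refit.
    for _ in range(n_iters):
        x_u, _ = sample_gmm(n_unlabeled, mu, rng)  # true labels are discarded
        scores = x_u @ w
        pseudo = np.tanh(scores) if soft else np.sign(scores)
        w = ridge_fit(x_u, pseudo, lam)
        history.append(accuracy(w))
    return history


if __name__ == "__main__":
    print("hard pseudo-labels:", self_train(soft=False))
    print("soft pseudo-labels:", self_train(soft=True))
```

Comparing the two returned accuracy lists gives a rough feel for how hard and soft pseudo-labels behave over iterations in this toy setting. The sketch does not reproduce the paper's asymptotic analysis or its label-imbalance heuristics.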
Taxonomy
Topics: Face and Expression Recognition · Machine Learning and Data Classification
