Generalization error bounds in semi-supervised classification under the cluster assumption
Philippe Rigollet (PMA)

TL;DR
This paper derives generalization error bounds for semi-supervised classification under the cluster assumption, leveraging density level set estimation to improve convergence rates with both labeled and unlabeled data.
Contribution
It provides a formal mathematical formulation of the cluster assumption and introduces a method that exploits it for faster convergence in semi-supervised learning.
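As a rough sketch of what such a formulation can look like (the notation here is assumed for illustration and is not quoted from the paper), write $p$ for the marginal density of $X$ and $\eta$ for the regression function. Clusters are taken to be the connected components of a density level set, and the assumption says that the best label does not change within a cluster:

```latex
% Illustrative notation (assumed, not taken verbatim from the paper):
%   p       marginal density of X
%   \eta    regression function, \eta(x) = P(Y = 1 \mid X = x)
%   \lambda density level, T_1, \dots, T_J the resulting clusters
\[
  \Gamma(\lambda) = \{\, x : p(x) \ge \lambda \,\} = T_1 \cup \dots \cup T_J ,
  \qquad T_j \text{ connected and pairwise disjoint},
\]
\[
  \text{for each } j \in \{1, \dots, J\}: \quad
  \operatorname{sign}\!\bigl(\eta(x) - \tfrac{1}{2}\bigr)
  \text{ is constant for } x \in T_j .
\]
```

Under this reading, the Bayes classifier is constant on each cluster, so unlabeled data can be used to locate the clusters and labeled data only to decide one label per cluster.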
Findings
Achieves fast convergence rates in both the number of unlabeled examples and the number of labeled examples
Formalizes the cluster assumption mathematically
Demonstrates benefits of density level set estimation
Abstract
We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known "cluster assumption" as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level set estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples.
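The idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's estimator: it assumes a Gaussian kernel density estimate, approximates the clusters of the density level set by linking retained unlabeled points that lie within a fixed radius of each other, and labels each cluster by majority vote of the labeled points that fall in it. The function names and the parameters `level`, `bandwidth`, and `radius` are hypothetical choices made for this example.

```python
import numpy as np

def kde(points, queries, bandwidth):
    """Gaussian kernel density estimate built from `points`, evaluated at `queries`."""
    d = points.shape[1]
    sq = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    norm = (2 * np.pi * bandwidth ** 2) ** (d / 2)
    return np.exp(-sq / (2 * bandwidth ** 2)).mean(axis=1) / norm

def connected_components(points, radius):
    """Give the same integer label to points linked by hops of length <= radius."""
    n = len(points)
    labels = -np.ones(n, dtype=int)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over the radius graph
            i = stack.pop()
            for j in np.flatnonzero((dist[i] <= radius) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

def fit(X_unlabeled, X_labeled, y_labeled, level, bandwidth, radius):
    """Estimate clusters from unlabeled data, then vote with the labeled data."""
    density = kde(X_unlabeled, X_unlabeled, bandwidth)
    core = X_unlabeled[density >= level]       # empirical density level set
    comp = connected_components(core, radius)  # estimated clusters
    n_clusters = comp.max() + 1 if len(comp) else 0
    votes = np.zeros((n_clusters, 2))
    if len(core):
        for x, y in zip(X_labeled, y_labeled):
            d = np.linalg.norm(core - x, axis=1)
            j = d.argmin()
            if d[j] <= radius:                 # labeled point falls in a cluster
                votes[comp[j], int(y)] += 1
    # Majority vote per cluster (ties go to class 0); -1 if no labeled point voted.
    cluster_label = np.where(votes.sum(axis=1) > 0, votes.argmax(axis=1), -1)
    return core, comp, cluster_label

def predict(model, X, radius):
    """Return the cluster's label, or -1 for points outside every estimated cluster."""
    core, comp, cluster_label = model
    out = -np.ones(len(X), dtype=int)
    if len(core) == 0:
        return out
    for i, x in enumerate(X):
        d = np.linalg.norm(core - x, axis=1)
        j = d.argmin()
        if d[j] <= radius:
            out[i] = cluster_label[comp[j]]
    return out

# Toy data: two well-separated blobs, many unlabeled points, four labeled ones.
rng = np.random.default_rng(0)
X_unlabeled = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(200, 2)),
])
X_labeled = np.array([[-3.0, 0.0], [-3.2, 0.1], [3.0, 0.0], [3.1, -0.2]])
y_labeled = np.array([0, 0, 1, 1])

model = fit(X_unlabeled, X_labeled, y_labeled, level=0.02, bandwidth=0.5, radius=1.0)
core, comp, cluster_label = model
pred = predict(model, np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 0.0]]), radius=1.0)
print(pred)
```

In this sketch the unlabeled sample does the geometric work of locating the clusters, while the labeled sample only has to settle one vote per cluster. Points outside the estimated level set are left undecided (`-1`) rather than forced into a class, a simplification chosen for the example.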
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Statistical Methods and Models · Bayesian Methods and Mixture Models
