Generalization error bounds in semi-supervised classification under the cluster assumption

Philippe Rigollet (PMA)

arXiv:math/0604233·math.ST·June 13, 2007·110 cites

TL;DR

This paper derives generalization error bounds for semi-supervised classification under the cluster assumption, leveraging density level set estimation to improve convergence rates with both labeled and unlabeled data.

Contribution

It provides a formal mathematical formulation of the cluster assumption and introduces a method that exploits it for faster convergence in semi-supervised learning.

Findings

01 Achieves fast rates of convergence in both the number of unlabeled examples and the number of labeled examples

02 Formalizes the cluster assumption mathematically

03 Shows that density level set estimation gives a way to exploit the cluster assumption

Abstract

We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known "cluster assumption" as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples.
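
To make the idea in the abstract concrete, here is a minimal Python sketch of one way a density-level-set classifier could be organized. It is an illustration under simplifying assumptions that do not come from the paper: the data are one-dimensional, the density is estimated with a Gaussian kernel on a fixed grid, clusters are taken to be the connected runs of grid cells where the estimate exceeds a hand-picked level, and each cluster is labeled by majority vote among the labeled points that fall in it. All function names, the bandwidth, and the level are hypothetical choices made for this example.

```python
import numpy as np

def gaussian_kde_1d(points, grid, bandwidth):
    # Gaussian kernel density estimate of `points`, evaluated on `grid`.
    diffs = (grid[:, None] - points[None, :]) / bandwidth
    norm = len(points) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / norm

def level_set_clusters(density, level):
    # Give each connected run of grid cells with density >= level its own
    # positive id; cells outside the level set get id 0.
    above = density >= level
    starts = above & ~np.concatenate(([False], above[:-1]))
    return np.where(above, np.cumsum(starts), 0)

def nearest_cells(x, grid):
    # Index of the grid cell closest to each query point.
    return np.abs(grid[None, :] - np.asarray(x)[:, None]).argmin(axis=1)

def fit_cluster_votes(x_unlabeled, x_labeled, y_labeled, grid, bandwidth, level):
    # Step 1 (unlabeled data): estimate the density and its level-set clusters.
    density = gaussian_kde_1d(x_unlabeled, grid, bandwidth)
    cluster_of_cell = level_set_clusters(density, level)
    # Step 2 (labeled data): majority vote inside each cluster.
    votes = {}
    for c, y in zip(cluster_of_cell[nearest_cells(x_labeled, grid)], y_labeled):
        if c > 0:
            votes.setdefault(int(c), []).append(int(y))
    labels = {c: int(np.mean(v) >= 0.5) for c, v in votes.items()}
    return cluster_of_cell, labels

def predict(x, grid, cluster_of_cell, labels, default=0):
    # Points outside every labeled cluster fall back to `default` (an
    # arbitrary choice made for this sketch).
    cells = nearest_cells(x, grid)
    return np.array([labels.get(int(cluster_of_cell[i]), default) for i in cells])

# Toy data: two well-separated clusters, many unlabeled points, few labels.
rng = np.random.default_rng(0)
x_unlabeled = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
x_labeled = np.array([-3.2, -2.8, -3.0, 2.9, 3.1])
y_labeled = np.array([0, 0, 0, 1, 1])
grid = np.linspace(-6, 6, 601)

cluster_of_cell, labels = fit_cluster_votes(
    x_unlabeled, x_labeled, y_labeled, grid, bandwidth=0.3, level=0.05
)
predictions = predict([-3.0, 3.0], grid, cluster_of_cell, labels)
```

In this sketch the unlabeled sample is used only to locate the clusters and the labeled sample only to vote within them, which is one reading of how the two sample sizes can enter an error bound separately.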

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Statistical Methods and Models · Bayesian Methods and Mixture Models