Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data

Eyar Azar; Boaz Nadler

arXiv:2409.03335·stat.ML·September 6, 2024

TL;DR

This paper analyzes semi-supervised learning for high-dimensional sparse Gaussian classification and identifies parameter regimes where combining labeled and unlabeled data provably helps, particularly for feature selection.

Contribution

The work identifies regimes where semi-supervised learning offers provable advantages and can be computationally efficient for high-dimensional sparse Gaussian classification.

Findings

01 SSL improves feature selection accuracy in certain regimes.

02 Polynomial-time SSL classifiers outperform supervised-only methods.

03 Simulations confirm theoretical benefits of SSL in high dimensions.

Abstract

The premise of semi-supervised learning (SSL) is that combining labeled and unlabeled data yields significantly more accurate models. Despite empirical successes, the theoretical understanding of SSL is still far from complete. In this work, we study SSL for high-dimensional sparse Gaussian classification. To construct an accurate classifier, a key task is feature selection: detecting the few variables that separate the two classes. For this SSL setting, we analyze information-theoretic lower bounds for accurate feature selection as well as computational lower bounds, assuming the low-degree likelihood hardness conjecture. Our key contribution is the identification of a regime in the problem parameters (dimension, sparsity, number of labeled and unlabeled samples) where SSL is guaranteed to be advantageous for classification. Specifically, there is a regime where it is possible to…
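To make the setting in the abstract concrete, here is a minimal sketch of a two-class sparse Gaussian model and two simple feature-selection scores, one using only labels and one using only unlabeled samples. It is an illustration under stated assumptions, not the authors' algorithm: the model form x = y * mu + noise with identity covariance, the function names, and the scoring rules are all hypothetical choices made for this example.

```python
import random


def sample_sparse_gaussian(n, p, support, signal, rng):
    """Draw n pairs (x, y) with y in {-1, +1} and x = y * mu + standard normal noise.

    Assumption for this sketch: mu equals `signal` on the indices in `support`
    and zero elsewhere, so only those few coordinates separate the classes.
    """
    samples = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in support:
            x[j] += y * signal
        samples.append((x, y))
    return samples


def top_k_by_labeled_mean(labeled, k):
    """Labels-only score: |mean of y * x_j|, large when feature j tracks the label."""
    n, p = len(labeled), len(labeled[0][0])
    scores = [abs(sum(y * x[j] for x, y in labeled)) / n for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: -scores[j])[:k])


def top_k_by_second_moment(unlabeled, k):
    """Unlabeled-only score: mean of x_j ** 2.

    Under the model above this is about 1 + signal ** 2 on the support
    and about 1 elsewhere, so no labels are needed to compute it.
    """
    n, p = len(unlabeled), len(unlabeled[0])
    scores = [sum(x[j] ** 2 for x in unlabeled) / n for j in range(p)]
    return sorted(sorted(range(p), key=lambda j: -scores[j])[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    support = [0, 1, 2]  # hypothetical informative features
    labeled = sample_sparse_gaussian(200, 50, support, 2.0, rng)
    unlabeled = [x for x, _ in sample_sparse_gaussian(500, 50, support, 2.0, rng)]
    print(top_k_by_labeled_mean(labeled, 3))
    print(top_k_by_second_moment(unlabeled, 3))
```

With a strong signal and many samples, as in the demo values, either score alone finds the informative features. The regimes the abstract refers to are the harder ones, where the signal is weak relative to the dimension and sample sizes, and this sketch does not attempt to reproduce them.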

Videos

Semi-Supervised Sparse Gaussian Classification: Provable Benefits of Unlabeled Data (SlidesLive)

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Feature Selection