# Semi-supervised Gaussian mixture modelling with a missing-data mechanism in R

**Authors:** Ziyang Lyu, Daniel Ahfock, Ryan Thompson, Geoffrey J. McLachlan

arXiv: 2302.13206 · 2024-04-18

## TL;DR

This paper introduces gmmsslm, an R package for semi-supervised Gaussian mixture modelling that treats missing labels through a logistic missingness mechanism driven by the entropy of each feature vector, which can improve classifier accuracy when the training data are only partially labelled.

## Contribution

The paper presents an R implementation of a Gaussian mixture modelling framework with a missing-data mechanism, covering multiple classes with arbitrary covariance matrices, and discusses a strategy for initialising the fitting algorithm.

## Key findings

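- The package estimates the Bayes' classifier from partially classified data in which the feature vector is multivariate Gaussian within each class.
- With the missingness mechanism modelled, the classifier fitted to partially classified data can have a lower error rate than one estimated from the completely classified sample; this result was established for two Gaussian classes with a common covariance matrix.
- The package is demonstrated on real data.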

## Abstract

Semi-supervised learning is being extensively applied to estimate classifiers from training data in which not all the labels of the feature vectors are available. We present gmmsslm, an R package for estimating the Bayes' classifier from such partially classified data in the case where the feature vector has a multivariate Gaussian (normal) distribution in each of the predefined classes. Our package implements a recently proposed Gaussian mixture modelling framework that incorporates a missingness mechanism for the missing labels in which the probability of a missing label is represented via a logistic model with covariates that depend on the entropy of the feature vector. Under this framework, it has been shown that the accuracy of the Bayes' classifier formed from the Gaussian mixture model fitted to the partially classified training data can even have lower error rate than if it were estimated from the sample completely classified. This result was established in the particular case of two Gaussian classes with a common covariance matrix. Here, we focus on the effective implementation of an algorithm for multiple Gaussian classes with arbitrary covariance matrices. A strategy for initialising the algorithm is discussed and illustrated. The new package is demonstrated on some real data.
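The entropy-driven missingness mechanism described in the abstract can be illustrated with a minimal sketch. It is written in plain Python for a univariate two-class mixture and is not the gmmsslm API: the function names, the parameters `xi0` and `xi1`, the use of the log of the Shannon entropy as the covariate, and all numeric values are assumptions made for illustration; the exact form used by the package is given in the full paper.

```python
import math

def gaussian_pdf(y, mean, var):
    """Density of a univariate normal distribution at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posterior_probs(y, weights, means, variances):
    """Posterior class-membership probabilities of y under the mixture."""
    joint = [w * gaussian_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def entropy(probs):
    """Shannon entropy of a vector of class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def missing_label_prob(y, weights, means, variances, xi0, xi1):
    """Probability that the label of y is missing.

    Assumed form: logistic in log-entropy, with hypothetical
    coefficients xi0 (intercept) and xi1 (slope).
    """
    e = entropy(posterior_probs(y, weights, means, variances))
    # Floor the entropy so the logarithm stays finite for near-certain points.
    return sigmoid(xi0 + xi1 * math.log(max(e, 1e-300)))

# Illustrative two-class mixture (made-up parameter values).
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
xi0, xi1 = 0.0, 2.0

p_boundary = missing_label_prob(0.0, weights, means, variances, xi0, xi1)
p_far = missing_label_prob(3.0, weights, means, variances, xi0, xi1)
print(p_boundary, p_far)
```

With the positive slope chosen here, a point near the class boundary, where the posterior probabilities are close to equal and the entropy is high, is more likely to be unlabelled than a point deep inside one class.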

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2302.13206/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/2302.13206/full.md

## References

38 references — full list in the complete paper: https://tomesphere.com/paper/2302.13206/full.md

---
Source: https://tomesphere.com/paper/2302.13206