Semi-supervised Gaussian mixture modelling with a missing-data mechanism in R

Ziyang Lyu; Daniel Ahfock; Ryan Thompson; Geoffrey J. McLachlan

arXiv:2302.13206·stat.CO·April 18, 2024


Open Access

TL;DR

This paper introduces gmmsslm, an R package for semi-supervised Gaussian mixture modeling that accounts for missing labels using a logistic missingness mechanism, improving classifier accuracy even with partially labeled data.

Contribution

The paper presents an implementation of a Gaussian mixture modeling framework with a missing data mechanism for multiple classes and arbitrary covariances in R.

Findings

01. The package effectively estimates classifiers from partially labeled data.

02. Incorporating a missingness mechanism improves classification accuracy.

03. Demonstrated on real datasets, showing practical utility.


Tables

Table 1: Structure of the model parameters in package gmmsslm.

Parameters	R arguments	Dimensions	Description
$\boldsymbol{\pi}$	pi	$g \times 1$	The mixing proportions
$\boldsymbol{\mu}$	mu	$p \times g$	The location parameters
$\boldsymbol{\Sigma}$	sigma	$p \times p$	The common covariance matrix
$\boldsymbol{\Sigma}$	sigma	$p \times p \times g$	The unequal covariance matrices
$\boldsymbol{\xi}$	xi	$2 \times 1$	The missing-data parameters

Table 2: Arguments of the gmmsslm() function.

R arguments	Description
dat	An $n \times p$ matrix where each row represents an individual observation
zm	An $n$-dimensional vector containing the class labels, with a missing label denoted by NA
pi	A $g$-dimensional vector for the initial values of the mixing proportions
mu	A $p \times g$ matrix for the initial values of the location parameters
sigma	A $p \times p$ matrix, or a $p \times p \times g$ array, for the initial values of the covariance matrices. The model is fit with a common covariance matrix if sigma is a $p \times p$ covariance matrix; otherwise the model is fit with unequal covariance matrices
paralist	A list containing pi, mu, and sigma; these arguments need not be supplied if paralist is supplied
xi	A 2-dimensional vector containing the initial values of the coefficients in the logistic function of the Shannon entropy
type	One of three types of Gaussian mixture models: ‘full’ fits the model to a partially classified sample on the basis of the full likelihood, taking into account the missing-data mechanism; ‘ign’ fits the model to a partially classified sample on the basis of the likelihood that ignores the missing-data mechanism; and ‘com’ fits the model to a completely classified sample
iter.max	Maximum number of iterations allowed. Defaults to 500
eval.max	Maximum number of evaluations of the objective function allowed. Defaults to 500
rel.tol	Relative tolerance. Defaults to 1e-15
sing.tol	Singular convergence tolerance. Defaults to 1e-15

Table 3: Results for the Gastrointestinal dataset. The dataset has $n=76$ observations, $p=4$ features, and $g=2$ classes.

	$n_{c}$ (classified)	$n_{u c}$ (unclassified)	Error rate
$R(\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{full})})$	35	41	0.158
$R(\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})})$	35	41	0.211
$R(\hat{\boldsymbol{\theta}}_{\mathrm{CC}})$	76	0	0.171

Equations

$$\boldsymbol{\theta}=(\pi_{1},\,\ldots,\,\pi_{g-1},\,\boldsymbol{\omega}_{1}^{T},\,\ldots,\,\boldsymbol{\omega}_{g}^{T})^{T}.$$

$$k=\arg\max_{i}\,\tau_{i}(\boldsymbol{y};\boldsymbol{\theta}),$$

$$\tau_{i}(\boldsymbol{y};\boldsymbol{\theta})=\frac{\pi_{i}f_{i}(\boldsymbol{y};\boldsymbol{\omega}_{i})}{\sum_{h=1}^{g}\pi_{h}f_{h}(\boldsymbol{y};\boldsymbol{\omega}_{h})}$$

$$f_{i}(\boldsymbol{y};\boldsymbol{\omega}_{i})=\phi(\boldsymbol{y};\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})\quad(i=1,\,\ldots,\,g),$$

$$f(\boldsymbol{y};\boldsymbol{\theta})=\sum_{i=1}^{g}\pi_{i}\phi(\boldsymbol{y};\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i}).$$

$$\log L_{\mathrm{C}}(\boldsymbol{\theta})$$

$$\log L_{\mathrm{UC}}(\boldsymbol{\theta})$$

$$\log L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})$$

$$e(\boldsymbol{y}_{j};\boldsymbol{\theta})=-\sum_{i=1}^{g}\tau_{i}(\boldsymbol{y}_{j};\boldsymbol{\theta})\log\tau_{i}(\boldsymbol{y}_{j};\boldsymbol{\theta}).$$

$$\Pr\{M_{j}=1\mid\boldsymbol{y}_{j},z_{j}\}=\Pr\{M_{j}=1\mid\boldsymbol{y}_{j}\}=q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi}),$$

$$q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})=\frac{\exp\{\xi_{0}+\xi_{1}\log e(\boldsymbol{y}_{j};\boldsymbol{\theta})\}}{1+\exp\{\xi_{0}+\xi_{1}\log e(\boldsymbol{y}_{j};\boldsymbol{\theta})\}},$$

$$q(\boldsymbol{y}_{j};\boldsymbol{\beta},\boldsymbol{\xi})=\frac{\exp\{\xi_{0}-\xi_{1}d^{2}(\boldsymbol{y}_{j};\boldsymbol{\beta})\}}{1+\exp\{\xi_{0}-\xi_{1}d^{2}(\boldsymbol{y}_{j};\boldsymbol{\beta})\}},$$

$$d(\boldsymbol{y}_{j};\boldsymbol{\beta})=\beta_{0}+\boldsymbol{\beta}_{1}^{\top}\boldsymbol{y}_{j},$$

$$\beta_{0}=\log(\pi_{1}/\pi_{2})-\tfrac{1}{2}(\boldsymbol{\mu}_{1}+\boldsymbol{\mu}_{2})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})\quad\text{and}\quad\boldsymbol{\beta}_{1}=\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}).$$

$$f(\boldsymbol{y}_{j},z_{j},m_{j}=0)\quad\text{and}\quad f(\boldsymbol{y}_{j},m_{j}=1),$$

$$f(\boldsymbol{y}_{j},z_{j},m_{j}=0)=f(z_{j})f(\boldsymbol{y}_{j}\mid z_{j})\Pr\{M_{j}=0\mid\boldsymbol{y}_{j},z_{j}\}=\prod_{i=1}^{g}\{\pi_{i}f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\}^{z_{ij}}\{1-q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})\},$$

$$f(\boldsymbol{y}_{j},m_{j}=1)=f(\boldsymbol{y}_{j})\Pr\{M_{j}=1\mid\boldsymbol{y}_{j}\}=\sum_{i=1}^{g}\pi_{i}f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\,q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi}).$$

$$L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})=\prod_{j=1}^{n}\{f(\boldsymbol{y}_{j},z_{j},m_{j}=0)\}^{1-m_{j}}\{f(\boldsymbol{y}_{j},m_{j}=1)\}^{m_{j}}.$$

$$\log L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})=\log L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})+\log L_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\Psi}),$$

$$\log L_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\Psi})=\sum_{j=1}^{n}\big[(1-m_{j})\log\{1-q(\boldsymbol{y}_{j};\boldsymbol{\Psi})\}+m_{j}\log q(\boldsymbol{y}_{j};\boldsymbol{\Psi})\big]$$

$$\operatorname{err}(\hat{\boldsymbol{\theta}}_{\mathrm{CC}};\boldsymbol{\theta})=1-\sum_{i=1}^{g}\pi_{i}\Pr\{R(\boldsymbol{Y};\hat{\boldsymbol{\theta}}_{\mathrm{CC}})=i\mid\hat{\boldsymbol{\theta}}_{\mathrm{CC}},Z=i\}.$$

$$\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\beta})=\boldsymbol{I}_{\mathrm{CC}}(\boldsymbol{\beta})-\gamma(\boldsymbol{\Psi})\boldsymbol{I}_{\mathrm{CC}}^{(\mathrm{clr})}(\boldsymbol{\beta})+\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\beta}),$$

$$\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\beta})>\gamma(\boldsymbol{\Psi})\boldsymbol{I}_{\mathrm{CC}}^{(\mathrm{clr})}(\boldsymbol{\beta}),$$

$$z_{ij}^{(k)}=\mathrm{E}_{\boldsymbol{\Psi}^{(k)}}\{Z_{ij}\mid\boldsymbol{y}\}=\Pr\nolimits_{\boldsymbol{\Psi}^{(k)}}\{Z_{ij}=1\mid\boldsymbol{y}\}=\tau_{i}(\boldsymbol{y}_{j};\boldsymbol{\Psi}^{(k)})=\frac{\pi_{i}^{(k)}\phi(\boldsymbol{y}_{j};\boldsymbol{\mu}_{i}^{(k)},\boldsymbol{\Sigma}_{i}^{(k)})}{\sum_{h=1}^{g}\pi_{h}^{(k)}\phi(\boldsymbol{y}_{j};\boldsymbol{\mu}_{h}^{(k)},\boldsymbol{\Sigma}_{h}^{(k)})}.$$

$$\begin{aligned}Q(\boldsymbol{\Psi};\boldsymbol{\Psi}^{(k)})&=\sum_{j=1}^{n}(1-m_{j})\sum_{i=1}^{g}z_{ij}\{\log\pi_{i}+\log\phi(\boldsymbol{y}_{j};\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})\}\\ &\quad+\sum_{j=1}^{n}m_{j}\sum_{i=1}^{g}z_{ij}^{(k)}\{\log\pi_{i}+\log\phi(\boldsymbol{y}_{j};\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})\}\\ &\quad+\sum_{j=1}^{n}\big[(1-m_{j})\log\{1-q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})\}+m_{j}\log q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})\big].\end{aligned}$$

$$\boldsymbol{\theta}^{(k+1)}=\arg\max_{\boldsymbol{\theta}}\,Q(\boldsymbol{\theta},\boldsymbol{\xi}^{(k)};\boldsymbol{\theta}^{(k)},\boldsymbol{\xi}^{(k)}),$$

$$\begin{aligned}Q(\boldsymbol{\theta},\boldsymbol{\xi}^{(k)};\boldsymbol{\theta}^{(k)},\boldsymbol{\xi}^{(k)})&=\sum_{j=1}^{n}\Big[(1-m_{j})\sum_{i=1}^{g}z_{ij}\{\log\pi_{i}+\log f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\}\\ &\quad+m_{j}\sum_{i=1}^{g}z_{ij}^{(k)}\{\log\pi_{i}+\log f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\}\Big]\\ &\quad+\sum_{j=1}^{n}\big[(1-m_{j})\log\{1-q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi}^{(k)})\}+m_{j}\log q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi}^{(k)})\big].\end{aligned}$$

$$\boldsymbol{\xi}^{(k+1)}=\arg\max_{\boldsymbol{\xi}}\,Q(\boldsymbol{\theta}^{(k+1)},\boldsymbol{\xi};\boldsymbol{\theta}^{(k)},\boldsymbol{\xi}^{(k)}),$$

$$\boldsymbol{\xi}^{(k+1)}=\arg\max_{\boldsymbol{\xi}}\,\log L_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\theta}^{(k+1)},\boldsymbol{\xi}),$$

$$\log L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi}^{(k+1)})-\log L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi}^{(k)})$$


Taxonomy

Topics: Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference · Bayesian Modeling and Causal Inference

Full text

Ziyang Lyu (1, corresponding author), Daniel Ahfock (2), Ryan Thompson (1, 3), and Geoffrey J. McLachlan (2)

University of New South Wales and University of Queensland

(1) UNSW Data Science Hub and School of Mathematics and Statistics, University of New South Wales, NSW 2052 Australia. Email: [email protected]

(2) School of Mathematics and Physics, University of Queensland, QLD 4072 Australia

(3) Data61, Commonwealth Scientific and Industrial Research Organisation, NSW 2015 Australia

Abstract

Semi-supervised learning is being extensively applied to estimate classifiers from training data in which not all the labels of the feature vectors are available. We present gmmsslm, an R package for estimating the Bayes’ classifier from such partially classified data in the case where the feature vector has a multivariate Gaussian (normal) distribution in each of the predefined classes. Our package implements a recently proposed Gaussian mixture modelling framework that incorporates a missingness mechanism for the missing labels, in which the probability of a missing label is represented via a logistic model with covariates that depend on the entropy of the feature vector. Under this framework, it has been shown that the Bayes’ classifier formed from the Gaussian mixture model fitted to the partially classified training data can even have a lower error rate than if it were estimated from a completely classified sample. This result was established in the particular case of two Gaussian classes with a common covariance matrix. Here, we focus on the effective implementation of an algorithm for multiple Gaussian classes with arbitrary covariance matrices. A strategy for initialising the algorithm is discussed and illustrated. The new package is demonstrated on some real data.

keywords:

Bayes’ rule; entropy; mixture model; partially classified sample; semi-supervised learning

1 Introduction

Classifiers such as neural networks often achieve their strong performance through a completely supervised learning approach, which requires a fully classified (labelled) dataset. However, missing labels often occur in practice due to difficulty determining the true label for an observation (the feature vector). For example, in medicine and defence, images can often only be correctly classified by a limited number of experts in the field. Hence, a training sample might not be completely classified, with images difficult to classify left without their class labels. Moreover, in medical diagnosis, there might be scans that can be diagnosed confidently only after an invasive procedure, possibly regarded as unethical to perform at the time.

Semi-supervised learning (Chapelle, Schölkopf & Zien, 2006) addresses the issue of missing labels. Classic approaches that belong to the semi-supervised learning paradigm include generative models (Pan et al., 2006; Kim & Kang, 2007; Fujino, Ueda & Saito, 2008), graph-based models (Blum & Mitchell, 1998; Szummer & Jaakkola, 2001; Zhou et al., 2003), and semi-supervised support vector machines (Vapnik, 1998; Joachims, 1999; Lanckriet et al., 2004). Gaussian mixture models are a fundamental class of statistical models particularly relevant to semi-supervised learning. Given a partially classified sample, the traditional optimisation objective is a joint likelihood over the labelled and unlabelled data. This problem of maximum likelihood (ML) with missing data is amenable to the expectation-maximisation (EM) algorithm (Dempster, Laird & Rubin, 1977); see, for example, McLachlan & Peel (2000) and the recent review of McLachlan, Lee & Rathnayake (2019) on finite mixture models. Although Gaussian mixture models in the semi-supervised setting are now well studied (Pan et al., 2006; Kim & Kang, 2007; Côme et al., 2009; Huang & Hasegawa-Johnson, 2010; Szczurek et al., 2010), there is often a critical assumption that the missing-label process can be ignored for likelihood-based inference (McLachlan, 1975a, 1977; Ganesalingam & McLachlan, 1978; Chawla & Karakoulas, 2005).

In a recent study, Ahfock & McLachlan (2020) introduced a novel approach that treats labels of unclassified observations as missing data, leveraging a framework for handling missingness, as in the groundbreaking work by Rubin (1976) on incomplete data analysis. Ahfock & McLachlan (2020) conducted an asymptotic analysis to demonstrate that a partially classified sample can provide more valuable information than a fully labelled sample, specifically in the two-class Gaussian homoscedastic model; see also the review by Ahfock & McLachlan (2023). By building upon their framework, we adapt the probability of a missing label to depend on a logistic model with a covariate equal to an entropy-based measure. This adaptation enables the implementation of an algorithm for estimating the Bayes’ classifier through the full likelihood, catering to multiple Gaussian classes with arbitrary covariance matrices.

In the context of our more general framework, we introduce the R package gmmsslm (Gaussian mixture model-based semi-supervised learning with a missing-data mechanism), which is available open-source on the Comprehensive R Archive Network at https://cran.r-project.org/package=gmmsslm. This package implements three distinct Gaussian mixture modelling methods. Although various packages exist for estimating mixture models, such as bgmm (Biecek et al., 2012), EMMIX (McLachlan et al., 1999), flexmix (Grun & Leisch, 2007), mclust (Fraley & Raftery, 2007), mixtools (Benaglia et al., 2009), and Rmixmod (Lebret et al., 2015), none accommodate a missing-data mechanism. In gmmsslm, the missingness mechanism is specified through multinomial logistic regression concerning the entropy of feature vectors. The package applies to an arbitrary number of classes possessing multivariate Gaussian distributions with potentially dissimilar covariance matrices.

The paper is structured as follows. Section 2 defines the main notation and introduces some core functionalities of our package. Section 3 describes the mechanism for handling missing data. Section 4 considers an expectation-conditional maximisation algorithm for fitting mixture models and presents the primary routine gmmsslm(). Section 5 demonstrates a practical application of the package. Finally, Section 6 concludes the paper.

2 Mixture modelling

2.1 Notation

We let $\boldsymbol{Y}$ be a $p$-dimensional vector of features on an entity to be assigned to one of $g$ predefined classes $C_{1},\,\ldots,\,C_{g}$ occurring in proportions $\pi_{1},\,\ldots,\,\pi_{g}$, where $\sum_{i=1}^{g}\pi_{i}=1$. The random variable $\boldsymbol{Y}$ corresponding to the realization $\boldsymbol{y}$ is assumed to have density $f_{i}(\boldsymbol{y};\boldsymbol{\omega}_{i})$, known up to a vector $\boldsymbol{\omega}_{i}$ of unknown parameters, in Class $C_{i}\,(i=1,\,\ldots,\,g)$. The vector of unknown parameters is given by

$$\boldsymbol{\theta}=(\pi_{1},\,\ldots,\,\pi_{g-1},\,\boldsymbol{\omega}_{1}^{T},\,\ldots,\,\boldsymbol{\omega}_{g}^{T})^{T}.$$

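The optimal (Bayes’) rule of allocation $R(\boldsymbol{y};\boldsymbol{\theta})$ assigns an entity with feature vector $\boldsymbol{y}$ to Class $C_{k}$ (that is, $R(\boldsymbol{y};\boldsymbol{\theta})=k$) if

$$k=\arg\max_{i}\,\tau_{i}(\boldsymbol{y};\boldsymbol{\theta}),$$

where

$$\tau_{i}(\boldsymbol{y};\boldsymbol{\theta})=\frac{\pi_{i}f_{i}(\boldsymbol{y};\boldsymbol{\omega}_{i})}{\sum_{h=1}^{g}\pi_{h}f_{h}(\boldsymbol{y};\boldsymbol{\omega}_{h})}$$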

is the posterior probability that the entity belongs to Class $C_{i}$ given $\bY=\by\,(i=1,\,\ldots,\,g)$ .

In the sequel, we assume that the class-conditional densities of $\by$ are multivariate Gaussian with

$$f_{i}(\boldsymbol{y};\boldsymbol{\omega}_{i})=\phi(\boldsymbol{y};\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})\quad(i=1,\,\ldots,\,g),$$

where $\phi(\boldsymbol{y};\boldsymbol{\mu},\boldsymbol{\Sigma})$ denotes the $p$-variate Gaussian density function with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. The vector $\boldsymbol{\theta}$ of all unknown parameters now consists of the elements of the means $\boldsymbol{\mu}_{i}$ and the $\textstyle\frac{1}{2}p(p+1)$ distinct elements of the covariance matrices $\boldsymbol{\Sigma}_{i}$, along with the mixing proportions. In this Gaussian setting, the function bayesclassifier(dat,pi,mu,sigma) implements the Bayes’ classifier, while the function get_clusterprobs(dat,pi,mu,sigma) calculates the posterior probabilities $\tau_{i}(\boldsymbol{y};\boldsymbol{\theta})$ defined above. The input arguments for these functions are detailed in Table 1.

The input data, designated as dat, is an $n\times p$ matrix where each row signifies an individual observation.
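To make the allocation rule concrete, the following is a minimal base-R sketch of the posterior probabilities and the Bayes' rule for Gaussian classes with unequal covariance matrices. It is not the code of bayesclassifier() or get_clusterprobs(); the helper names log_dmvnorm, posterior_probs, and bayes_rule are hypothetical, and the mixing proportions are called prop rather than pi so that R's built-in constant pi stays available.

```r
# Log-density of N(mu, sigma) evaluated at each row of the n x p matrix y.
log_dmvnorm <- function(y, mu, sigma) {
  centred <- sweep(y, 2, mu)  # subtract the mean from every row
  maha <- rowSums((centred %*% solve(sigma)) * centred)  # squared Mahalanobis distances
  log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
  -0.5 * (ncol(y) * log(2 * pi) + log_det + maha)
}

# Posterior probabilities tau_i(y; theta): rows are observations, columns are classes.
# prop: g-vector, mu: p x g matrix, sigma: p x p x g array (assumed layout, as in Table 1).
posterior_probs <- function(y, prop, mu, sigma) {
  log_num <- sapply(seq_along(prop), function(i) {
    log(prop[i]) + log_dmvnorm(y, mu[, i], sigma[, , i])
  })
  log_num <- matrix(log_num, nrow = nrow(y))  # keep an n x g matrix even when n = 1
  w <- exp(log_num - apply(log_num, 1, max))  # subtract row maxima for numerical stability
  w / rowSums(w)
}

# Bayes' rule: allocate each observation to the class with the largest posterior probability.
bayes_rule <- function(y, prop, mu, sigma) {
  max.col(posterior_probs(y, prop, mu, sigma), ties.method = "first")
}

# Toy two-class example (illustrative values only).
prop <- c(0.5, 0.5)
mu <- cbind(c(0, 0), c(3, 3))
sigma <- array(diag(2), dim = c(2, 2, 2))
y <- rbind(c(0, 0), c(3, 3), c(1.5, 1.5))
tau <- posterior_probs(y, prop, mu, sigma)
allocated <- bayes_rule(y, prop, mu, sigma)
```

The third toy observation lies midway between the two class means, so its two posterior probabilities coincide and the tie is broken in favour of the first class.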

In order to estimate $\boldsymbol{\theta}$ it is customary in practice to have available a training sample. We let $\boldsymbol{x}_{\mathrm{CC}}=(\boldsymbol{x}_{1}^{T},\,\ldots,\,\boldsymbol{x}_{n}^{T})^{T}$ contain $n$ independent realisations of $\boldsymbol{X}=(\boldsymbol{Y}^{T},Z)^{T}$ as the completely classified training data, where $Z$ denotes the class membership of $\boldsymbol{Y}$, being equal to $i$ if $\boldsymbol{Y}$ belongs to Class $C_{i}\,(i=1,\,\ldots,\,g)$, and zero otherwise, and where $\boldsymbol{x}_{j}=(\boldsymbol{y}_{j}^{T},z_{j})^{T}\,(j=1,\,\ldots,\,n)$. For a partially classified training sample $\boldsymbol{x}_{\mathrm{PC}}$ in semi-supervised learning (SSL), we introduce the missing-label indicator $m_{j}$, which equals 1 if $z_{j}$ is missing and 0 if it is available $(j=1,\,\ldots,\,n)$. Thus $\boldsymbol{x}_{\mathrm{PC}}$ consists of those observations $\boldsymbol{x}_{j}$ in $\boldsymbol{x}_{\mathrm{CC}}$ with $m_{j}=0$, but only the feature vector $\boldsymbol{y}_{j}$ in $\boldsymbol{x}_{\mathrm{CC}}$ if $m_{j}=1$ (that is, the label $z_{j}$ is missing). The presence of unclassified feature observations in the training data (that is, features with missing labels) necessitates the consideration and fitting of the unconditional density of $\boldsymbol{Y}$, which is given by the $g$-component Gaussian mixture density,

$$f(\boldsymbol{y};\boldsymbol{\theta})=\sum_{i=1}^{g}\pi_{i}\phi(\boldsymbol{y};\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i}).$$

2.2 Likelihood functions

The sample version of the Bayes’ classifier can be constructed from partially classified data via ML estimation of $\boldsymbol{\theta}$ using the EM algorithm and its ECM variant (Dempster, Laird & Rubin, 1977); see also McLachlan & Krishnan (2008). We define the log likelihoods

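$$\log L_{\mathrm{C}}(\boldsymbol{\theta})=\sum_{j=1}^{n}(1-m_{j})\sum_{i=1}^{g}z_{ij}\log\{\pi_{i}f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\},\qquad(1)$$

$$\log L_{\mathrm{UC}}(\boldsymbol{\theta})=\sum_{j=1}^{n}m_{j}\log\Big\{\sum_{i=1}^{g}\pi_{i}f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\Big\},$$

$$\log L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})=\log L_{\mathrm{C}}(\boldsymbol{\theta})+\log L_{\mathrm{UC}}(\boldsymbol{\theta}).$$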

In (1), $z_{ij}=1$ if $z_{j}=i$ and otherwise $z_{ij}=0$ . If one ignores the ‘missingness’ of the class labels, $L_{\mathrm{C}}(\boldsymbol{\theta})$ and $L_{\mathrm{UC}}(\boldsymbol{\theta})$ are the likelihood functions formed from the classified and unclassified data, respectively. The likelihood function $L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})$ is formed from the partially classified sample $\bx_{\mathrm{PC}}$ , ignoring the missing-data mechanism for the labels. The likelihood $L_{\mathrm{CC}}(\boldsymbol{\theta})$ for the completely classified sample $\bx_{\mathrm{CC}}$ is recovered from (1) by taking all $m_{j}=0$ .
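As an illustration of these definitions, the sketch below evaluates the classified part, the unclassified part, and their sum for a tiny sample in base R. It is not the package's loglk_ig(); the function name loglik_ignore and the argument name prop (used instead of pi to avoid masking R's constant) are hypothetical, and the unclassified term is computed without any protection against numerical underflow.

```r
# Log-density of N(mu, sigma) evaluated at each row of the n x p matrix y.
log_dmvnorm <- function(y, mu, sigma) {
  centred <- sweep(y, 2, mu)
  maha <- rowSums((centred %*% solve(sigma)) * centred)
  log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
  -0.5 * (ncol(y) * log(2 * pi) + log_det + maha)
}

# zm holds the class labels, with NA marking a missing label.
loglik_ignore <- function(y, zm, prop, mu, sigma) {
  # n x g matrix with entries log{pi_i f_i(y_j; omega_i)}
  log_joint <- sapply(seq_along(prop), function(i) {
    log(prop[i]) + log_dmvnorm(y, mu[, i], sigma[, , i])
  })
  log_joint <- matrix(log_joint, nrow = nrow(y))
  m <- is.na(zm)  # missing-label indicators m_j
  # Classified part: pick the entry of each labelled row that matches its label.
  classified <- sum(log_joint[cbind(which(!m), zm[!m])])
  # Unclassified part: log of the mixture density for rows without a label.
  unclassified <- sum(log(rowSums(exp(log_joint[m, , drop = FALSE]))))
  c(classified = classified, unclassified = unclassified,
    ignore = classified + unclassified)
}

# Toy example: two labelled observations and one unlabelled observation.
prop <- c(0.5, 0.5)
mu <- cbind(c(0, 0), c(3, 3))
sigma <- array(diag(2), dim = c(2, 2, 2))
y <- rbind(c(0, 0), c(3, 3), c(1.5, 1.5))
zm <- c(1, 2, NA)
ll <- loglik_ignore(y, zm, prop, mu, sigma)
```

Setting every entry of zm to a known label makes the unclassified part vanish, which corresponds to the completely classified case.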

In the present context, it is appropriate to dispense with the missing-data mechanism when performing likelihood inference in situations where the missing labels can be viewed as missing completely at random (MCAR) in the framework proposed by Rubin (1976) for handling missingness in incomplete data analysis. The reader is referred to Mealli & Rubin (2015) for precise definitions of MCAR and its less restrictive version of missing at random (MAR). The MCAR case here holds if the missingness of labels is independent of both the features and labels, while with the MAR case this missingness is allowed to depend on the features but not the labels. It may be permissible to ignore the missingness of the labels as in the MAR example of truncated features analysed by McLachlan & Gordon (1989). However, in the MAR case under consideration here, the missingness mechanism is such that it is not ignorable in carrying out the appropriate likelihood analysis.

3 Likelihoods with missingness mechanism

3.1 Missing-data mechanism

Ahfock & McLachlan (2020) noted that it is common in practice for unlabelled images (that is, the features with missing labels) to fall in regions of the feature space where there is class overlap. This finding led them to argue that the unlabelled observations can carry additional information that can be used to improve the efficiency of the parameter estimation of $\boldsymbol{\theta}$ .

They noted that in these situations the difficulty of classifying an observation can be quantified using the Shannon entropy of an entity with feature vector $\by$ , which is defined by

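$$e(\boldsymbol{y}_{j};\boldsymbol{\theta})=-\sum_{i=1}^{g}\tau_{i}(\boldsymbol{y}_{j};\boldsymbol{\theta})\log\tau_{i}(\boldsymbol{y}_{j};\boldsymbol{\theta}).$$

The function get_entropy(dat,pi,mu,sigma) computes this Shannon entropy.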
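The entropy is a simple function of the posterior probabilities, as the following base-R sketch shows. It takes an already computed $n\times g$ matrix of posterior probabilities rather than the arguments of get_entropy(), and the name shannon_entropy is hypothetical.

```r
# Shannon entropy of each row of an n x g matrix of posterior probabilities.
shannon_entropy <- function(tau) {
  terms <- ifelse(tau > 0, tau * log(tau), 0)  # convention: 0 * log(0) = 0
  -rowSums(terms)
}

# Three illustrative observations: maximal overlap, mild overlap, no overlap.
tau <- rbind(c(0.5, 0.5), c(0.9, 0.1), c(1, 0))
e <- shannon_entropy(tau)
```

The entropy is largest for the observation whose posterior probabilities are equal and zero for the observation that is allocated with certainty.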

Let $M_{j}$ denote the random variable corresponding to the realised value $m_{j}$ of the missing-label indicator for the observation $\by_{j}$ . The missingness mechanism of Rubin (1976) is specified in the present context as

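$$\Pr\{M_{j}=1\mid\boldsymbol{y}_{j},z_{j}\}=\Pr\{M_{j}=1\mid\boldsymbol{y}_{j}\}=q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi}),$$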

where $\boldsymbol{\xi}=(\xi_{0},\xi_{1})^{\top}$ is distinct from $\boldsymbol{\theta}$ . The conditional probability $q(\by_{j};\boldsymbol{\theta},\boldsymbol{\xi})$ is taken to be a logistic function of the Shannon entropy $e(\by_{j};\boldsymbol{\theta})$ , yielding

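$$q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})=\frac{\exp\{\xi_{0}+\xi_{1}\log e(\boldsymbol{y}_{j};\boldsymbol{\theta})\}}{1+\exp\{\xi_{0}+\xi_{1}\log e(\boldsymbol{y}_{j};\boldsymbol{\theta})\}},\qquad(4)$$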

where $e(\by_{j};\boldsymbol{\theta})=-\sum_{i=1}^{g}\tau_{i}(\by_{j};\boldsymbol{\theta})\log\tau_{i}(\by_{j};\boldsymbol{\theta})$ .

To simplify the numerical computations, Ahfock & McLachlan (2020) gave an asymptotic linear relationship between the negative log entropy $-\log\{e(\by_{j};\boldsymbol{\theta})\}$ and the square of the discriminant function $d^{2}(\by_{j};\boldsymbol{\beta})$ in the special case of $g=2$ with equal covariance matrices. Therefore, the negative log entropy in the conditional probability (4) can be replaced by the square of the discriminant function:

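$$q(\boldsymbol{y}_{j};\boldsymbol{\beta},\boldsymbol{\xi})=\frac{\exp\{\xi_{0}-\xi_{1}d^{2}(\boldsymbol{y}_{j};\boldsymbol{\beta})\}}{1+\exp\{\xi_{0}-\xi_{1}d^{2}(\boldsymbol{y}_{j};\boldsymbol{\beta})\}},$$

where

$$d(\boldsymbol{y}_{j};\boldsymbol{\beta})=\beta_{0}+\boldsymbol{\beta}_{1}^{\top}\boldsymbol{y}_{j},$$

with

$$\beta_{0}=\log(\pi_{1}/\pi_{2})-\tfrac{1}{2}(\boldsymbol{\mu}_{1}+\boldsymbol{\mu}_{2})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})\quad\text{and}\quad\boldsymbol{\beta}_{1}=\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}).$$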
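For the two-class homoscedastic case, the discriminant coefficients and the resulting probability of a missing label can be sketched in a few lines of base R. The coefficients follow the usual linear discriminant form, $\beta_{0}=\log(\pi_{1}/\pi_{2})-\frac{1}{2}(\boldsymbol{\mu}_{1}+\boldsymbol{\mu}_{2})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$ and $\boldsymbol{\beta}_{1}=\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2})$. The function names are hypothetical and the value of xi is chosen purely for illustration.

```r
# Discriminant coefficients for two Gaussian classes with a common covariance matrix.
discriminant_coefs <- function(prop, mu1, mu2, sigma) {
  beta1 <- solve(sigma, mu1 - mu2)  # Sigma^{-1} (mu1 - mu2)
  beta0 <- log(prop[1] / prop[2]) - 0.5 * sum((mu1 + mu2) * beta1)
  list(beta0 = beta0, beta1 = as.numeric(beta1))
}

# Probability of a missing label: logistic function of xi0 - xi1 * d^2.
missing_prob_d2 <- function(y, coefs, xi) {
  d <- as.numeric(coefs$beta0 + y %*% coefs$beta1)
  plogis(xi[1] - xi[2] * d^2)
}

coefs <- discriminant_coefs(c(0.5, 0.5), c(0, 0), c(3, 3), diag(2))
y <- rbind(c(0, 0), c(3, 3), c(1.5, 1.5))
d <- as.numeric(coefs$beta0 + y %*% coefs$beta1)  # discriminant at each row of y
xi <- c(2, 0.5)  # illustrative values, not taken from the paper
q <- missing_prob_d2(y, coefs, xi)
```

The discriminant is zero on the decision boundary, so the third observation, which sits midway between the class means, receives the largest probability of a missing label.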

The missing-data indicator based on the entropy ( $g>2$ ) or the square of the discriminant function ( $g=2$ ) can be generated using the function rlabel(dat,pi,mu,sigma,xi). Each element of the output indicates a missing label when equal to one and an available label when equal to zero. In the vector of the group partition clust from rmix, a missing label (missing-data indicator equal to one) is denoted by NA.
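The generation of missing-label indicators under the entropy-based mechanism can be sketched as follows. This is a base-R stand-in written for illustration, not the code of rlabel() or rmix(); the helper names, the argument name prop, and all parameter values are assumptions.

```r
set.seed(1)

# Log-density of N(mu, sigma) evaluated at each row of the n x p matrix y.
log_dmvnorm <- function(y, mu, sigma) {
  centred <- sweep(y, 2, mu)
  maha <- rowSums((centred %*% solve(sigma)) * centred)
  log_det <- as.numeric(determinant(sigma, logarithm = TRUE)$modulus)
  -0.5 * (ncol(y) * log(2 * pi) + log_det + maha)
}

# Posterior probabilities of class membership (n x g matrix).
posterior_probs <- function(y, prop, mu, sigma) {
  log_num <- sapply(seq_along(prop), function(i) {
    log(prop[i]) + log_dmvnorm(y, mu[, i], sigma[, , i])
  })
  log_num <- matrix(log_num, nrow = nrow(y))
  w <- exp(log_num - apply(log_num, 1, max))
  w / rowSums(w)
}

# Illustrative three-class bivariate mixture with identity covariance matrices.
prop <- c(0.3, 0.3, 0.4)
mu <- cbind(c(0, 0), c(2, 0), c(0, 2))
sigma <- array(diag(2), dim = c(2, 2, 3))
n <- 200
z <- sample(seq_along(prop), n, replace = TRUE, prob = prop)  # true labels
y <- t(mu[, z]) + matrix(rnorm(n * 2), n, 2)                  # features

# Entropy-based probability of a missing label, then Bernoulli draws.
tau <- posterior_probs(y, prop, mu, sigma)
entropy <- -rowSums(ifelse(tau > 0, tau * log(tau), 0))
xi <- c(-0.5, 1)
q <- plogis(xi[1] + xi[2] * log(entropy))
m <- rbinom(n, 1, q)         # 1 = label missing, 0 = label available
zm <- ifelse(m == 1, NA, z)  # labels with NA marking the missing ones
```

With a positive second coefficient in xi, observations in regions of class overlap (high entropy) are more likely to lose their label.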

3.2 Full likelihood function based on a missingness mechanism

We let $\boldsymbol{\Psi}=(\boldsymbol{\theta}^{\top},\boldsymbol{\xi}^{\top})^{\top}$ be the vector of all unknown parameters. Henceforth, we let $f$ be a generic symbol for a density or probability function where appropriate. To construct the full likelihood function $L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})$ from the partially classified sample $\bx_{\mathrm{PC}}$ , we need expressions for

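$$f(\boldsymbol{y}_{j},z_{j},m_{j}=0)\quad\text{and}\quad f(\boldsymbol{y}_{j},m_{j}=1),$$

for the classified and unclassified observations $\boldsymbol{y}_{j}$, respectively. For a classified observation $\boldsymbol{y}_{j}$, it follows that

$$f(\boldsymbol{y}_{j},z_{j},m_{j}=0)=f(z_{j})f(\boldsymbol{y}_{j}\mid z_{j})\Pr\{M_{j}=0\mid\boldsymbol{y}_{j},z_{j}\}=\prod_{i=1}^{g}\{\pi_{i}f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\}^{z_{ij}}\{1-q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})\},$$

while for an unclassified observation $\boldsymbol{y}_{j}$, we have

$$f(\boldsymbol{y}_{j},m_{j}=1)=f(\boldsymbol{y}_{j})\Pr\{M_{j}=1\mid\boldsymbol{y}_{j}\}=\sum_{i=1}^{g}\pi_{i}f_{i}(\boldsymbol{y}_{j};\boldsymbol{\omega}_{i})\,q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi}).$$

Thus, the full likelihood can be expressed as

$$L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})=\prod_{j=1}^{n}\{f(\boldsymbol{y}_{j},z_{j},m_{j}=0)\}^{1-m_{j}}\{f(\boldsymbol{y}_{j},m_{j}=1)\}^{m_{j}}.$$

The full log likelihood follows as

$$\log L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})=\log L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})+\log L_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\Psi}),$$

where

$$\log L_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\Psi})=\sum_{j=1}^{n}\big[(1-m_{j})\log\{1-q(\boldsymbol{y}_{j};\boldsymbol{\Psi})\}+m_{j}\log q(\boldsymbol{y}_{j};\boldsymbol{\Psi})\big]\qquad(5)$$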

is the log likelihood formed on the basis of the missing-label indicators $m_{j}$ .

The log likelihood $\log L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})$ formed ignoring the missing-label indicators can be calculated by loglk_ig(dat,zm,pi,mu,sigma). The input parameters must be structured as described in Table 1. In addition, the input data are specified by dat, an $n\times p$ matrix where each row represents an individual observation, and zm, an $n$-dimensional vector of group partition with the missing label NA.

The log likelihood function formed on the basis of the missing-label indicator (5) can be calculated by loglk_miss(dat,zm,pi,mu,sigma,xi). The inputs are the same as those described in loglk_ig(). In addition, xi is a two-dimensional vector containing $\xi_{0}$ and $\xi_{1}$, the parameters of the conditional probability $q(\boldsymbol{y}_{j};\boldsymbol{\theta},\boldsymbol{\xi})$. As noted earlier, the conditional probability $q(\boldsymbol{y}_{j};\boldsymbol{\Psi})$ is based on the log entropy ( $g>2$ ) or the square of the discriminant function (two-class Gaussian homoscedastic model). The full log likelihood with a missing-data mechanism can be calculated by loglk_full(dat,zm,pi,mu,sigma,xi).
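The missing-label term is a Bernoulli log likelihood in the indicators $m_{j}$ with success probabilities $q(\boldsymbol{y}_{j};\boldsymbol{\Psi})$. The base-R sketch below evaluates it from precomputed probabilities; it does not reproduce the interface of loglk_miss(), and the name loglik_missing and the numerical values are hypothetical.

```r
# Log likelihood of the missing-label indicators m given probabilities q of a missing label.
# Assumes every q lies strictly between 0 and 1.
loglik_missing <- function(m, q) {
  sum((1 - m) * log(1 - q) + m * log(q))
}

m <- c(1, 0, 0)        # first label missing, the other two available
q <- c(0.8, 0.1, 0.3)  # illustrative probabilities of a missing label
ll_miss <- loglik_missing(m, q)
```

Adding this quantity to the log likelihood that ignores the missingness mechanism gives the full log likelihood.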

We let $\hat{\boldsymbol{\theta}}_{\mathrm{CC}}$, $\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})}$, and $\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{full})}$ be the estimates of $\boldsymbol{\theta}$ formed by consideration of the likelihoods $L_{\mathrm{CC}}(\boldsymbol{\theta})$, $L_{\mathrm{PC}}^{(\mathrm{ig})}(\boldsymbol{\theta})$, and $L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})$, respectively. Similarly, we let $R(\hat{\boldsymbol{\theta}}_{\mathrm{CC}})$, $R(\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})})$, and $R(\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{full})})$ denote the estimated Bayes’ rule obtained by plugging in the estimates $\hat{\boldsymbol{\theta}}_{\mathrm{CC}}$, $\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})}$, and $\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{full})}$, respectively. Hereafter, we abbreviate the above notation as $R$ with superscripts ${(\mathrm{ig})}$ and ${(\mathrm{full})}$ and subscripts ${\mathrm{CC}}$ and ${\mathrm{PC}}$ where appropriate. The overall conditional error rate of the rule $R(\boldsymbol{y};\hat{\boldsymbol{\theta}}_{\mathrm{CC}})$ is then

$$\operatorname{err}(\hat{\boldsymbol{\theta}}_{\mathrm{CC}};\boldsymbol{\theta})=1-\sum_{i=1}^{g}\pi_{i}\Pr\{R(\boldsymbol{Y};\hat{\boldsymbol{\theta}}_{\mathrm{CC}})=i\mid\hat{\boldsymbol{\theta}}_{\mathrm{CC}},Z=i\}.\qquad(6)$$

The corresponding conditional error rate $\operatorname{err}(\hat{\boldsymbol{\theta}}_{PC}^{(\mathrm{ig})};\boldsymbol{\theta})$ of the rule $R(\by;\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})})$ is defined likewise. The optimal error rate $\operatorname{err}(\boldsymbol{\theta})$ follows by replacing $\hat{\boldsymbol{\theta}}_{\mathrm{CC}}$ with $\boldsymbol{\theta}$ in (6).

The distribution of $R(\bY;\boldsymbol{\theta})$ is complicated, and manageable analytical expressions are only obtainable in special cases, such as those addressed by Gilbert (1969), Han (1969), McLachlan (1975b), and Hawkins & Raath (1982).
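One case in which the conditional error rate has a closed form is a linear rule applied to two Gaussian classes with a common covariance matrix: the discriminant is then univariate Gaussian within each class, so each misallocation probability is a normal tail area. The base-R sketch below is an illustration under that assumption only, with hypothetical function and argument names; b0 and b1 stand for whatever intercept and slope the plug-in rule uses.

```r
# Conditional error rate of the linear rule "allocate to class 1 if b0 + b1'y > 0"
# when Y | Z = i is N(mu_i, sigma) for i = 1, 2 (two-class homoscedastic case).
linear_rule_error <- function(b0, b1, prop, mu1, mu2, sigma) {
  s <- sqrt(sum(b1 * (sigma %*% b1)))            # standard deviation of b1'Y
  miss1 <- pnorm(-(b0 + sum(b1 * mu1)) / s)      # Pr{allocated to 2 | class 1}
  miss2 <- pnorm((b0 + sum(b1 * mu2)) / s)       # Pr{allocated to 1 | class 2}
  prop[1] * miss1 + prop[2] * miss2
}

prop <- c(0.5, 0.5)
mu1 <- c(0, 0)
mu2 <- c(3, 3)
sigma <- diag(2)

# Optimal rule: true discriminant coefficients for these parameters.
beta1 <- solve(sigma, mu1 - mu2)
beta0 <- log(prop[1] / prop[2]) - 0.5 * sum((mu1 + mu2) * beta1)
err_opt <- linear_rule_error(beta0, beta1, prop, mu1, mu2, sigma)

# A perturbed rule (illustrative values) cannot beat the optimal error rate.
err_est <- linear_rule_error(8, c(-3, -2.5), prop, mu1, mu2, sigma)
```

With equal mixing proportions the optimal error rate reduces to the normal tail area at minus half the Mahalanobis distance between the class means.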

3.3 Theoretical motivation

Under the model (4) for non-ignorable MAR labels in the case of the two-class homoscedastic Gaussian model, Ahfock & McLachlan (2020) derived the following theorem that motivates the development of a package to implement this semi-supervised learning approach for possibly multiple classes with multivariate Gaussian distributions.

Theorem 1.

The Fisher information about $\boldsymbol{\beta}$ in the partially classified sample $\bx_{\mathrm{PC}}$ via the full likelihood function $L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})$ can be decomposed as

$$\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\beta})=\boldsymbol{I}_{\mathrm{CC}}(\boldsymbol{\beta})-\gamma(\boldsymbol{\Psi})\boldsymbol{I}_{\mathrm{CC}}^{(\mathrm{clr})}(\boldsymbol{\beta})+\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\beta}),\qquad(7)$$

where $\boldsymbol{I}_{\mathrm{CC}}(\boldsymbol{\beta})$ is the information about $\boldsymbol{\beta}$ in the completely classified sample $\boldsymbol{x}_{\mathrm{CC}}$, $\boldsymbol{I}_{\mathrm{CC}}^{(\mathrm{clr})}(\boldsymbol{\beta})$ is the conditional information about $\boldsymbol{\beta}$ under the logistic regression model fitted to the class labels in $\boldsymbol{x}_{\mathrm{CC}}$, and $\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\beta})$ is the information about $\boldsymbol{\beta}$ in the missing-label indicators under the assumed logistic model for their distribution given their associated features in the partially classified sample $\boldsymbol{x}_{\mathrm{PC}}$.

The expression (7) for the Fisher information about the vector of discriminant function coefficients contains the additional term $\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\beta})$, arising from the additional information about $\boldsymbol{\beta}$ in the missing-label indicators $m_{j}$. This term has the potential to compensate for the loss of information in not knowing the true labels of the unclassified features in the partially classified sample. The compensation depends on the extent to which the probability of a missing label for a feature depends on its entropy. It follows that if

$$\boldsymbol{I}_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\beta})>\gamma(\boldsymbol{\Psi})\boldsymbol{I}_{\mathrm{CC}}^{(\mathrm{clr})}(\boldsymbol{\beta}),\qquad(8)$$

there is an increase in the information about $\boldsymbol{\beta}$ from the partially classified sample over the information $\boldsymbol{I}_{{\mathrm{CC}}}(\boldsymbol{\beta})$ from the completely classified sample. The inequality in (8) is used to mean that the difference of the left- and right-hand sides of the inequality is a positive definite matrix.

By deriving the asymptotic relative efficiency of the Bayes’ rule using the full ML estimate of $\boldsymbol{\beta}$ , Ahfock & McLachlan (2020) showed that the asymptotic expected excess error rate using the partially classified sample $\bx_{\mathrm{PC}}$ can be much lower than the corresponding excess rate using the completely classified sample $\bx_{\mathrm{CC}}$ . The contribution to the Fisher information from the missingness mechanism can be relatively high if $\xi_{1}$ is large, as the location of the unclassified features in the feature space provides information about regions of high uncertainty, and hence, where the entropy is high.

4 Maximum likelihood estimation

4.1 ECM algorithm

We apply the expectation-conditional maximisation (ECM) algorithm of Meng & Rubin (1993) to compute the ML estimate of $\boldsymbol{\Psi}$ on the basis of the full likelihood $L_{\mathrm{PC}}^{(\mathrm{full})}(\boldsymbol{\Psi})$ . The adopted EM framework makes the obvious choice of declaring the ‘missing’ data to be the missing labels $\boldsymbol{z}_{j}$ for those features $\by_{j}$ with $m_{j}=1$ .

E step: It handles the presence of the introduced missing data by forming on the $(k+1)$ th iteration the so-called $Q$ -function $Q(\boldsymbol{\Psi};\boldsymbol{\Psi}^{(k)})$ equal to the expectation of the complete-data log likelihood conditional on the observed data $\by$ , using the current estimate $\boldsymbol{\Psi}^{(k)}$ for $\boldsymbol{\Psi}$ . As this complete-data log likelihood is linear in the missing-class labels, this expectation-conditional on $\by$ is affected by replacing the unobservable $z_{ij}$ by its conditional expectation given $\by,z_{ij}^{(k)}$ , where

[TABLE]

Accordingly, we have that

[TABLE]
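As a rough illustration of the E step, the sketch below assumes the standard Gaussian-mixture form of the conditional expectation: for a feature with a missing label it is the posterior probability of class membership under the current parameters, and for a labelled feature it is the observed 0/1 indicator. The helpers `dmvn()` and `e_step()` are hypothetical names used only for this illustration and are not part of gmmsslm.

```r
# Gaussian density N(mu, sigma) evaluated at each row of the n x p matrix y
dmvn <- function(y, mu, sigma) {
  p <- ncol(y)
  d2 <- mahalanobis(y, center = mu, cov = sigma)
  exp(-0.5 * d2) / sqrt((2 * pi)^p * det(sigma))
}

# E step (sketch): posterior class probabilities for rows with a missing label,
# 0/1 indicators for rows whose label is observed.
# prop: length-g mixing proportions; mu: p x g; sigma: p x p x g; zm: labels with NA
e_step <- function(y, zm, prop, mu, sigma) {
  g <- length(prop)
  dens <- sapply(seq_len(g), function(i) prop[i] * dmvn(y, mu[, i], sigma[, , i]))
  tau <- dens / rowSums(dens)
  lab <- which(!is.na(zm))
  tau[lab, ] <- 0
  tau[cbind(lab, zm[lab])] <- 1
  tau
}

# Toy data: two bivariate classes, two labels treated as missing
set.seed(1)
y <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
zm <- c(rep(1, 10), rep(2, 10))
zm[c(3, 14)] <- NA
tau <- e_step(y, zm, prop = c(0.5, 0.5),
              mu = cbind(c(0, 0), c(3, 3)),
              sigma = array(diag(2), dim = c(2, 2, 2)))
round(tau[c(3, 14), ], 3)
```

Because the missingness in this framework depends on the feature vector rather than on the unobserved label, conditioning additionally on $m_{j}=1$ does not change these posterior probabilities.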

We calculate the updated value $\boldsymbol{\Psi}^{(k+1)}$ of $\boldsymbol{\Psi}$ using two conditional maximisation (CM) steps.

CM step-1: We fix $\boldsymbol{\xi}$ at its current value $\boldsymbol{\xi}^{(k)}$ and update $\boldsymbol{\theta}$ to $\boldsymbol{\theta}^{(k+1)}$ given by

[TABLE]

where

[TABLE]

CM step-2: We now fix $\boldsymbol{\theta}$ at its updated value $\boldsymbol{\theta}^{(k+1)}$ and update $\boldsymbol{\xi}$ to $\boldsymbol{\xi}^{(k+1)}$ as

[TABLE]

which reduces to

[TABLE]

on retaining only terms that depend on $\boldsymbol{\xi}$ .

As $L_{\mathrm{PC}}^{(\mathrm{miss})}(\boldsymbol{\theta}^{(k+1)},\boldsymbol{\xi})$ belongs to the regular exponential family, we use the function glm() from the base package stats to carry out this maximisation. The estimate $\hat{\boldsymbol{\Psi}}_{{\mathrm{PC}}}^{(\mathrm{full})}$ is given by the limiting value of $\boldsymbol{\Psi}^{(k)}$ as $k$ tends to infinity. We take the ECM algorithm as having converged when

[TABLE]

is less than some arbitrarily specified value.
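CM step-2 can be pictured as an ordinary logistic regression of the missing-label indicators on an entropy-based covariate. The sketch below assumes the covariate is the log entropy of the current posterior probabilities; the simulated `tau`, the helper `entropy()`, and the value `xi_true` are illustrative and are not taken from the package.

```r
# Shannon entropy of each row of an n x g matrix of posterior probabilities
entropy <- function(tau) {
  -rowSums(ifelse(tau > 0, tau * log(tau), 0))
}

set.seed(2)
n <- 500
# Illustrative posterior probabilities for g = 2 classes
t1 <- runif(n, 0.01, 0.99)
tau <- cbind(t1, 1 - t1)
log_en <- log(entropy(tau))

# Simulate missing-label indicators from a logistic model in the log entropy
xi_true <- c(1, 2)
q_miss <- plogis(xi_true[1] + xi_true[2] * log_en)
m <- rbinom(n, size = 1, prob = q_miss)

# CM step-2 (sketch): with theta held fixed, maximising the logistic
# likelihood in xi is a binomial GLM fit
fit <- glm(m ~ log_en, family = binomial)
xi_new <- coef(fit)
xi_new
```

In the actual algorithm the covariate would be recomputed from $\boldsymbol{\theta}^{(k+1)}$ at every iteration before the regression is refitted.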

4.2 Model fitting

The ECM algorithm described above is implemented in the primary function gmmsslm(). The input arguments are described in Table 2.

The default choices of iter.max, eval.max, rel.tol, and sing.tol that control the behaviour of the algorithm often work well.
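The names iter.max, eval.max, rel.tol, and sing.tol coincide with control settings of the base optimiser stats::nlminb(). Assuming they are passed through to an optimiser of that kind (this section does not say so), the sketch below shows what each setting governs. The objective is a toy example unrelated to the package, and the numerical values are illustrative rather than the package defaults.

```r
# Toy objective: negative log-likelihood of a N(mean, sd^2) sample,
# parameterised as (mean, log sd) so that the problem is unconstrained
set.seed(3)
x <- rnorm(200, mean = 1.5, sd = 2)
negloglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}

opt <- nlminb(
  start = c(0, 0),
  objective = negloglik,
  control = list(
    iter.max = 500,   # maximum number of iterations
    eval.max = 500,   # maximum number of objective evaluations
    rel.tol  = 1e-10, # relative tolerance on the objective value
    sing.tol = 1e-10  # singular-convergence tolerance
  )
)
c(mean = opt$par[1], sd = exp(opt$par[2]), code = opt$convergence)
```

Tightening the tolerances or raising the iteration limits trades running time for a more precise optimum, which is the usual reason to change such settings.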

As an example, we apply gmmsslm() to a dataset generated from a mixture of $g=4$ trivariate Gaussian distributions in equal proportions. We generate the features with the function rmix() and the missing-label indicators with rlabel(), and then estimate the parameters from the resulting partially classified sample. The parameters are set as $n=300$, $\boldsymbol{\pi}=(0.25,0.25,0.25,0.25)^{\top}$, $\boldsymbol{\mu}_{1}=(0.2,0.3,0.4)^{\top}$, $\boldsymbol{\mu}_{2}=(0.2,0.7,0.6)^{\top}$, $\boldsymbol{\mu}_{3}=(0.1,0.7,1.6)^{\top}$, $\boldsymbol{\mu}_{4}=(0.2,1.7,0.6)^{\top}$, $\boldsymbol{\Sigma}_{i}=i\boldsymbol{I}$ for $i\in\{1,2,3,4\}$, and $\boldsymbol{\xi}=(-0.5,1)^{\top}$.

library(gmmsslm)
set.seed(8)
n<-300
g<-4
p<-3
# mixing proportions, class means (one column per class), class covariances
pi<-rep(1/g,g)
mu<-matrix(c(0.2,0.3,0.4,0.2,0.7,0.6,0.1,0.7,1.6,0.2,1.7,0.6),p,g)
sigma<-array(unlist(lapply(1:g,function(i) diag(i,p))),dim=c(p,p,g))
paralist<-list(pi=pi,mu=mu,sigma=sigma)
# features and true labels
dat<-rmix(n=n,pi=pi,mu=mu,sigma=sigma)
# missing-label indicators; labels with m==1 are set to NA
xi<-c(-0.5,1)
m<-rlabel(dat=dat$Y,pi=pi,mu=mu,sigma=sigma,xi=xi)
zm<-replace(dat$clust,m==1,NA)
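To see what this data-generating step involves without the package, the following base-R sketch draws from the same Gaussian mixture and then deletes labels with a probability given by a logistic function of the log entropy. It is a stand-in built on assumptions: rmix() and rlabel() may differ in their details, the names `prop`, `dmvn`, `tau`, `en`, and `q_miss` are hypothetical, and the random draws will not reproduce the package's output.

```r
set.seed(8)
n <- 300; g <- 4; p <- 3
prop <- rep(1 / g, g)
mu <- matrix(c(0.2, 0.3, 0.4, 0.2, 0.7, 0.6,
               0.1, 0.7, 1.6, 0.2, 1.7, 0.6), p, g)
sigma <- array(unlist(lapply(1:g, function(i) diag(i, p))), dim = c(p, p, g))

# Draw class labels, then features as mu_i + t(chol(Sigma_i)) %*% z with z ~ N(0, I)
clust <- sample.int(g, n, replace = TRUE, prob = prop)
y <- t(sapply(clust, function(i) {
  mu[, i] + drop(t(chol(sigma[, , i])) %*% rnorm(p))
}))

# Posterior probabilities under the true parameters, and their entropy
dmvn <- function(y, mu, sigma) {
  exp(-0.5 * mahalanobis(y, mu, sigma)) / sqrt((2 * pi)^ncol(y) * det(sigma))
}
dens <- sapply(1:g, function(i) prop[i] * dmvn(y, mu[, i], sigma[, , i]))
tau <- dens / rowSums(dens)
en <- -rowSums(ifelse(tau > 0, tau * log(tau), 0))

# Logistic missingness (assumed form): label j is missing with probability q_miss[j]
xi <- c(-0.5, 1)
q_miss <- plogis(xi[1] + xi[2] * log(en))
m <- rbinom(n, 1, q_miss)
zm <- replace(clust, m == 1, NA)
table(missing = m)
```

With a positive slope $\xi_{1}$, the features that lose their labels are concentrated where the classes overlap, which is the behaviour the missingness mechanism is meant to capture.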

We now fit the three models to the aforementioned partially and completely classified samples by utilising their respective likelihood functions and setting the appropriate parameters. Specifically, we set zm=zm or zm=dat$clust, and choose the model type using type='full', type='ign', or type='com' as needed. To obtain initial values for the parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$, we employ the initialvalue() function. Note that the argument ncov specifies the structure of the class covariance matrices. Its default value is ncov=2, indicating unequal covariance matrices, whereas ncov=1 specifies a common covariance matrix. For the initial values of $\xi_{0}$ and $\xi_{1}$, we use the glm() function.

inits<-initialvalue(dat=dat$Y,zm=zm,g=g,ncov=2)
en<-get_entropy(dat=dat$Y,n=n,p=p,g=g,paralist=inits)
xi_inits<-coef(glm(m~en,family='binomial'))
fit_ful<-gmmsslm(dat=dat$Y,zm=zm,paralist=inits,xi=xi_inits,type='full')
fit_ign<-gmmsslm(dat=dat$Y,zm=zm,paralist=inits,type='ign')
fit_com<-gmmsslm(dat=dat$Y,zm=dat$clust,paralist=inits,type='com')

The functions yield a gmmsslm object as their output. To extract a comprehensive summary from the fitted model, we can employ the summary() function. This summary reports various aspects, including the likelihood value, fitted variance structure, convergence status, number of iterations, model type, total number of observations, dimensions, and estimated parameters.

summary(fit_ful)
Table:
                  [,1]
Likelihood        "2099.796"
VarianceStructure "Unequal covariance matrices"
Convergence       "1"
Iteration         "108"
TotalObservation  "300"
Dimension         "3"
ModelType         "full"

Parameters:
$pi
[1] 0.2448038 0.2247527 0.2521564 0.2782871

$mu
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2491248 0.1430182 0.3248906 0.2862982
[2,] 0.1632115 0.6807152 0.8309257 1.6827811
[3,] 0.3066612 0.1033765 1.7899401 0.4550641

$sigma
, , 1

            [,1]        [,2]        [,3]
[1,]  0.91534868 -0.03619485  0.02974812
[2,] -0.03619485  0.68636971 -0.26421012
[3,]  0.02974812 -0.26421012  0.97870318

, , 2

           [,1]       [,2]       [,3]
[1,]  1.7435648 -0.5144975 -0.1351344
[2,] -0.5144975  2.3481520 -0.3188222
[3,] -0.1351344 -0.3188222  2.3780968

, , 3

           [,1]       [,2]       [,3]
[1,]  2.4470608 -0.2274167 -0.5799091
[2,] -0.2274167  3.2809171 -0.2257279
[3,] -0.5799091 -0.2257279  3.3297267

, , 4

           [,1]       [,2]       [,3]
[1,]  5.1069469  0.3922628 -0.1481315
[2,]  0.3922628  2.9109848 -0.1947038
[3,] -0.1481315 -0.1947038  4.3250141

$xi
(Intercept)          en
 -0.2490987   0.6779641

Additionally, the predict() function returns predicted labels from a fit by gmmsslm() for the unclassified data initially supplied to gmmsslm(). We can then compute the error rate of each of the three classifiers using erate() with the parameters extracted by paraextract().

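# Assumption (definitions not shown in the source): dat_ul and clust_ul hold
# the unclassified features and their true labels
dat_ul<-dat$Y[m==1,]
clust_ul<-dat$clust[m==1]
err_ful<-erate(dat=dat_ul,p=p,g=g,paralist=paraextract(fit_ful),clust=clust_ul)
err_ign<-erate(dat=dat_ul,p=p,g=g,paralist=paraextract(fit_ign),clust=clust_ul)
err_com<-erate(dat=dat_ul,p=p,g=g,paralist=paraextract(fit_com),clust=clust_ul)
c(err_ful,err_ign,err_com)
[1] 0.4997755 0.5279257 0.4843896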
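An error rate of this kind is the proportion of features whose label under the estimated Bayes' rule differs from the true label. The sketch below spells out that computation on toy data with known parameters. The helpers `dmvn()`, `bayes_rule()`, and `error_rate()` are hypothetical stand-ins and make no claim about how predict() or erate() are implemented.

```r
# Gaussian density N(mu, sigma) evaluated at each row of the n x p matrix y
dmvn <- function(y, mu, sigma) {
  exp(-0.5 * mahalanobis(y, mu, sigma)) / sqrt((2 * pi)^ncol(y) * det(sigma))
}

# Bayes' rule: assign each row of y to the class with the largest
# posterior probability (equivalently, the largest prop_i * density_i)
bayes_rule <- function(y, prop, mu, sigma) {
  dens <- sapply(seq_along(prop), function(i) {
    prop[i] * dmvn(y, mu[, i], sigma[, , i])
  })
  apply(dens, 1, which.max)
}

# Proportion of misallocated features
error_rate <- function(pred, truth) mean(pred != truth)

# Toy data: two bivariate classes whose means differ by 2 in the first coordinate
set.seed(4)
truth <- rep(1:2, each = 100)
y <- cbind(rnorm(200, mean = c(0, 2)[truth]), rnorm(200))
pred <- bayes_rule(y, prop = c(0.5, 0.5),
                   mu = cbind(c(0, 0), c(2, 0)),
                   sigma = array(diag(2), dim = c(2, 2, 2)))
err <- error_rate(pred, truth)
err
```

In this toy setting the rule reduces to thresholding the first coordinate at 1, so the error rate is governed entirely by the class overlap, much as the error rates near 0.5 above reflect the heavy overlap of the four simulated classes.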

5 Gastrointestinal dataset

5.1 Data description

We illustrate the functionality of gmmsslm on the gastrointestinal lesions data from Mesejo et al. (2016). The dataset comprises 76 colonoscopy videos, the histology (classification ground truth), and the opinions of the endoscopists (four experts and three beginners). White light and narrow band imaging methods were used to classify whether the lesions were benign or malignant. Each of the $n=76$ observations consists of four selected features extracted from the colonoscopy videos. A panel of seven endoscopists viewed the videos to give their opinion as to whether each patient needed resection (malignant) or no-resection (benign).

We formed our partially classified sample as follows. Feature vectors for which all seven endoscopists agreed were taken to be classified with labels specified either as 1 (resection) or 2 (no-resection) using the ground-truth labels. Observations for which there was not total agreement among the endoscopists were taken as having missing labels, denoted by NA.

Figure 1 shows a plot of the data with the class labels of the feature vectors used to fit the classifiers in our example.

The black squares denote the unlabelled observations, the red triangles denote the benign observations, and the blue circles denote the malignant observations. The unlabelled observations tend to be located in regions of class overlap.

To further confirm the appropriateness of the missing-data mechanism, we fitted a Gaussian mixture model to estimate the entropy of each observation. Figure 2(a) compares the box plots of the estimated entropies in the labelled and unlabelled groups. Figure 2(b) presents a Nadaraya-Watson kernel estimate of the probability of missing labels.

From Figure 2(a), we find that the unlabelled observations typically have higher entropy than the labelled observations. Figure 2(b) shows that the estimated probability of a missing-class label increases as the log entropy increases. This relation is in accordance with (4). The higher the entropy of a feature vector, the higher the probability of its class label being unavailable.
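A Nadaraya-Watson estimate of this kind can be obtained in base R with stats::ksmooth() using a Gaussian kernel. The sketch below uses simulated log-entropy values and missing-label indicators, not the gastrointestinal data, and the bandwidth is an arbitrary illustrative choice.

```r
set.seed(5)
n <- 400
# Illustrative log-entropy values and missing-label indicators
log_en <- runif(n, -4, 0)
m <- rbinom(n, 1, plogis(1 + 1.2 * log_en))

# Nadaraya-Watson estimate of pr(m = 1 | log entropy) with a Gaussian kernel,
# evaluated on a small grid of log-entropy values
nw <- ksmooth(log_en, m, kernel = "normal", bandwidth = 1,
              x.points = seq(-3.5, -0.5, by = 0.5))
cbind(log_entropy = nw$x, prob_missing = round(nw$y, 3))
```

Because the response is a 0/1 indicator, the kernel-weighted average at each grid point is a local estimate of the probability of a missing label, and an increasing curve is what a positive $\xi_{1}$ would produce.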

5.2 Results

We now use the function gmmsslm() to compare the performance of the Bayes’ classifier with the unknown parameter vector $\boldsymbol{\theta}$ estimated by $\hat{\boldsymbol{\theta}}_{\mathrm{CC}}$ , $\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})}$ , and $\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{full})}$ . The three classifiers $R(\hat{\boldsymbol{\theta}}_{\mathrm{CC}})$ , $R(\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{ig})})$ , and $R(\hat{\boldsymbol{\theta}}_{\mathrm{PC}}^{(\mathrm{full})})$ were applied to all the feature vectors in the completely classified dataset. Error rates are estimated using leave-one-out cross-validation and reported in Table 3.
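Leave-one-out cross-validation refits the classifier $n$ times, each time holding out one feature vector and classifying it with the rule built from the rest. The sketch below does this for a plug-in Gaussian classifier on fully labelled simulated data. It is only an outline of the procedure: for the partially classified rules, the refit inside the loop would be a call to gmmsslm(), and the helper `loo_error()` is a hypothetical name.

```r
# Gaussian density N(mu, sigma) evaluated at each row of the n x p matrix y
dmvn <- function(y, mu, sigma) {
  exp(-0.5 * mahalanobis(y, mu, sigma)) / sqrt((2 * pi)^ncol(y) * det(sigma))
}

# Leave-one-out error of a plug-in Gaussian classifier with unequal covariances.
# Class covariances use cov(), i.e. the divisor n_i - 1 rather than the ML divisor.
loo_error <- function(y, cl) {
  n <- nrow(y)
  classes <- sort(unique(cl))
  wrong <- logical(n)
  for (j in seq_len(n)) {
    ytr <- y[-j, , drop = FALSE]
    ctr <- cl[-j]
    score <- sapply(classes, function(i) {
      yi <- ytr[ctr == i, , drop = FALSE]
      mean(ctr == i) * dmvn(y[j, , drop = FALSE], colMeans(yi), cov(yi))
    })
    wrong[j] <- classes[which.max(score)] != cl[j]
  }
  mean(wrong)
}

# Simulated stand-in with n = 76 observations and four features
set.seed(6)
cl <- rep(1:2, each = 38)
y <- cbind(rnorm(76, mean = c(0, 2.5)[cl]), rnorm(76), rnorm(76), rnorm(76))
err <- loo_error(y, cl)
err
```

Holding out each observation in turn keeps the estimated error rate from being optimistically biased by classifying data that were also used to fit the rule.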

The classifier based on the estimates of the parameters using the full likelihood for the partially classified training sample has lower estimated error rate than that of the rule that would be formed if the sample were completely classified.

6 Summary

The R package gmmsslm implements the semi-supervised learning approach proposed by Ahfock & McLachlan (2020) for estimating the Bayes’ classifier from a partially classified training sample in which some of the feature vectors have missing labels. It uses a generative model approach whereby the joint distribution of the feature vector and its ground-truth label is adopted. Each of the $g$ pre-specified classes to which a feature vector can belong has a multivariate Gaussian distribution. The conditional probability that a feature vector has a missing label is modelled by a logistic function of the entropy of the feature vector. The parameters in the Bayes’ classifier are estimated by maximum likelihood via an expectation-conditional maximisation algorithm. The package applies to classes with equal or unequal covariance matrices in their multivariate Gaussian distributions. In application to a real-world medical dataset, the estimated error rate of the Bayes’ classifier based on the partially classified training sample is lower than that of the Bayes’ classifier formed from a completely classified sample.

Bibliography

Ahfock, D. & McLachlan, G.J. (2020). An apparent paradox: A classifier based on a partially classified sample may have smaller expected error rate than that if the sample were completely classified. Statistics and Computing 30, 1779–1790. doi:10.1007/s11222-020-09971-5.
Ahfock, D. & McLachlan, G.J. (2023). Semi-supervised learning of classifiers from a statistical perspective: A brief review. Econometrics and Statistics 26, 124–138. doi:10.1016/j.ecosta.2022.03.007.
Benaglia, T., Chauveau, D., Hunter, D.R. & Young, D.S. (2009). mixtools: An R package for analyzing mixture models. Journal of Statistical Software 32, 1–29. doi:10.18637/jss.v032.i06.
Biecek, P., Szczurek, E., Vingron, M. & Tiuryn, J. (2012). The R package bgmm: Mixture modeling with uncertain knowledge. Journal of Statistical Software 47, 1–31. doi:10.18637/jss.v047.i03.
Blum, A. & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100. doi:10.1145/279943.279962.
Chapelle, O., Schölkopf, B. & Zien, A. (2006). Semi-Supervised Learning. Adaptive Computation and Machine Learning, Cambridge, MA: MIT Press. doi:10.1109/TNN.2009.2015974.
Chawla, N.V. & Karakoulas, G. (2005). Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research 23, 331–366. doi:10.1613/jair.1509.
Côme, E., Oukhellou, L., Denœux, T. & Aknin, P. (2009). Learning from partially supervised data using mixture models and belief functions. Pattern Recognition 42, 334–348. doi:10.1016/j.patcog.2008.07.014.
