Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting

Marc Lelarge; Leo Miolane

arXiv:1907.03792·cs.LG·October 1, 2019

TL;DR

This paper analytically quantifies the performance gain in semi-supervised learning over supervised learning for Gaussian mixture models in high dimensions, using advanced mathematical tools from statistical physics.

Contribution

It provides the first rigorous analysis of the asymptotic Bayes risk gap in semi-supervised Gaussian mixture models, leveraging recent high-dimensional inference theories.

Findings

1. Quantifies the accuracy improvement due to unlabeled data.

2. Provides explicit formulas for the Bayes risk gap.

3. Demonstrates the impact of unlabeled data in high-dimensional settings.

Equations

$${\bm{Y}}_{j}=V_{j}{\bm{U}}+\sigma{\bm{Z}}_{j},\quad 1\leq j\leq N,$$

$$S_{j}=\begin{cases}V_{j}&\text{with probability }\eta,\\ 0&\text{with probability }1-\eta.\end{cases}$$

$${\bm{Y}}_{\rm new}=V_{\rm new}{\bm{U}}+\sigma{\bm{Z}}_{\rm new},$$

$$R^{*}_{D}(\eta)=\inf_{\widehat{v}}\mathbb{P}\big(\widehat{v}({\bm{Y}},{\bm{S}},{\bm{Y}}_{\rm new})\neq V_{\rm new}\big)$$

$$R_{\rm oracle}=\mathbb{P}\big(\sigma\langle{\bm{U}},{\bm{Z}}_{\rm new}\rangle>1\big)=\mathbb{P}\big(\sigma Z>1\big)=1-\Phi\left(\frac{1}{\sigma}\right),$$

$$\overline{Y}_{1}=\frac{1}{N}\sum_{j=1}^{N}Y_{j}=U_{1}+\frac{\sigma}{\sqrt{N}}\mathcal{N}(0,1).$$

$$\langle{\bm{Y}}_{\rm new},\overline{{\bm{Y}}}\rangle=V_{\rm new}\langle{\bm{U}},\overline{{\bm{Y}}}\rangle+\sigma\langle{\bm{Z}}_{\rm new},\overline{{\bm{Y}}}\rangle,$$

$$\langle{\bm{Y}}_{\rm new},\overline{{\bm{Y}}}\rangle\approx V_{\rm new}+\frac{\sigma\sqrt{\alpha+\sigma^{2}}}{\sqrt{\alpha}}\mathcal{N}(0,1).$$

$$\lim_{N,D\to\infty}R^{*}_{D}(1)=\mathbb{P}\Big(\frac{\sigma\sqrt{\alpha+\sigma^{2}}}{\sqrt{\alpha}}Z>1\Big)=1-\Phi\Big(\frac{\sqrt{\alpha}}{\sigma\sqrt{\alpha+\sigma^{2}}}\Big).$$

$$R^{*}_{D}(0)=\mathbb{E}_{{\bm{Y}}}\Big[\min_{s=\pm 1}\inf_{\widehat{v}}\mathbb{P}\big(s\,\widehat{v}({\bm{Y}},{\bm{Y}}_{\rm new})\neq V_{\rm new}\,\big|\,{\bm{Y}}\big)\Big]$$

$$\lim_{\eta\to 0}\lim_{N,D\to\infty}R^{*}_{D}(\eta)=\lim_{N,D\to\infty}R^{*}_{D}(0).$$

$$f_{\alpha,\sigma,\eta}(q)\overset{\rm def}{=}\alpha(1-\eta)\mathsf{i}_{v}(q/\sigma^{2})+\frac{\alpha}{2\sigma^{2}}(1-q)-\frac{1}{2}\big(q+\log(1-q)\big).$$

$$R^{*}_{D}(\eta)\xrightarrow[N,D\to\infty]{}1-\Phi\big(\sqrt{q^{*}(\alpha,\sigma,\eta)}/\sigma\big).$$

$$\mathbb{P}\big(V_{\rm new}\neq\widehat{v}\big)=R^{*}_{D}(\eta)+o_{N}(1).$$

$$\mathbb{E}\big[(V_{i}-\widehat{v}_{i})^{2}\big]={\rm mmse}_{v}(1/\sigma^{2})$$

$${\rm mmse}_{v}(\gamma)\overset{\rm def}{=}\mathbb{E}\big[(V-\mathbb{E}[V|\sqrt{\gamma}V+Z])^{2}\big].$$

$$\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle=\langle\widebar{{\bm{u}}},{\bm{U}}\rangle V_{i}+\sigma\langle\widebar{{\bm{u}}},{\bm{Z}}_{i}\rangle\simeq q_{u}^{*}V_{i}+\sigma\langle\widebar{{\bm{u}}},{\bm{Z}}_{i}\rangle.$$

$$\frac{1}{\sigma\sqrt{q_{u}^{*}}}\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle\simeq\sqrt{q_{u}^{*}/\sigma^{2}}\,V_{i}+Z$$

$$\mathbb{E}\big[(V_{i}-\widehat{v}_{i})^{2}\big]\simeq{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}).$$

$$\frac{1}{N}\mathbb{E}\|{\bm{V}}-\widebar{{\bm{v}}}\|^{2}\simeq\frac{1}{N}\sum_{i\,|\,S_{i}=0}{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2})\simeq(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}).$$

$$1-q_{v}^{*}\simeq(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}).$$

$${\bm{R}}_{i}=U_{i}{\bm{V}}+\sigma\tilde{{\bm{Z}}}_{i}.$$

$$\frac{1}{\sqrt{N\sigma^{2}q_{v}^{*}}}\langle\widebar{{\bm{v}}},{\bm{R}}_{i}\rangle\simeq\sqrt{\frac{\alpha q_{v}^{*}}{\sigma^{2}}}\sqrt{D}\,U_{i}+Z,$$

$$\mathbb{E}\big[D(U_{i}-\widehat{u}_{i})^{2}\big]\simeq{\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2}),$$

$$\mathbb{E}\big[\|{\bm{U}}-\widebar{{\bm{u}}}\|^{2}\big]=1-q_{u}^{*}\simeq{\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2}),$$

$${\rm mmse}_{u}(\gamma)=\frac{1}{1+\gamma}.$$

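$$q_{v}^{*}=1-(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}),$$

$$q_{u}^{*}=1-{\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2})=\frac{\alpha q_{v}^{*}}{\sigma^{2}+\alpha q_{v}^{*}}.$$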

$$\mathsf{i}_{v}(\gamma)=I(V_{0};\sqrt{\gamma}V_{0}+Z_{0})$$

$$\mathsf{i}_{v}(\gamma)=\gamma-\mathbb{E}\log\cosh(\sqrt{\gamma}Z_{0}+\gamma).$$

Full text

Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting

Marc Lelarge

INRIA-ENS

Paris, France

[email protected]

Léo Miolane

INRIA-ENS

Paris, France

[email protected]

Abstract

Semi-supervised learning (SSL) uses unlabeled data for training and has been shown to greatly improve performance when compared to a supervised approach on the labeled data available. This claim depends both on the amount of labeled data available and on the algorithm used. In this paper, we compute analytically the gap between the best fully-supervised approach using only labeled data and the best semi-supervised approach using both labeled and unlabeled data. We quantify the best possible increase in performance obtained thanks to the unlabeled data, i.e. we compute the accuracy increase due to the information contained in the unlabeled data. Our work deals with a simple high-dimensional Gaussian mixture model for the data in a Bayesian setting. Our rigorous analysis builds on recent theoretical breakthroughs in high-dimensional inference and a large body of mathematical tools from statistical physics initially developed for spin glasses.

1 Introduction

Semi-supervised learning (SSL) has proven to be a powerful paradigm for leveraging unlabeled data to mitigate the reliance on large labeled datasets. The goal of SSL is to leverage large amounts of unlabeled data to improve the performance of supervised learning over small datasets. For unlabeled examples to be informative, assumptions have to be made. The cluster assumption states that, if two samples belong to the same cluster in the input distribution, then they are likely to belong to the same class. The cluster assumption is the same as the low-density separation assumption: the decision boundary should lie in the low-density region.

In this paper, we explore analytically the simplest possible parametric model for the cluster assumption: the two clusters are modeled by a mixture of two high-dimensional Gaussians with diagonal covariance so that the optimal decision boundary is a hyperplane. Our model can be seen as a classification problem in a semi-supervised setting. Our aim here is to define a model simple enough to be mathematically tractable while being practically relevant and capturing the main properties of a high-dimensional statistical inference problem.

Our model has three parameters: the high-dimensionality of the data is captured by $\alpha$ , the ratio of the number of samples to the ambient dimension; $\eta$ is the fraction of labeled data points; and $\sigma^{2}$ measures the amount of overlap between the clusters. As a function of these three parameters, we compute the best possible accuracy (the Bayes risk) when only labeled data are used or when unlabeled data are also used. As a result, we obtain the added value due to the unlabeled data for the best possible algorithm. In particular, we observe a very clear diminishing return of the labeled data, i.e. the first labeled data points bring much more information than the last ones. Hence the regime with very few labeled data points is a priori a regime favorable to SSL. But in this case, we face in practice the problem of small validation sets [28], which makes hyperparameter tuning impossible.

We find that the range of parameters for which SSL clearly outperforms either unsupervised learning or supervised learning on the labeled data is rather narrow. In a case with large overlap between the clusters ( $\sigma^{2}\to\infty$ ), unsupervised learning fails and supervised learning on the labeled data is almost optimal. In a case with small overlap between the clusters ( $\sigma^{2}\to 0$ ), unsupervised learning achieves performances very close to supervised learning with all labels available while using only the labeled dataset fails.

From a practical perspective, we can try to draw parallels between our results and the state of the art in SSL but we need to keep in mind that our results only give best achievable performances on our toy model. In particular, even in a setting where our results predict that unsupervised learning achieves roughly the same performances as supervised learning with all labels, it might be very useful in practice to use a few labels in addition to all unlabeled data. Such an approach is presented in [8] where extremely good performances are achieved for image classification with only a few labeled data per class and a new SSL algorithm: MixMatch. For example on CIFAR-10, with only 250 labeled images, MixMatch achieves an error rate of 11.08% and with 4000 labeled images, an error rate of 6.24% (to be compared with the 4.17% error rate for the fully supervised training on all 50000 samples). These results are aligned with our finding about diminishing returns of labeled data points.

We make the following contributions:

Bayes risk: to the best of our knowledge, our work is the first analytic computation of the Bayes risk in a high-dimensional Gaussian model in a semi-supervised setting.

Rigorous analysis: our analysis builds on a series of recent works [13, 4, 22, 27, 5] with tools from information theory and mathematical physics originally developed for the analysis of spin glasses [29, 32].

The rest of the paper is organized as follows. Our model and the main result are presented in Section 2. Related work is presented in Section 3. In Section 4, we give a heuristic derivation of the main result and in Section 5, we give a proof sketch, while the more technical details are presented in the supplementary material (Section 7). We conclude in Section 6.

2 Model and main results

We now define our classification problem with two classes. The points ${\bm{Y}}_{1},\dots,{\bm{Y}}_{N}$ of the dataset are in $\mathbb{R}^{D}$ and given by the following process:

$${\bm{Y}}_{j}=V_{j}{\bm{U}}+\sigma{\bm{Z}}_{j},\quad 1\leq j\leq N,$$

where ${\bm{U}}\sim{\rm Unif}(\mathbb{S}^{D-1})$ , ${\bm{V}}=(V_{1},\dots,V_{N})\overset{\text{\tiny i.i.d.}}{\sim}\text{Unif}(-1,1)$ and ${\bm{Z}}_{1},\dots,{\bm{Z}}_{N}\overset{\text{\tiny i.i.d.}}{\sim}\mathcal{N}(0,\text{Id}_{D})$ are all independent.

In words, the dataset is composed of $N$ points in $\mathbb{R}^{D}$ divided into two classes with roughly equal sizes. The points with label $V_{j}=+1$ are centered around $+{\bm{U}}\in\mathbb{R}^{D}$ and the points with label $V_{j}=-1$ are centered around $-{\bm{U}}\in\mathbb{R}^{D}$ . The parameter $\sigma$ controls the level of Gaussian noise around these centers.

In a semi-supervised setting, the statistician has access to some labels. We consider a case where each label is revealed with probability $\eta\in[0,1]$ independently of everything else. To fix notation, the side information is given by the following process:

$$S_{j}=\begin{cases}V_{j}&\text{with probability }\eta,\\ 0&\text{with probability }1-\eta.\end{cases}$$

If $S_{j}=0$ , then the label of the $j$ -th data point is unknown whereas if $S_{j}=\pm 1$ , it corresponds to the label of the $j$ -th data point.
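The generative process above can be sketched in a few lines of NumPy. This is only an illustration: the function name `sample_dataset`, the array layout (one data point per row) and the parameter values in the usage line are assumptions of the sketch, not part of the paper.

```python
import numpy as np

def sample_dataset(N, D, sigma, eta, rng):
    """Draw (Y, V, S, U) with Y_j = V_j U + sigma Z_j and labels revealed w.p. eta."""
    U = rng.standard_normal(D)
    U /= np.linalg.norm(U)                    # normalized Gaussian = uniform on the unit sphere
    V = rng.choice([-1, 1], size=N)           # i.i.d. uniform +/-1 labels
    Z = rng.standard_normal((N, D))           # i.i.d. N(0, Id_D) noise, one row per point
    Y = V[:, None] * U[None, :] + sigma * Z   # row j is the data point Y_j
    revealed = rng.random(N) < eta            # each label revealed independently
    S = np.where(revealed, V, 0)              # S_j = V_j if revealed, 0 otherwise
    return Y, V, S, U

rng = np.random.default_rng(0)
Y, V, S, U = sample_dataset(N=2000, D=500, sigma=1.0, eta=0.2, rng=rng)
print(Y.shape, np.mean(S != 0))               # data shape and empirical fraction of labels
```

Here $\alpha=N/D=4$ ; the empirical fraction of revealed labels should be close to $\eta$ .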

Finally, we consider the high-dimensional setting and all our results will be in a regime where $N,D\to\infty$ while the ratio $N/D$ tends towards a constant $\alpha>0$ . Note that we are in a high noise regime since the squared norm of the signal is one whereas the squared norm of the noise is $\sigma^{2}D\approx\sigma^{2}N/\alpha$ where $N$ is the number of observations.

To summarize, the three parameters of our model are: $\sigma^{2}>0$ the variance of the noise in the dataset, $\eta\in[0,1]$ the fraction of revealed labels and $\alpha>0$ the ratio between the number of data points (both labeled and unlabeled) and the dimension of the ambient space. We also assume that the statistician knows the priors, i.e. the distribution of ${\bm{U}},{\bm{V}}$ and ${\bm{Z}}$ .

The task of the statistician is to use the dataset $({\bm{Y}},{\bm{S}})$ in order to make a prediction about the label of a new (unseen) data point. More formally, we define:

$${\bm{Y}}_{\rm new}=V_{\rm new}{\bm{U}}+\sigma{\bm{Z}}_{\rm new},$$

where $V_{\rm new}\sim\text{Unif}(-1,+1)$ , ${\bm{Z}}_{\rm new}\sim\mathcal{N}(0,\text{Id}_{D})$ . We are interested in the minimal achievable error in our model, i.e. the Bayes risk:

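$$R^{*}_{D}(\eta)=\inf_{\widehat{v}}\mathbb{P}\big(\widehat{v}({\bm{Y}},{\bm{S}},{\bm{Y}}_{\rm new})\neq V_{\rm new}\big),$$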

where the infimum is taken over all estimators (measurable functions of ${\bm{Y}},{\bm{S}},{\bm{Y}}_{\rm new}$ ).

Our main mathematical achievement is an analytic formula for the Bayes risk $R^{*}_{D}$ in the large $D$ limit, see Theorem 1 below. In order to state it, we need to introduce some additional notation. We start with some easy facts about our model.

Oracle risk

Assume that the statistician knows the center of the clusters, i.e. has access to the “oracle” vector ${\bm{U}}$ . Then the best classification error would be achieved thanks to the simple thresholding rule ${\rm sign}(\langle{\bm{U}},{\bm{Y}}_{\rm new}\rangle)$ , where $\langle.,.\rangle$ denotes the Euclidean dot product. In this case, the risk is given by:

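$$R_{\rm oracle}=\mathbb{P}\big(\sigma\langle{\bm{U}},{\bm{Z}}_{\rm new}\rangle>1\big)=\mathbb{P}\big(\sigma Z>1\big)=1-\Phi\left(\frac{1}{\sigma}\right),\tag{1}$$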

where $\Phi$ is the standard Gaussian cumulative distribution function. We have of course $R_{\rm oracle}\leq R^{*}_{D}(\eta)$ .
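As a quick sanity check of the oracle risk $1-\Phi(1/\sigma)$ , the sketch below compares the closed form with a Monte Carlo estimate. It uses the fact that $\langle{\bm{U}},{\bm{Y}}_{\rm new}\rangle=V_{\rm new}+\sigma\langle{\bm{U}},{\bm{Z}}_{\rm new}\rangle$ with $\langle{\bm{U}},{\bm{Z}}_{\rm new}\rangle\sim\mathcal{N}(0,1)$ when $\|{\bm{U}}\|=1$ , so only the scalar projection is simulated. The helper names `Phi` and `oracle_risk` and the sample size are choices made for this illustration.

```python
import math
import numpy as np

def Phi(x):
    """Standard Gaussian CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def oracle_risk(sigma):
    """Closed-form oracle risk 1 - Phi(1 / sigma)."""
    return 1.0 - Phi(1.0 / sigma)

rng = np.random.default_rng(0)
sigma, n = 1.0, 200_000
v_new = rng.choice([-1, 1], size=n)
proj = v_new + sigma * rng.standard_normal(n)   # <U, Y_new> for a unit-norm U
mc_risk = float(np.mean(np.sign(proj) != v_new))
print(oracle_risk(sigma), mc_risk)              # the two values should be close
```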

Fully supervised case

Another instructive and simple case is the supervised case where $\eta=1$ . Since all the $V_{j}$ ’s are known, we can assume without loss of generality that they are all equal to one (multiply each ${\bm{Y}}_{j}$ by $V_{j}$ ). More importantly, if we slightly modify the distribution of ${\bm{U}}$ by taking ${\bm{U}}=(U_{1},\dots,U_{D})\overset{\text{\tiny i.i.d.}}{\sim}\mathcal{N}(0,1/D)$ , this will not change the results for our model and makes the analysis easier by decorrelating each component. Indeed, denote by $Y_{j}$ (resp. $Z_{j}$ ) the first component of ${\bm{Y}}_{j}$ (resp. ${\bm{Z}}_{j}$ ) and by $U_{1}$ the first component of ${\bm{U}}$ . Then we have $N$ scalar noisy observations of the first component of ${\bm{U}}$ : $Y_{j}=U_{1}+\sigma Z_{j}$ for $1\leq j\leq N$ , so that we can construct an estimate for $U_{1}$ by taking the average of the observations. We get:

$$\overline{Y}_{1}=\frac{1}{N}\sum_{j=1}^{N}Y_{j}=U_{1}+\frac{\sigma}{\sqrt{N}}\mathcal{N}(0,1).$$

Doing this for each component of ${\bm{U}}$ , we get an estimate of the vector ${\bm{U}}$ and we now use it to get an estimate of $V_{\rm new}$ . First define $\overline{{\bm{Y}}}=(\overline{Y}_{1},\dots,\overline{Y}_{D})$ and consider

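$$\langle{\bm{Y}}_{\rm new},\overline{{\bm{Y}}}\rangle=V_{\rm new}\langle{\bm{U}},\overline{{\bm{Y}}}\rangle+\sigma\langle{\bm{Z}}_{\rm new},\overline{{\bm{Y}}}\rangle,$$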

and note that as $D\to\infty$ , we have $\langle{\bm{U}},\overline{{\bm{Y}}}\rangle\approx D\mathbb{E}[U_{1}\overline{Y}_{1}]=D\mathbb{E}\left[U_{1}^{2}\right]=1$ and $\langle{\bm{Z}}_{\rm new},\overline{{\bm{Y}}}\rangle\approx\sqrt{\mathbb{E}\left[\|\overline{{\bm{Y}}}\|^{2}\right]}Z\approx\sqrt{\alpha+\sigma^{2}}/\sqrt{\alpha}\mathcal{N}(0,1)$ , so that we get:

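$$\langle{\bm{Y}}_{\rm new},\overline{{\bm{Y}}}\rangle\approx V_{\rm new}+\frac{\sigma\sqrt{\alpha+\sigma^{2}}}{\sqrt{\alpha}}\mathcal{N}(0,1).$$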

Our main result will actually show that estimating $V_{\rm new}$ with the sign of $\langle{\bm{Y}}_{\rm new},\overline{{\bm{Y}}}\rangle$ is optimal so that we get:

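$$\lim_{N,D\to\infty}R^{*}_{D}(1)=\mathbb{P}\Big(\frac{\sigma\sqrt{\alpha+\sigma^{2}}}{\sqrt{\alpha}}Z>1\Big)=1-\Phi\Big(\frac{\sqrt{\alpha}}{\sigma\sqrt{\alpha+\sigma^{2}}}\Big).\tag{2}$$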

A similar result was obtained in [15] in a case where the covariance structure of the noise needs also to be estimated by the statistician resulting in a multiplicative term inside the $\Phi(.)$ function (see Theorem 3.1 and Corollary 3.3 in [15]).
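The fully supervised limit $1-\Phi\big(\sqrt{\alpha}/(\sigma\sqrt{\alpha+\sigma^{2}})\big)$ can be compared with a finite-size simulation of the averaging classifier ${\rm sign}(\langle{\bm{Y}}_{\rm new},\overline{{\bm{Y}}}\rangle)$ described above. The following is a rough sketch: the function names, the sizes $N=2000$ , $D=1000$ and the number of test points are arbitrary choices, and agreement is only expected up to finite-size and Monte Carlo fluctuations.

```python
import math
import numpy as np

def Phi(x):
    """Standard Gaussian CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def supervised_risk(alpha, sigma):
    """Asymptotic risk when all labels are known (eta = 1)."""
    return 1.0 - Phi(math.sqrt(alpha) / (sigma * math.sqrt(alpha + sigma**2)))

def simulate_supervised(N, D, sigma, n_test, rng):
    """Empirical error of sign(<Y_new, Y_bar>) on one random instance of the model."""
    U = rng.standard_normal(D)
    U /= np.linalg.norm(U)                           # cluster center on the unit sphere
    V = rng.choice([-1, 1], size=N)
    Y = V[:, None] * U + sigma * rng.standard_normal((N, D))
    Y_bar = (V[:, None] * Y).mean(axis=0)            # flip each point by its label, then average
    V_new = rng.choice([-1, 1], size=n_test)
    Y_new = V_new[:, None] * U + sigma * rng.standard_normal((n_test, D))
    return float(np.mean(np.sign(Y_new @ Y_bar) != V_new))

rng = np.random.default_rng(0)
empirical = simulate_supervised(N=2000, D=1000, sigma=1.0, n_test=4000, rng=rng)
print(supervised_risk(alpha=2.0, sigma=1.0), empirical)
```

As $\alpha\to\infty$ the closed form tends to the oracle risk $1-\Phi(1/\sigma)$ , which gives a second easy consistency check.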

Unsupervised case

In this paper, we concentrate on the case where $\eta>0$ . When $\eta=0$ , there is no side information and we are in an unsupervised setting studied in [27]. Due to the symmetry of our model, we have $R^{*}_{D}=1/2$ because there is no way to guess the right classes $\pm 1$ . In order to have a well-posed problem, the risk should be redefined as follows:

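$$R^{*}_{D}(0)=\mathbb{E}_{{\bm{Y}}}\Big[\min_{s=\pm 1}\inf_{\widehat{v}}\mathbb{P}\big(s\,\widehat{v}({\bm{Y}},{\bm{Y}}_{\rm new})\neq V_{\rm new}\,\big|\,{\bm{Y}}\big)\Big].$$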

Although this measure of performance is not the one studied in [27], we can adapt the argument to show that:

$$\lim_{\eta\to 0}\lim_{N,D\to\infty}R^{*}_{D}(\eta)=\lim_{N,D\to\infty}R^{*}_{D}(0).\tag{3}$$

Main result

We now state our main result:

Theorem 1.

Let us define, for $\alpha,\sigma>0$ , $\eta\in(0,1]$ ,

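$$f_{\alpha,\sigma,\eta}(q)\overset{\rm def}{=}\alpha(1-\eta)\mathsf{i}_{v}(q/\sigma^{2})+\frac{\alpha}{2\sigma^{2}}(1-q)-\frac{1}{2}\big(q+\log(1-q)\big).\tag{4}$$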

Here $\mathsf{i}_{v}(\gamma)=\gamma-\mathbb{E}\log\cosh(\sqrt{\gamma}Z_{0}+\gamma)$ where $Z_{0}\sim\mathcal{N}(0,1)$ . The function $f_{\alpha,\sigma,\eta}$ admits a unique minimizer $q^{*}(\alpha,\sigma,\eta)$ on $[0,1)$ and

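$$R^{*}_{D}(\eta)\xrightarrow[N,D\to\infty]{}1-\Phi\big(\sqrt{q^{*}(\alpha,\sigma,\eta)}/\sigma\big).\tag{5}$$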

As a by-product of our proof, we show that a very simple algorithm is optimal (asymptotically in $N,D$ ). Namely, we define $\widebar{{\bm{u}}}=\mathbb{E}[{\bm{U}}|{\bm{Y}},{\bm{S}}]$ , the posterior mean of ${\bm{U}}$ given the observations. Then taking $\widehat{v}={\rm sign}(\langle\widebar{{\bm{u}}},{\bm{Y}}_{\rm new}\rangle)$ is an optimal estimator of $V_{\rm new}$ in the sense that

$$\mathbb{P}\big(V_{\rm new}\neq\widehat{v}\big)=R^{*}_{D}(\eta)+o_{N}(1).$$

Of course, from a practical point of view, computing an estimate for $\widebar{{\bm{u}}}=\mathbb{E}[{\bm{U}}|{\bm{Y}},{\bm{S}}]$ is not an easy task (except in the supervised setting) and approximations need to be made.
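The limit in Theorem 1 is straightforward to evaluate numerically. The sketch below is one possible way to do it, under assumptions that are ours rather than the paper's: the expectation in $\mathsf{i}_{v}$ is approximated by Gauss-Hermite quadrature with 100 nodes, the minimizer of $f_{\alpha,\sigma,\eta}$ is located by golden-section search (justified by the uniqueness of the minimizer stated in the theorem), and the limiting risk is taken in the form $1-\Phi(\sqrt{q^{*}}/\sigma)$ , which reduces to the fully supervised and oracle formulas when $\eta=1$ and $q^{*}=1$ respectively. All function names are illustrative.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for E[g(Z)], Z ~ N(0, 1): E[g(Z)] ~ sum(_w * g(_x)).
_x, _w = hermegauss(100)
_w = _w / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard Gaussian CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def i_v(gamma):
    """i_v(gamma) = gamma - E log cosh(sqrt(gamma) Z + gamma), by quadrature."""
    t = math.sqrt(gamma) * _x + gamma
    log_cosh = np.logaddexp(t, -t) - math.log(2.0)   # numerically stable log cosh
    return gamma - float(np.sum(_w * log_cosh))

def f(q, alpha, sigma, eta):
    """The potential f_{alpha, sigma, eta}(q) of Theorem 1, for q in [0, 1)."""
    return (alpha * (1.0 - eta) * i_v(q / sigma**2)
            + alpha / (2.0 * sigma**2) * (1.0 - q)
            - 0.5 * (q + math.log(1.0 - q)))

def q_star(alpha, sigma, eta, tol=1e-10):
    """Golden-section search for the minimizer of f on [0, 1)."""
    g = lambda q: f(q, alpha, sigma, eta)
    r = (math.sqrt(5.0) - 1.0) / 2.0                 # inverse golden ratio
    a, b = 0.0, 1.0 - 1e-12
    c, d = b - r * (b - a), a + r * (b - a)
    gc, gd = g(c), g(d)
    while b - a > tol:
        if gc < gd:                                  # minimizer lies in [a, d]
            b, d, gd = d, c, gc
            c = b - r * (b - a)
            gc = g(c)
        else:                                        # minimizer lies in [c, b]
            a, c, gc = c, d, gd
            d = a + r * (b - a)
            gd = g(d)
    return 0.5 * (a + b)

def ssl_risk(alpha, sigma, eta):
    """Asymptotic semi-supervised Bayes risk 1 - Phi(sqrt(q*) / sigma)."""
    return 1.0 - Phi(math.sqrt(q_star(alpha, sigma, eta)) / sigma)

for eta in (0.01, 0.1, 0.5, 1.0):
    print(eta, ssl_risk(alpha=2.0, sigma=1.0, eta=eta))
```

For $\eta=1$ the term involving $\mathsf{i}_{v}$ vanishes and the minimizer is $q^{*}=\alpha/(\alpha+\sigma^{2})$ in closed form, which is a convenient check of the search routine.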

Figure 1 gives examples of our main results with comparison of the various settings. The semi-supervised curve corresponds to the formula (5) where a fraction $\eta$ of the data points have labels and are used with all unlabeled data points. The supervised on full curve corresponds to (2) where all the labels are used. The supervised on labeled curve corresponds to (2) with the parameter $\alpha$ replaced by $\alpha\eta$ and is the best possible performance when only the fraction $\eta$ of the data points having labels is used. The unsupervised curve corresponds to (3) where all the data points are used but without any label. Finally, the oracle curve corresponds to (1) where the centers of the clusters are known (corresponding to the case $\alpha\to\infty$ ). In the left plot of Figure 1, we clearly see that the first labeled data points (i.e. when $\eta$ is small) greatly decrease the risk of semi-supervised learning. This corresponds to the diminishing return of the labeled data. In the right plot of Figure 1, we see that in the high-noise regime, unsupervised learning fails and that its risk decreases as soon as $\sigma^{2}<1$ . This phenomenon is known as the BBP phase transition [1, 2, 30]. We see that below this transition, the unlabeled data are of little help as the performance of SSL almost matches the performance of supervised learning on labeled data only. Moreover, after the transition, unsupervised learning quickly reaches the performance of SSL. In other words, the regime most favorable to SSL in terms of noise corresponds precisely to the regime around the BBP phase transition where unsupervised learning is still not very good while supervised learning on labeled data saturates.

3 Related work

The unsupervised version of our problem is the standard Gaussian mixture model used in statistics [16]. In the regime considered here (dimension and number of samples tending to infinity), there are a number of recent works dealing with the clustering problem of Gaussian mixtures. However, a large part of them considers scenarios where $\alpha\to\infty$ or $\sigma\to 0$ . In the regime where $\alpha=O(1)$ i.e. where the number of observations is proportional to the dimension, spectral clustering has been extensively studied. In this regime it is known that the leading eigenvector of the sample covariance matrix encounters a phase transition [1, 2, 30]: there exists a critical value of the noise intensity below which the leading eigenvector starts to be correlated with the centers of the clusters. Using exact but non-rigorous methods from statistical physics, [6, 24] determine the critical values for $\alpha$ and $\sigma$ at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. Rigorous results on this model are given in [3] where bounds on the critical values are obtained. The precise thresholds were then determined in [27]. Our analysis builds on the techniques derived in this last reference with two main modifications: additional work is required to compute the classification accuracy (as opposed to the mean squared error) and to incorporate the side information.

To the best of our knowledge, there are far fewer theoretical works dealing with semi-supervised learning in a high-dimensional setting. [9] shows the error converges exponentially fast in the number of labeled examples if the mixture model is identifiable (see [31] for an extension of these results). [10] studies a mixture model where the estimation problem is essentially reduced to the one of estimating the mixing parameter and shows that the information content of unlabeled examples decreases as classes overlap. [12] shows that unlabeled data can lead to an increase in classification error in a case where the model is incorrect. A similar conclusion is obtained in [21] for linear classifiers defined by convex margin-based surrogate losses. In contrast, our work computes the asymptotic Bayes risk for which unlabeled data can only improve the best achievable performance. More closely related to our work, [14] provides the first information theoretic tight analysis for inference of latent community structure given a dense graph along with high dimensional node covariates, correlated with the same latent communities. [25] studies a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data similar to our setting.

In contrast, there are a number of practical works and proposed algorithms for semi-supervised learning based on transductive models [19], graph-based methods [34] or generative modeling [7], see the surveys [35] and [11]. SSL methods based on training a neural network by adding an additional loss term to ensure consistency regularization are presented in [17], [20], [33]. We refer in particular to the recent work [28] for an overview of these SSL methods (currently the state-of-the-art for SSL on image classification datasets). The algorithm MixMatch introduced in [8] obtains impressive results on all standard image benchmarks. Given these recent improvements, natural questions arise: what is the best possible achievable performance? To what extent can we generalize these improvements to other domains? We believe that our work is a first step in a theoretical understanding of these questions.

4 Heuristic derivation of the main result

We now present a heuristic derivation of our results, based on the “cavity method” [26] from statistical physics. Let $\widebar{{\bm{u}}}=\mathbb{E}[{\bm{U}}|{\bm{Y}},{\bm{S}}]$ and $\widebar{{\bm{v}}}=\mathbb{E}[{\bm{V}}|{\bm{Y}},{\bm{S}}]$ be the optimal estimators (in terms of mean squared error) for estimating ${\bm{U}}$ and ${\bm{V}}$ . A natural hypothesis is to assume that the correlation $\langle\widebar{{\bm{u}}},{\bm{U}}\rangle$ converges as $N,D\to\infty$ to some deterministic limit $q_{u}^{*}\in[0,1]$ and that $\frac{1}{N}\langle\widebar{{\bm{v}}},{\bm{V}}\rangle\to q_{v}^{*}\in\mathbb{R}$ .

The conditional expectation $\widebar{{\bm{u}}}=\mathbb{E}[{\bm{U}}|{\bm{Y}},{\bm{S}}]$ is the orthogonal projection (in $L^{2}$ sense) of the random vector ${\bm{U}}$ onto the subspace of ${\bm{Y}},{\bm{S}}$ -measurable random variables. The squared $L^{2}$ norm of the projection $\widebar{{\bm{u}}}$ is equal to the scalar product of the vector ${\bm{U}}$ with its projection $\widebar{{\bm{u}}}$ : $\mathbb{E}\|\widebar{{\bm{u}}}\|^{2}=\mathbb{E}\langle\widebar{{\bm{u}}},{\bm{U}}\rangle$ . Assuming that $\|\widebar{{\bm{u}}}\|^{2}$ also admits a deterministic limit, this limit is then equal to $q_{u}^{*}$ . We get for large $N$ and $D$ , $\|\widebar{{\bm{u}}}\|^{2}\simeq\langle\widebar{{\bm{u}}},{\bm{U}}\rangle\simeq q_{u}^{*}$ . Analogously we have $\frac{1}{N}\|\widebar{{\bm{v}}}\|^{2}\simeq\frac{1}{N}\langle\widebar{{\bm{v}}},{\bm{V}}\rangle\simeq q_{v}^{*}$ .

We will show below that $q_{u}^{*}$ and $q_{v}^{*}$ obey fixed point equations that determine them.

As seen above, if we aim at estimating a label $V_{i}$ that we did not observe (i.e. $S_{i}=0$ ) given ${\bm{Y}},{\bm{S}}$ and the “oracle” ${\bm{U}}$ , we compute the sufficient statistic $\widetilde{Y}_{i}=\langle{\bm{Y}}_{i},{\bm{U}}\rangle=V_{i}+\sigma\mathcal{N}(0,1)$ . The estimator that minimizes the probability of error $\mathbb{P}(\widehat{v}\neq V_{i})$ is simply $\widehat{v}_{i}={\rm sign}(\widetilde{Y}_{i})$ . The one that minimizes the mean squared error (MSE) is $\widehat{v}_{i}=\mathbb{E}[V_{i}|\widetilde{Y}_{i}]$ which achieves a MSE of

$$\mathbb{E}\big[(V_{i}-\widehat{v}_{i})^{2}\big]={\rm mmse}_{v}(1/\sigma^{2}),$$

where we define for $(V,Z)\sim{\rm Unif}(-1,+1)\otimes\mathcal{N}(0,1)$ and $\gamma>0$ (see Section 7.1 for more details):

$${\rm mmse}_{v}(\gamma)\overset{\rm def}{=}\mathbb{E}\big[(V-\mathbb{E}[V|\sqrt{\gamma}V+Z])^{2}\big].$$
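To make the quantity ${\rm mmse}_{v}(\gamma)$ concrete, the sketch below estimates it in two ways for $V\sim{\rm Unif}(-1,+1)$ and $Y=\sqrt{\gamma}V+Z$ : by Monte Carlo, using the posterior mean $\mathbb{E}[V|Y]=\tanh(\sqrt{\gamma}Y)$ (a direct consequence of Bayes' rule for a symmetric binary prior), and through the closed form $1-\mathbb{E}\tanh(\sqrt{\gamma}Z+\gamma)$ given in Section 7.1, evaluated here by Gauss-Hermite quadrature. The function names, the quadrature order and the sample size are assumptions of this illustration.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for E[g(Z)], Z ~ N(0, 1): E[g(Z)] ~ sum(_w * g(_x)).
_x, _w = hermegauss(100)
_w = _w / math.sqrt(2.0 * math.pi)

def mmse_v(gamma):
    """Closed form 1 - E tanh(sqrt(gamma) Z + gamma), by quadrature."""
    return 1.0 - float(np.sum(_w * np.tanh(math.sqrt(gamma) * _x + gamma)))

def mmse_v_monte_carlo(gamma, n, rng):
    """Empirical MSE of the posterior mean tanh(sqrt(gamma) Y) for Y = sqrt(gamma) V + Z."""
    v = rng.choice([-1.0, 1.0], size=n)
    y = math.sqrt(gamma) * v + rng.standard_normal(n)
    return float(np.mean((v - np.tanh(math.sqrt(gamma) * y)) ** 2))

rng = np.random.default_rng(0)
for gamma in (0.5, 1.0, 4.0):
    print(gamma, mmse_v(gamma), mmse_v_monte_carlo(gamma, 400_000, rng))
```

At $\gamma=0$ the observation carries no information and the value is $1$ ; it decreases towards $0$ as $\gamma$ grows.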

In the case where we do not have access to the oracle ${\bm{U}}$ , one can still use $\widebar{{\bm{u}}}$ as a proxy. We repeat the same procedure assuming that $\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle$ is a sufficient statistic for estimating $V_{i}$ . Although this is not strictly true, we shall see that this leads to the correct fixed point equations for $q_{u}^{*},q_{v}^{*}$ . Compute

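$$\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle=\langle\widebar{{\bm{u}}},{\bm{U}}\rangle V_{i}+\sigma\langle\widebar{{\bm{u}}},{\bm{Z}}_{i}\rangle\simeq q_{u}^{*}V_{i}+\sigma\langle\widebar{{\bm{u}}},{\bm{Z}}_{i}\rangle.$$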

The posterior mean $\widebar{{\bm{u}}}$ is not expected to depend much on the particular point ${\bm{Y}}_{i}$ and therefore on ${\bm{Z}}_{i}$ . This gives that the random vectors $\widebar{{\bm{u}}}$ and ${\bm{Z}}_{i}$ are approximately independent. Hence the distribution of $\langle\widebar{{\bm{u}}},{\bm{Z}}_{i}\rangle$ is roughly $\mathcal{N}(0,q_{u}^{*})$ (we recall that $\|\widebar{{\bm{u}}}\|^{2}\simeq q_{u}^{*}$ ). We get

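$$\frac{1}{\sigma\sqrt{q_{u}^{*}}}\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle\simeq\sqrt{q_{u}^{*}/\sigma^{2}}\,V_{i}+Z$$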

in law, where $Z\sim\mathcal{N}(0,1)$ . The best estimator $\widehat{v}_{i}$ (in terms of MSE) one can then construct using $\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle$ achieves a MSE of

$$\mathbb{E}\big[(V_{i}-\widehat{v}_{i})^{2}\big]\simeq{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}).$$

We assumed that $\langle\widebar{{\bm{u}}},{\bm{Y}}_{i}\rangle$ is a sufficient statistic for estimating $V_{i}$ , therefore $\widehat{v}_{i}=\widebar{v}_{i}$ . For all the $\eta N$ indices $i$ such that $S_{i}=V_{i}$ we have obviously $\widebar{v}_{i}=V_{i}$ . Hence

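$$\frac{1}{N}\mathbb{E}\|{\bm{V}}-\widebar{{\bm{v}}}\|^{2}\simeq\frac{1}{N}\sum_{i\,|\,S_{i}=0}{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2})\simeq(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}).$$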

Since we have $\frac{1}{N}\mathbb{E}\|{\bm{V}}-\widebar{{\bm{v}}}\|^{2}\simeq 1-q_{v}^{*}$ , we get

$$1-q_{v}^{*}\simeq(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}).$$

We can do the same reasoning with $\widebar{{\bm{u}}}$ instead of $\widebar{{\bm{v}}}$ . We denote by ${\bm{R}}_{i}$ (resp. $\tilde{{\bm{Z}}}_{i}$ ) the $i$ -th row of the matrix ${\bm{Y}}$ (resp. ${\bm{Z}}$ ), so that we have

$${\bm{R}}_{i}=U_{i}{\bm{V}}+\sigma\tilde{{\bm{Z}}}_{i}.$$

Hence taking the scalar product with $\frac{1}{N}\widebar{{\bm{v}}}$ gives

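$$\frac{1}{\sqrt{N\sigma^{2}q_{v}^{*}}}\langle\widebar{{\bm{v}}},{\bm{R}}_{i}\rangle\simeq\sqrt{\frac{\alpha q_{v}^{*}}{\sigma^{2}}}\sqrt{D}\,U_{i}+Z$$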

in law, where $Z\sim\mathcal{N}(0,1)$ . Recall that $D\mathbb{E}[U_{i}^{2}]=1$ . Making the same assumption as above, the best estimator $\widehat{u}_{i}$ one can construct using $\langle\widebar{{\bm{v}}},{\bm{R}}_{i}\rangle$ achieves a MSE of

$$\mathbb{E}\big[D(U_{i}-\widehat{u}_{i})^{2}\big]\simeq{\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2}),$$

where ${{\rm mmse}}_{u}(\gamma)=\mathbb{E}[(U-\mathbb{E}[U|\sqrt{\gamma}U+Z])^{2}]$ for $U,Z\overset{\text{\tiny i.i.d.}}{\sim}\mathcal{N}(0,1)$ . This leads to

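$$\mathbb{E}\big[\|{\bm{U}}-\widebar{{\bm{u}}}\|^{2}\big]=1-q_{u}^{*}\simeq{\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2}).$$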

As shown in Section 7.1, we have

$${\rm mmse}_{u}(\gamma)=\frac{1}{1+\gamma}.$$

We conclude that $(q_{u}^{*},q_{v}^{*})$ satisfies the following fixed point equations:

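$$q_{v}^{*}=1-(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2}),\qquad q_{u}^{*}=1-{\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2})=\frac{\alpha q_{v}^{*}}{\sigma^{2}+\alpha q_{v}^{*}}.$$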
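The fixed point equations can be solved numerically by plain iteration. The sketch below assumes they take the form obtained by combining the relations derived above, namely $q_{v}^{*}=1-(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2})$ and $q_{u}^{*}=\alpha q_{v}^{*}/(\sigma^{2}+\alpha q_{v}^{*})$ (using ${\rm mmse}_{u}(\gamma)=1/(1+\gamma)$ ), together with the closed form ${\rm mmse}_{v}(\gamma)=1-\mathbb{E}\tanh(\sqrt{\gamma}Z+\gamma)$ from Section 7.1. The iteration scheme, the starting point $q_{u}=0$ , the quadrature order and the function names are choices of this illustration, not something prescribed by the paper.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for E[g(Z)], Z ~ N(0, 1): E[g(Z)] ~ sum(_w * g(_x)).
_x, _w = hermegauss(100)
_w = _w / math.sqrt(2.0 * math.pi)

def mmse_v(gamma):
    """mmse_v(gamma) = 1 - E tanh(sqrt(gamma) Z + gamma), by quadrature."""
    return 1.0 - float(np.sum(_w * np.tanh(math.sqrt(gamma) * _x + gamma)))

def solve_fixed_point(alpha, sigma, eta, tol=1e-12, max_iter=10_000):
    """Iterate q_u -> q_v -> q_u starting from q_u = 0 until the update is below tol."""
    q_u, q_v = 0.0, eta
    for _ in range(max_iter):
        q_v = 1.0 - (1.0 - eta) * mmse_v(q_u / sigma**2)
        q_u_next = alpha * q_v / (sigma**2 + alpha * q_v)
        if abs(q_u_next - q_u) < tol:
            return q_u_next, q_v
        q_u = q_u_next
    return q_u, q_v          # not converged within max_iter; returned as-is

for eta in (0.01, 0.1, 0.5, 1.0):
    print(eta, solve_fixed_point(alpha=2.0, sigma=1.0, eta=eta))
```

Since ${\rm mmse}_{v}$ is decreasing, each update is monotone in $q_{u}$ , so the iterates started at $0$ increase towards a fixed point. For $\eta=1$ the iteration returns $q_{v}^{*}=1$ and $q_{u}^{*}=\alpha/(\alpha+\sigma^{2})$ .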

We introduce the following mutual information

$$\mathsf{i}_{v}(\gamma)=I(V_{0};\sqrt{\gamma}V_{0}+Z_{0}),$$

where $V_{0}\sim{\rm Unif}(-1,+1)$ and $Z_{0}\sim\mathcal{N}(0,1)$ are independent. An elementary computation leads to (see Section 7.1)

$$\mathsf{i}_{v}(\gamma)=\gamma-\mathbb{E}\log\cosh(\sqrt{\gamma}Z_{0}+\gamma).$$

By the “I-MMSE” Theorem from [18], $\mathsf{i}_{v}$ is related to ${{\rm mmse}}_{v}$ :

$$\mathsf{i}_{v}^{\prime}(\gamma)=\frac{1}{2}{\rm mmse}_{v}(\gamma).$$

Let us compute the derivative of $f_{\alpha,\sigma,\eta}$ defined by (4), using (11):

[TABLE]

Using (7)-(8), one verifies easily that $f^{\prime}_{\alpha,\sigma,\eta}(q_{u}^{*})=0$ . By Proposition 1 (proved in Section 7.3), $f_{\alpha,\sigma,\eta}$ admits a unique critical point on $[0,1)$ which is its unique minimizer: $q_{u}^{*}$ is therefore the minimizer of $f_{\alpha,\sigma,\eta}$ .
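The verification can be written out explicitly. The following worked computation is a sketch under two assumptions: the I-MMSE relation is used in the form $\mathsf{i}_{v}^{\prime}(\gamma)=\frac{1}{2}{\rm mmse}_{v}(\gamma)$ , and the fixed point equations are used in the form $1-q_{v}^{*}=(1-\eta)\,{\rm mmse}_{v}(q_{u}^{*}/\sigma^{2})$ and $q_{u}^{*}/(1-q_{u}^{*})=\alpha q_{v}^{*}/\sigma^{2}$ , the latter being equivalent to $1-q_{u}^{*}={\rm mmse}_{u}(\alpha q_{v}^{*}/\sigma^{2})$ with ${\rm mmse}_{u}(\gamma)=1/(1+\gamma)$ .

```latex
\begin{align*}
f^{\prime}_{\alpha,\sigma,\eta}(q)
  &= \frac{\alpha(1-\eta)}{\sigma^{2}}\,\mathsf{i}_{v}^{\prime}(q/\sigma^{2})
     -\frac{\alpha}{2\sigma^{2}}
     -\frac{1}{2}\Big(1-\frac{1}{1-q}\Big) \\
  &= \frac{\alpha}{2\sigma^{2}}\Big((1-\eta)\,{\rm mmse}_{v}(q/\sigma^{2})-1\Big)
     +\frac{q}{2(1-q)}, \\[4pt]
f^{\prime}_{\alpha,\sigma,\eta}(q_{u}^{*})
  &= \frac{\alpha}{2\sigma^{2}}\big((1-q_{v}^{*})-1\big)
     +\frac{1}{2}\cdot\frac{\alpha q_{v}^{*}}{\sigma^{2}}
   = -\frac{\alpha q_{v}^{*}}{2\sigma^{2}}+\frac{\alpha q_{v}^{*}}{2\sigma^{2}}
   = 0 .
\end{align*}
```

As a side check, for $\eta=1$ the first line gives $f^{\prime}(q)=-\frac{\alpha}{2\sigma^{2}}+\frac{q}{2(1-q)}$ , whose unique zero is $q=\alpha/(\alpha+\sigma^{2})$ , consistent with the fully supervised case.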

If we now want to estimate $V_{\rm new}$ from ${\bm{Y}},{\bm{S}}$ and ${\bm{Y}}_{\rm new}$ we assume, as above that $\langle\widebar{{\bm{u}}},{\bm{Y}}_{\rm new}\rangle$ is a sufficient statistic. As for (6), we have

[TABLE]

in law, where $Z\sim\mathcal{N}(0,1)$ is independent of $V_{\rm new}$ . The Bayes classifier is then

[TABLE]

hence

[TABLE]

which is the statement of our main Theorem 1 above.

5 Proof sketch

From now we simply write $q^{*}$ instead of $q^{*}(\alpha,\sigma,\eta)$ . The next theorem computes the limit of the log-likelihood ratio.

Theorem 2.

Conditionally on $V_{\rm new}=\pm 1$ ,

[TABLE]

Before sketching the proof of this theorem, we show how Theorem 1 follows from Theorem 2 above. The optimal estimator for $V_{\rm new}$ is given by the sign of the log-likelihood ratio, hence we get:

[TABLE]

Proof.

Let us look at the posterior distribution of $V_{\rm new},{\bm{U}}$ given ${\bm{Y}},{\bm{S}},{\bm{Y}}_{\rm new}$ . From Bayes' rule we get

[TABLE]

Let $\widebar{{\bm{u}}}=\mathbb{E}[{\bm{U}}|{\bm{Y}},{\bm{S}}]$ . The following lemma is proved in the supplementary material, see Section 7.3.

Lemma 1.

Let ${\bm{u}}^{(1)},{\bm{u}}^{(2)}$ be i.i.d. samples from the posterior distribution of ${\bm{U}}$ given ${\bm{Y}},{\bm{S}}$ , independently of everything else. Then

[TABLE]

For $v\in\{-1,+1\}$ we define

[TABLE]

Using Lemma 1, we prove the following lemma in Section 7.3

Lemma 2.

For $v=\pm 1$ , $A_{N}(v)-B_{N}(v)\xrightarrow[N,D\to\infty]{L^{2}}0$ .

Since $|\log A_{N}(v)-\log B_{N}(v)|\leq(A_{N}(v)^{-1}+B_{N}(v)^{-1})(A_{N}(v)-B_{N}(v))$ , we have by Cauchy-Schwarz inequality:

[TABLE]

using Lemma 2 (one can verify easily that the first term of the product above is $O(1)$ ). We get $\log A_{N}(v)-\log B_{N}(v)\xrightarrow[N,D\to\infty]{L^{1}}0$ , hence

[TABLE]

$\widebar{{\bm{u}}}$ is independent of $({\bm{V}}_{\rm new},{\bm{Z}}_{\rm new})$ and by Lemma 1 we have $\|\widebar{{\bm{u}}}\|^{2}\to q^{*}$ . Consequently $\langle\widebar{{\bm{u}}},{\bm{Z}}_{\rm new}\rangle\xrightarrow[N,D\to\infty]{(d)}\mathcal{N}(0,q^{*})$ and we conclude:

[TABLE]

where $Z_{0}\sim\mathcal{N}(0,q^{*})$ is independent of $V_{\rm new}$ . ∎

6 Conclusion

We analyzed a simple high-dimensional Gaussian mixture model in a semi-supervised setting and computed the associated Bayes risk. In our model, we are able to compute the best possible accuracy of semi-supervised learning using both labeled and unlabeled data as well as the best possible performances of supervised learning using only the labeled data and unsupervised learning using all data but without any label. This allows us to quantify the added value of unlabeled data. When the clusters are well separated (probably the most realistic setting), we find that the value of unlabeled data is dominating. Labeled data can almost be ignored as unsupervised learning achieved roughly the same performance as semi-supervised learning. Nevertheless, using a few labeled data is often very helpful in practice as shown by the recent MixMatch algorithm [8].

We believe our main result, Theorem 1, gives new insights for semi-supervised learning, and we designed our model with a focus on simplicity. However, our proof technique is very general and can handle much more complex models. For example, we can deal with classes of different sizes by changing the prior of $V_{\rm new}$ . Another extension to which our proof carries over consists in modifying the channel for the side information. Here, we considered the erasure channel, corresponding to the standard SSL setting, but our proof still works for other channels, such as the binary symmetric channel or the Z channel, corresponding to a setting with noisy labels.

7 Supplementary material

7.1 Gaussian channel

We give here some easy computations for the Gaussian channel:

[TABLE]

where $Z\sim\mathcal{N}(0,1)$ is independent of $U$ .

We first consider the case where $U\sim\mathcal{N}(0,1)$ . We define ${{\rm mmse}}_{u}(\gamma)=\mathbb{E}\left[\left(U-\mathbb{E}\left[U|Y\right]\right)^{2}\right]$ . Since we are dealing with Gaussian random variables, $\mathbb{E}\left[U|Y\right]$ is simply the orthogonal projection of $U$ on $Y$ :

[TABLE]

Hence, we have

[TABLE]

Thanks to the I-MMSE relation [18], we have $\frac{1}{2}{{\rm mmse}}_{u}(\gamma)=\frac{\partial}{\partial\gamma}I(U;Y)$ . For $\gamma=0$ , $U$ and $Y$ are independent: $I(U;Y)_{\gamma=0}=0$ , so that we get

[TABLE]
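The Gaussian-prior computation can be illustrated numerically. The sketch below assumes the channel has the standard form $Y=\sqrt{\gamma}U+Z$ (the displayed definition is not reproduced here, so this is an assumption), for which the projection is $\mathbb{E}[U|Y]=\frac{\sqrt{\gamma}}{1+\gamma}Y$ and ${\rm mmse}_{u}(\gamma)=\frac{1}{1+\gamma}$. It checks the MMSE by Monte Carlo and recovers $\frac{1}{2}\log(1+\gamma)$ by integrating the I-MMSE relation. Function names are illustrative.

```python
import math
import random

def mmse_gaussian(gamma):
    """Closed form 1 / (1 + gamma), under the assumed channel Y = sqrt(gamma) U + Z."""
    return 1.0 / (1.0 + gamma)

def monte_carlo_mmse(gamma, n_samples=100_000, seed=0):
    """Estimate E[(U - E[U|Y])^2] using the linear projection of U on Y."""
    rng = random.Random(seed)
    coeff = math.sqrt(gamma) / (1.0 + gamma)
    total = 0.0
    for _ in range(n_samples):
        u = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        y = math.sqrt(gamma) * u + z
        total += (u - coeff * y) ** 2
    return total / n_samples

def mutual_information_by_i_mmse(gamma, n_steps=10_000):
    """Integrate (1/2) mmse_u over [0, gamma] with the trapezoid rule, using I = 0 at gamma = 0."""
    h = gamma / n_steps
    total = 0.5 * (mmse_gaussian(0.0) + mmse_gaussian(gamma))
    for k in range(1, n_steps):
        total += mmse_gaussian(k * h)
    return 0.5 * total * h

gamma = 2.0
print("Monte Carlo mmse:", monte_carlo_mmse(gamma), "closed form:", mmse_gaussian(gamma))
print("integrated I:", mutual_information_by_i_mmse(gamma),
      "0.5*log(1+gamma):", 0.5 * math.log(1.0 + gamma))
```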

We now consider the case where $U\sim{\rm Unif}(\{-1,+1\})$ . We define $\mathsf{i}_{v}(\gamma)=I(U;Y)$ . Recall that

[TABLE]

And here, we have

[TABLE]

Hence, we have

[TABLE]

Thanks to the I-MMSE relation, we have:

[TABLE]

so that we have ${{\rm mmse}}_{v}(\gamma)=1-\mathbb{E}\tanh\left(\sqrt{\gamma}Z+\gamma\right)$ .
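The I-MMSE relation for the binary prior can be checked numerically from the two closed forms $\mathsf{i}_{v}(\gamma)=\gamma-\mathbb{E}\log\cosh(\sqrt{\gamma}Z+\gamma)$ and ${\rm mmse}_{v}(\gamma)=1-\mathbb{E}\tanh(\sqrt{\gamma}Z+\gamma)$. The sketch below approximates the Gaussian expectations with a trapezoid rule on a truncated interval (a numerical choice made here, not part of the paper) and compares a central finite difference of $\mathsf{i}_{v}$ with $\frac{1}{2}{\rm mmse}_{v}$.

```python
import math

def gaussian_expectation(func, half_width=12.0, n_steps=4000):
    """Approximate E[func(Z)], Z ~ N(0,1), by the trapezoid rule on [-half_width, half_width]."""
    h = 2.0 * half_width / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * func(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def log_cosh(x):
    """Numerically stable log(cosh(x))."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def i_v(gamma):
    """Mutual information for the binary prior: gamma - E log cosh(sqrt(gamma) Z + gamma)."""
    s = math.sqrt(gamma)
    return gamma - gaussian_expectation(lambda z: log_cosh(s * z + gamma))

def mmse_v(gamma):
    """MMSE for the binary prior: 1 - E tanh(sqrt(gamma) Z + gamma)."""
    s = math.sqrt(gamma)
    return 1.0 - gaussian_expectation(lambda z: math.tanh(s * z + gamma))

gamma, step = 1.0, 1e-4
finite_difference = (i_v(gamma + step) - i_v(gamma - step)) / (2.0 * step)
print("d i_v / d gamma ~", finite_difference, " vs  mmse_v / 2 =", 0.5 * mmse_v(gamma))
```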

7.2 Convergence of the mutual information

Theorem 3.

For all $\alpha,\sigma>0$ , $\eta\in(0,1]$ ,

[TABLE]

Further, this minimum is achieved at a unique point $q^{*}(\alpha,\sigma,\eta)$ and

[TABLE]

where ${\bm{u}}$ is a sample from the posterior distribution of ${\bm{U}}$ given ${\bm{Y}},{\bm{S}}$ , independently of everything else.

Proof.

The limit (13) was proved in [27] in the case $\eta=0$ . The proof can however be straightforwardly adapted to the case $\eta\neq 0$ and leads to

[TABLE]

where $\mathsf{i}_{u}(\gamma)=\frac{1}{2}\log(1+\gamma)$ . The supremum in $q_{v}$ can be easily computed, leading to:

[TABLE]

This proves (13). The fact that $f_{\alpha,\sigma,\eta}$ admits a unique minimizer $q_{u}^{*}(\alpha,\sigma,\eta)$ comes from Proposition 1.

From the limit of the mutual information, one gets the limits of minimal mean squared errors (MMSE) using the “I-MMSE” relation [18]:

[TABLE]

Let ${\bm{u}}$ be a sample from the posterior distribution of ${\bm{U}}$ given ${\bm{Y}},{\bm{S}}$ , independently of everything else. Then we deduce

[TABLE]

In order to show that $\langle{\bm{u}},{\bm{U}}\rangle\xrightarrow[N,D\to\infty]{}q_{u}^{*}(\alpha,\sigma,\eta)$ it remains to show that

[TABLE]

This can be done (as in [5]) by adding a small amount of additional side-information to the model of the form ${\bm{Y}}=\sqrt{\epsilon D}{\bm{U}}^{\otimes 4}+{\bm{W}}$ , where the entries of the tensor ${\bm{W}}$ are i.i.d. standard Gaussian: $(W_{i_{1},i_{2},i_{3},i_{4}})_{1\leq i_{1},i_{2},i_{3},i_{4}\leq D}\overset{\text{\tiny i.i.d.}}{\sim}\mathcal{N}(0,1)$ . We then apply the I-MMSE relation with respect to $\epsilon$ to obtain (15). ∎

7.3 Technical lemmas

We now give the proof of Lemma 1.

Proof.

Notice that, by Bayes' rule, we have $({\bm{U}},{\bm{u}}^{(1)})\overset{{\rm(d)}}{=}({\bm{u}}^{(2)},{\bm{u}}^{(1)})$ . So we have by (14) $\langle{\bm{U}},{\bm{u}}^{(1)}\rangle,\langle{\bm{u}}^{(1)},{\bm{u}}^{(2)}\rangle\xrightarrow[N,D\to\infty]{}q^{*}$ . Now, by Jensen’s inequality:

[TABLE]

Since $\mathbb{E}[\langle{\bm{u}}^{(1)},{\bm{u}}^{(2)}\rangle|{\bm{Y}},{\bm{S}}]=\langle\mathbb{E}[{\bm{u}}^{(1)}|{\bm{Y}},{\bm{S}}],\mathbb{E}[{\bm{u}}^{(2)}|{\bm{Y}},{\bm{S}}]\rangle=\|\widebar{{\bm{u}}}\|^{2}$ and $\mathbb{E}[\langle{\bm{u}}^{(1)},{\bm{u}}^{(2)}\rangle|{\bm{Y}},{\bm{S}},{\bm{u}}^{(1)}]=\langle{\bm{u}}^{(1)},\mathbb{E}[{\bm{u}}^{(2)}|{\bm{Y}},{\bm{S}}]\rangle=\langle{\bm{u}}^{(1)},\widebar{{\bm{u}}}\rangle$ , this leads to

[TABLE]

∎

We now give a proof of Lemma 2.

Proof.

In order to prove that $A_{N}(v)-B_{N}(v)\xrightarrow[N,D\to\infty]{L^{2}}0$ , it suffices to show that $\lim\mathbb{E}[A_{N}(v)^{2}]=\lim\mathbb{E}[B_{N}(v)^{2}]=\lim\mathbb{E}[A_{N}(v)B_{N}(v)]$ . Let ${\bm{u}}^{(1)},{\bm{u}}^{(2)}$ be i.i.d. samples from the posterior distribution of ${\bm{U}}$ given ${\bm{Y}},{\bm{S}}$ , independently of everything else. Using Lemma 1, we compute:

[TABLE]

Integrating with respect to ${\bm{Z}}_{\rm new}\sim\mathcal{N}(0,{\rm Id}_{D})$ only, we get

[TABLE]

where the last limit follows from Lemma 1. Following the same steps we compute:

[TABLE]

and

[TABLE]

Since the three limits above coincide, the lemma is proved. ∎
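The reduction stated at the start of this proof rests on expanding the square. As a short sketch, writing $\ell$ for the common value of the three limits (a symbol introduced here for illustration):

```latex
\mathbb{E}\big[(A_{N}(v)-B_{N}(v))^{2}\big]
= \mathbb{E}\big[A_{N}(v)^{2}\big]
  - 2\,\mathbb{E}\big[A_{N}(v)B_{N}(v)\big]
  + \mathbb{E}\big[B_{N}(v)^{2}\big]
\xrightarrow[N,D\to\infty]{} \ell - 2\ell + \ell = 0 .
```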

Proposition 1.

For all $\alpha,\sigma>0$ and all $\eta\in(0,1]$ , the function $f_{\alpha,\sigma,\eta}$ admits a unique critical point which is its unique minimizer on $[0,1)$ .

Proof.

Recall that $\mathsf{i}_{v}(\gamma)=\gamma-\mathbb{E}\log\cosh(\sqrt{\gamma}Z+\gamma)$ , where $Z\sim\mathcal{N}(0,1)$ . A computation gives $\mathsf{i}^{\prime}_{v}(\gamma)=\frac{1}{2}(1-\mathbb{E}\tanh(\sqrt{\gamma}Z+\gamma))$ . We define,

[TABLE]

where $Z\sim\mathcal{N}(0,1)$ . Hence

[TABLE]

The critical points of $f_{\alpha,\sigma,\eta}$ are the solutions of

[TABLE]

As proved in [13, Lemma 6.1], the function $h$ is concave. This gives that $F$ is concave. Since $F$ is upper-bounded by $1$ and $F(0)=\frac{\alpha\eta}{\sigma^{2}+\alpha\eta}>0$ , we get that $F$ admits a unique fixed point on $[0,1]$ . The function $f_{\alpha,\sigma,\eta}$ therefore admits a unique critical point on $[0,1)$ , which is necessarily a minimizer since $f^{\prime}_{\alpha,\sigma,\eta}(0)=-\frac{\eta\alpha}{2\sigma}$ and $\lim_{q\to 1}f^{\prime}_{\alpha,\sigma,\eta}(q)=+\infty$ . ∎
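The fixed-point argument can be illustrated numerically. The exact definitions of $h$ and $F$ are those of the displayed equations in the proof, which are not reproduced here. The sketch below therefore uses assumed stand-ins: $h(\gamma)=\mathbb{E}\tanh(\sqrt{\gamma}Z+\gamma)$ and $F(q)=\frac{\alpha x}{\sigma^{2}+\alpha x}$ with $x=\eta+(1-\eta)h(q/\sigma^{2})$. These were chosen only because they are consistent with $F(0)=\frac{\alpha\eta}{\sigma^{2}+\alpha\eta}$, $F\leq 1$ and concavity, and they should not be read as the paper's definitions. Since $F(q)-q$ is concave, positive at $0$ and non-positive at $1$, it changes sign exactly once, so bisection finds the fixed point.

```python
import math

def gaussian_expectation(func, half_width=12.0, n_steps=2000):
    """Approximate E[func(Z)], Z ~ N(0,1), by the trapezoid rule on [-half_width, half_width]."""
    step = 2.0 * half_width / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        z = -half_width + k * step
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * func(z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2.0 * math.pi)

def h(gamma):
    """ASSUMED form of h: E tanh(sqrt(gamma) Z + gamma)."""
    s = math.sqrt(gamma)
    return gaussian_expectation(lambda z: math.tanh(s * z + gamma))

def F(q, alpha, sigma, eta):
    """ASSUMED form of F, consistent with F(0) = alpha*eta / (sigma^2 + alpha*eta)."""
    x = alpha * (eta + (1.0 - eta) * h(q / sigma ** 2))
    return x / (sigma ** 2 + x)

def fixed_point(alpha, sigma, eta, tol=1e-10):
    """Bisection on G(q) = F(q) - q over [0, 1]; G > 0 at 0 and G <= 0 at 1."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid, alpha, sigma, eta) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, sigma, eta = 1.0, 1.0, 0.1  # arbitrary illustrative parameters
q_star = fixed_point(alpha, sigma, eta)
print("fixed point q* ~", q_star, " residual:", F(q_star, alpha, sigma, eta) - q_star)
```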

Bibliography

[1] J. Baik, G. Ben Arous, S. Péché, et al. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
[2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
[3] J. Banks, C. Moore, R. Vershynin, N. Verzelen, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. IEEE Transactions on Information Theory, 64(7):4872–4894, 2018.
[4] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, L. Zdeborová, et al. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems, pages 424–432, 2016.
[5] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
[6] N. Barkai and H. Sompolinsky. Statistical mechanics of the maximum-likelihood density estimation. Physical Review E, 50(3):1766, 1994.
[7] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
[8] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel. MixMatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.
