Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting
Marc Lelarge, Leo Miolane

TL;DR
This paper analytically quantifies the performance gain in semi-supervised learning over supervised learning for Gaussian mixture models in high dimensions, using advanced mathematical tools from statistical physics.
Contribution
It provides the first rigorous analysis of the asymptotic Bayes risk gap in semi-supervised Gaussian mixture models, leveraging recent high-dimensional inference theories.
Findings
Quantifies the accuracy improvement due to unlabeled data.
Provides explicit formulas for the Bayes risk gap.
Demonstrates the impact of unlabeled data in high-dimensional settings.
Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting
Marc Lelarge
INRIA-ENS
Paris, France
Léo Miolane
INRIA-ENS
Paris, France
Abstract
Semi-supervised learning (SSL) uses unlabeled data for training and has been shown to greatly improve performance when compared to a supervised approach on the labeled data available. This claim depends both on the amount of labeled data available and on the algorithm used. In this paper, we compute analytically the gap between the best fully-supervised approach using only labeled data and the best semi-supervised approach using both labeled and unlabeled data. We quantify the best possible increase in performance obtained thanks to the unlabeled data, i.e. we compute the accuracy increase due to the information contained in the unlabeled data. Our work deals with a simple high-dimensional Gaussian mixture model for the data in a Bayesian setting. Our rigorous analysis builds on recent theoretical breakthroughs in high-dimensional inference and a large body of mathematical tools from statistical physics initially developed for spin glasses.
1 Introduction
Semi-supervised learning (SSL) has proven to be a powerful paradigm for reducing the reliance on large labeled datasets: its goal is to leverage large amounts of unlabeled data to improve the performance of supervised learning on small labeled datasets. For unlabeled examples to be informative, assumptions have to be made. The cluster assumption states that, if two samples belong to the same cluster in the input distribution, then they are likely to belong to the same class. The cluster assumption is equivalent to the low-density separation assumption: the decision boundary should lie in a low-density region.
In this paper, we explore analytically the simplest possible parametric model for the cluster assumption: the two clusters are modeled by a mixture of two high-dimensional Gaussians with diagonal covariance so that the optimal decision boundary is a hyperplane. Our model can be seen as a classification problem in a semi-supervised setting. Our aim here is to define a model simple enough to be mathematically tractable while being practically relevant and capturing the main properties of a high-dimensional statistical inference problem.
Our model has three parameters: the ratio of the number of samples to the ambient dimension, which captures the high dimensionality of the data; the fraction of labeled data points; and the amount of overlap between the clusters. As a function of these three parameters, we compute the best possible accuracy (the Bayes risk) when only labeled data are used or when unlabeled data are also used. As a result, we obtain the added value due to the unlabeled data for the best possible algorithm. In particular, we observe a very clear diminishing return of the labeled data, i.e. the first labeled data points bring much more information than the last ones. Hence the regime with very few labeled data points is a priori a regime favorable to SSL. But in this case, we face in practice the problem of small validation sets [28], which makes hyperparameter tuning impossible.
We find that the range of parameters for which SSL clearly outperforms either unsupervised learning or supervised learning on the labeled data is rather narrow. In a case with large overlap between the clusters (), unsupervised learning fails and supervised learning on the labeled data is almost optimal. In a case with small overlap between the clusters (), unsupervised learning achieves performances very close to supervised learning with all labels available while using only the labeled dataset fails.
From a practical perspective, we can try to draw parallels between our results and the state of the art in SSL but we need to keep in mind that our results only give best achievable performances on our toy model. In particular, even in a setting where our results predict that unsupervised learning achieves roughly the same performances as supervised learning with all labels, it might be very useful in practice to use a few labels in addition to all unlabeled data. Such an approach is presented in [8] where extremely good performances are achieved for image classification with only a few labeled data per class and a new SSL algorithm: MixMatch. For example on CIFAR-10, with only 250 labeled images, MixMatch achieves an error rate of 11.08% and with 4000 labeled images, an error rate of 6.24% (to be compared with the 4.17% error rate for the fully supervised training on all 50000 samples). These results are aligned with our finding about diminishing returns of labeled data points.
We make the following contributions:
Bayes risk: to the best of our knowledge, our work is the first analytic computation of the Bayes risk in a high-dimensional Gaussian model in a semi-supervised setting.
Rigorous analysis: our analysis builds on a series of recent works [13, 4, 22, 27, 5] with tools from information theory and mathematical physics originally developed for the analysis of spin glasses [29, 32].
The rest of the paper is organized as follows. Our model and the main result are presented in Section 2. Related work is presented in Section 3. In Section 4, we give a heuristic derivation of the main result and in Section 5, we give a proof sketch, while the more technical details are presented in the supplementary material (Section 7). We conclude in Section 6.
2 Model and main results
We now define our classification problem with two classes. The points of the dataset are in and given by the following process:
[TABLE]
where , and are all independent.
In words, the dataset is composed of points in divided into two classes with roughly equal sizes. The points with label are centered around and the points with label are centered around . The parameter controls the level of Gaussian noise around these centers.
In a semi-supervised setting, the statistician has access to some labels. We consider a case where each label is revealed with probability independently of everything else. To fix notation, the side information is given by the following process:
[TABLE]
If , then the label of the -th data point is unknown whereas if , it corresponds to the label of the -th data point.
Finally, we consider the high-dimensional setting and all our results will be in a regime where while the ratio tends towards a constant . Note that we are in a high noise regime since the squared norm of the signal is one whereas the squared norm of the noise is where is the number of observations.
To summarize, the three parameters of our model are: the variance of the noise in the dataset, the fraction of revealed labels and the ratio between the number of data points (both labeled and unlabeled) and the dimension of the ambient space. We also assume that the statistician knows the priors, i.e. the distribution of and .
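To make the generative process concrete, here is a minimal simulation sketch in Python. It assumes the reading of the model suggested by the prose above: a unit-norm center vector u, labels drawn uniformly from {-1, +1}, isotropic Gaussian noise of standard deviation sigma, and each label revealed independently with probability eta. The names u, sigma, eta and sample_dataset, as well as the encoding of a hidden label as 0, are illustrative assumptions.

```python
import math
import random


def sample_dataset(n, d, sigma, eta, rng):
    """Draw one dataset from the assumed two-cluster model.

    Returns (u, labels, points, side_info); a side_info entry of 0 encodes
    a hidden label (an encoding chosen for this sketch).
    """
    # Assumed center direction: a uniformly random unit vector in R^d.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    u = [x / norm for x in g]

    labels, points, side_info = [], [], []
    for _ in range(n):
        v = rng.choice((-1, 1))  # two classes of roughly equal size
        # Point centered at +u or -u, plus isotropic Gaussian noise.
        y = [v * u_k + sigma * rng.gauss(0.0, 1.0) for u_k in u]
        # The label is revealed with probability eta, hidden otherwise.
        s = v if rng.random() < eta else 0
        labels.append(v)
        points.append(y)
        side_info.append(s)
    return u, labels, points, side_info


rng = random.Random(0)
u, labels, points, side_info = sample_dataset(n=200, d=20, sigma=1.0, eta=0.25, rng=rng)
revealed = sum(1 for s in side_info if s != 0)
print(f"{revealed} of {len(labels)} labels revealed")
```

In this sketch the ratio n / d plays the role of the sample-to-dimension ratio, and every revealed entry of side_info agrees with the true label, as in the erasure-type side information described above.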
The task of the statistician is to use the dataset in order to make a prediction about the label of a new (unseen) data point. More formally, we define:
[TABLE]
where , . We are interested in the minimal achievable error in our model, i.e. the Bayes risk:
[TABLE]
where the infimum is taken over all estimators (measurable functions of ).
Our main mathematical achievement is an analytic formula for the Bayes risk in the large limit, see Theorem 1 below. In order to state it, we need to introduce some additional notation. We start with some easy facts about our model.
Oracle risk
Assume that the statistician knows the center of the clusters, i.e. has access to the “oracle” vector . Then the best classification error would be achieved thanks to the simple thresholding rule , where denotes the Euclidean dot product. In this case, the risk is given by:
[TABLE]
where is the standard Gaussian cumulative distribution function. We have of course .
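The thresholding rule can be checked numerically. The sketch below assumes the same reading of the model (each point equals its label times a unit vector u, plus isotropic Gaussian noise of standard deviation sigma). Under that assumption the projection of a point on u equals the label plus a centered Gaussian of standard deviation sigma, so the rule errs with probability 1 - Phi(1/sigma); the code compares this value with a Monte Carlo estimate. Function names and parameter values are illustrative.

```python
import math
import random


def std_normal_cdf(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def oracle_error_rate(d, sigma, n_test, rng):
    """Monte Carlo error of the rule sign(<y, u>) when u is known."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    u = [x / norm for x in g]
    errors = 0
    for _ in range(n_test):
        v = rng.choice((-1, 1))
        y = [v * u_k + sigma * rng.gauss(0.0, 1.0) for u_k in u]
        score = sum(y_k * u_k for y_k, u_k in zip(y, u))
        prediction = 1 if score >= 0 else -1
        errors += prediction != v
    return errors / n_test


sigma = 1.0
predicted = 1.0 - std_normal_cdf(1.0 / sigma)
empirical = oracle_error_rate(d=10, sigma=sigma, n_test=20000, rng=random.Random(1))
print(f"closed form: {predicted:.4f}, simulation: {empirical:.4f}")
```

The dimension d does not affect the oracle error in this sketch, since only the one-dimensional projection on u matters once the center is known.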
Fully supervised case
Another instructive and simple case is the supervised case where . Since all the ’s are known, we can assume wlog that they are all equal to one (multiply each by ). More importantly, if we slightly modify the distribution of by taking , this will not change the results for our model and makes the analysis easier by decorrelating each component. Indeed, denote by (resp. ) the first component of (resp. ) and by the first component of . Then we have scalar noisy observations of the first component of : for , so that we can construct an estimate for by taking the average of the observations. We get:
[TABLE]
Doing this for each component of , we get an estimate of the vector and we now use it to get an estimate of . First define and consider
[TABLE]
and note that as , we have and , so that we get:
[TABLE]
Our main result will actually show that estimating with the sign of is optimal so that we get:
[TABLE]
A similar result was obtained in [15] in a case where the covariance structure of the noise needs also to be estimated by the statistician resulting in a multiplicative term inside the function (see Theorem 3.1 and Corollary 3.3 in [15]).
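The averaging construction of this subsection can be simulated directly. The sketch assumes the same model as before (unit-norm center u, labels in {-1, +1}, isotropic noise of standard deviation sigma) with all labels known: it averages the label-corrected points to estimate the center and classifies fresh points by the sign of the dot product with that estimate. For comparison it evaluates 1 - Phi(sqrt(alpha) / (sigma * sqrt(alpha + sigma^2))), the value that the concentration argument yields under these assumptions, with alpha the ratio of sample size to dimension. Names and parameter values are illustrative, and at finite size the agreement is only approximate.

```python
import math
import random


def std_normal_cdf(x):
    """Standard Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def draw_point(u, sigma, rng):
    """One labeled point from the assumed model."""
    v = rng.choice((-1, 1))
    y = [v * u_k + sigma * rng.gauss(0.0, 1.0) for u_k in u]
    return v, y


def supervised_error_rate(n, d, sigma, n_test, rng):
    """Test error of the plug-in rule sign(<u_hat, y>) with all labels known."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    u = [x / norm for x in g]

    # Average of label-corrected points: an estimate of the center u.
    u_hat = [0.0] * d
    for _ in range(n):
        v, y = draw_point(u, sigma, rng)
        for k in range(d):
            u_hat[k] += v * y[k] / n

    errors = 0
    for _ in range(n_test):
        v, y = draw_point(u, sigma, rng)
        score = sum(a * b for a, b in zip(u_hat, y))
        errors += (1 if score >= 0 else -1) != v
    return errors / n_test


n, d, sigma = 200, 100, 1.0
alpha = n / d
limit = 1.0 - std_normal_cdf(math.sqrt(alpha) / (sigma * math.sqrt(alpha + sigma ** 2)))
empirical = supervised_error_rate(n, d, sigma, n_test=4000, rng=random.Random(2))
print(f"large-size value: {limit:.4f}, simulation: {empirical:.4f}")
```

Letting alpha grow in the comparison value recovers the oracle value 1 - Phi(1/sigma), which matches the intuition that with many samples per dimension the center is estimated almost perfectly.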
Unsupervised case
In this paper, we concentrate on the case where . When , there is no side information and we are in an unsupervised setting studied in [27]. Due to the symmetry of our model, we have because there is no way to guess the right classes . In order to have a well-posed problem, the risk should be redefined as follows:
[TABLE]
Although this measure of performance is not the one studied in [27], we can adapt the argument to show that:
[TABLE]
Main result
We now state our main result:
Theorem 1.
Let us define, for , ,
[TABLE]
Here where . The function admits a unique minimizer on and
[TABLE]
As a by-product of our proof, we show that a very simple algorithm is optimal (asymptotically in ). Namely, we define , the posterior mean of given the observations. Then taking is an optimal estimator of in the sense that
[TABLE]
Of course, from a practical point of view, computing an estimate for is not an easy task (except in the supervised setting) and approximations need to be made.
Figure 1 gives examples of our main results with a comparison of the various settings. The semi-supervised curve corresponds to the formula (5) where a fraction of the data points have labels and are used with all unlabeled data points. The supervised on full curve corresponds to (2) where all the labels are used. The supervised on labeled curve corresponds to (2) with the parameter replaced by and is the best possible performance when only a fraction of the data points having labels are used. The unsupervised curve corresponds to (3) where all the data points are used but without any label. Finally, the oracle curve corresponds to (1) where the centers of the clusters are known (corresponding to the case ). In the left plot of Figure 1, we clearly see that the first labeled data points (i.e. when is small) greatly decrease the risk of semi-supervised learning. This corresponds to the diminishing return of the labeled data. In the right plot of Figure 1, we see that in the high-noise regime, unsupervised learning fails and that its risk decreases as soon as . This phenomenon is known as the BBP phase transition [1, 2, 30]. We see that below this transition, the unlabeled data are of little help as the performance of SSL almost matches the performance of supervised learning on labeled data only. Moreover, after the transition, unsupervised learning quickly reaches the performance of SSL. In other words, the regime most favorable to SSL in terms of noise corresponds precisely to the regime around the BBP phase transition where unsupervised learning is still not very good while supervised learning on labeled data saturates.
3 Related work
The unsupervised version of our problem is the standard Gaussian mixture model used in statistics [16]. In the regime considered here (dimension and number of samples tending to infinity), there are a number of recent works dealing with the clustering problem of Gaussian mixtures. However, a large part of them considers scenarios where or . In the regime where i.e. where the number of observations is proportional to the dimension, spectral clustering has been extensively studied. In this regime it is known that the leading eigenvector of the sample covariance matrix encounters a phase transition [1, 2, 30]: there exists a critical value of the noise intensity below which the leading eigenvector starts to be correlated with the centers of the clusters. Using exact but non-rigorous methods from statistical physics, [6, 24] determine the critical values for and at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. Rigorous results on this model are given in [3] where bounds on the critical values are obtained. The precise thresholds were then determined in [27]. Our analysis builds on the techniques derived in this last reference with two main modifications: additional work is required to compute the classification accuracy (as opposed to the mean squared error) and to incorporate the side information.
To the best of our knowledge, there are far fewer theoretical works dealing with semi-supervised learning in a high-dimensional setting. [9] shows that the error converges exponentially fast in the number of labeled examples if the mixture model is identifiable (see [31] for an extension of these results). [10] studies a mixture model where the estimation problem is essentially reduced to the one of estimating the mixing parameter and shows that the information content of unlabeled examples decreases as classes overlap. [12] shows that unlabeled data can lead to an increase in classification error in a case where the model is incorrect. A similar conclusion is obtained in [21] for linear classifiers defined by convex margin-based surrogate losses. In contrast, our work computes the asymptotic Bayes risk for which unlabeled data can only improve the best achievable performance. More closely related to our work, [14] provides the first information-theoretic tight analysis for inference of latent community structure given a dense graph along with high-dimensional node covariates, correlated with the same latent communities. [25] studies a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data similar to our setting.
In contrast, there are a number of practical works and proposed algorithms for semi-supervised learning based on transductive models [19], graph-based methods [34] or generative modeling [7]; see the surveys [35] and [11]. SSL methods based on training a neural network by adding an additional loss term to ensure consistency regularization are presented in [17], [20], [33]. We refer in particular to the recent work [28] for an overview of these SSL methods (currently the state of the art for SSL on image classification datasets). The MixMatch algorithm introduced in [8] obtains impressive results on all standard image benchmarks. Given these recent improvements, natural questions arise: What is the best achievable performance? To what extent can we generalize those improvements to other domains? We believe that our work is a first step in a theoretical understanding of these questions.
4 Heuristic derivation of the main result
We now present a heuristic derivation of our results, based on the “cavity method” [26] from statistical physics. Let and be the optimal estimators (in terms of mean squared error) for estimating and . A natural hypothesis is to assume that the correlation converges as to some deterministic limit and that .
The conditional expectation is the orthogonal projection (in sense) of the random vector onto the subspace of -measurable random variables. The squared norm of the projection is equal to the scalar product of the vector with its projection : . Assuming that also admits a deterministic limit, this limit is then equal to . We get for large and , . Analogously we have .
We will show below that and obey some fixed point equations that allow us to determine them.
As seen above, if we aim at estimating a label that we did not observe (i.e. ) given and the “oracle” , we compute the sufficient statistic . The estimator that minimizes the probability of error is simply . The one that minimizes the mean squared error (MSE) is which achieves a MSE of
[TABLE]
where we define for and (see Section 7.1 for more details):
[TABLE]
In the case where we do not have access to the oracle , one can still use as a proxy. We repeat the same procedure assuming that is a sufficient statistic for estimating . Although this is not strictly true, we shall see that this leads to the correct fixed point equations for . Compute
[TABLE]
The posterior mean is not expected to depend much on the particular point and therefore on . This gives that the random vectors and are approximately independent. Hence the distribution of is roughly (we recall that ). We get
[TABLE]
in law, where . The best estimator (in terms of MSE) one can then construct using achieves a MSE of
[TABLE]
We assumed that is a sufficient statistic for estimating , therefore . For all the indices such that we have obviously . Hence
[TABLE]
Since we have , we get
[TABLE]
We can do the same reasoning with instead of . We denote by (resp. ) the -th row of the matrix (resp. ), so that we have
[TABLE]
Hence taking the scalar product with gives
[TABLE]
in law, where . Recall that . Making the same assumption as above, the best estimator one can construct using achieves a MSE of
[TABLE]
where for . This leads to
[TABLE]
As shown in Section 7.1, we have
[TABLE]
We conclude that satisfies the following fixed point equations:
[TABLE]
We introduce the following mutual information
[TABLE]
where and are independent. An elementary computation leads to (see Section 7.1)
[TABLE]
By the “I-MMSE” Theorem from [18], is related to :
[TABLE]
Let us compute the derivative of defined by (4), using (11):
[TABLE]
Using (7)-(8), one verifies easily that . By Proposition 1 (proved in Section 7.3), admits a unique critical point on which is its unique minimizer: is therefore the minimizer of .
Suppose we now want to estimate from and we assume, as above, that is a sufficient statistic. As for (6), we have
[TABLE]
in law, where is independent of . The Bayes classifier is then
[TABLE]
hence
[TABLE]
which is the statement of our main Theorem 1 above.
5 Proof sketch
From now on, we simply write instead of . The next theorem computes the limit of the log-likelihood ratio.
Theorem 2.
Conditionally on ,
[TABLE]
Before sketching the proof of this theorem, we show how Theorem 1 follows from Theorem 2 above. The optimal estimator for is given by the sign of the log-likelihood ratio, hence we get:
[TABLE]
Proof.
Let us look at the posterior distribution of given , i.e. From Bayes rule we get
[TABLE]
Let . The following lemma is proved in the supplementary material, see Section 7.3.
Lemma 1.
Let be i.i.d. samples from the posterior distribution of given , independently of everything else. Then
[TABLE]
For we define
[TABLE]
Using Lemma 1, we prove the following lemma in Section 7.3.
Lemma 2.
For , .
Since , we have by Cauchy-Schwarz inequality:
[TABLE]
using Lemma 2 (one can verify easily that the first term of the product above is ). We get , hence
[TABLE]
is independent of and by Lemma 1 we have . Consequently and we conclude:
[TABLE]
where is independent of . ∎
6 Conclusion
We analyzed a simple high-dimensional Gaussian mixture model in a semi-supervised setting and computed the associated Bayes risk. In our model, we are able to compute the best possible accuracy of semi-supervised learning using both labeled and unlabeled data as well as the best possible performance of supervised learning using only the labeled data and of unsupervised learning using all data but without any label. This allows us to quantify the added value of unlabeled data. When the clusters are well separated (probably the most realistic setting), we find that the value of unlabeled data dominates. Labeled data can almost be ignored, as unsupervised learning achieves roughly the same performance as semi-supervised learning. Nevertheless, using a few labeled data points is often very helpful in practice, as shown by the recent MixMatch algorithm [8].
We believe our main Theorem 1 gives new insights for semi-supervised learning and we designed our model with a focus on simplicity. However, our proof technique is very general and can handle much more complex models. For example, we can deal with classes of different sizes by changing the prior of . Another extension for which our proof carries over consists in modifying the channel for the side information. Here, we considered the erasure channel corresponding to the standard SSL setting, but our proof will still work for other channels such as the binary symmetric channel or the Z channel, corresponding to a setting with noisy labels.
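The three side-information channels mentioned above can be sketched as follows. This is an illustration only: labels are assumed to take values in {-1, +1}, an erased label is encoded as 0, and the parameter names eta and flip_prob are hypothetical. For the Z channel, which of the two labels is subject to noise is a choice made for this sketch.

```python
import random


def erasure_channel(v, eta, rng):
    """Reveal the label with probability eta, otherwise output 0 (erased)."""
    return v if rng.random() < eta else 0


def binary_symmetric_channel(v, flip_prob, rng):
    """Always output a label, flipped with probability flip_prob."""
    return -v if rng.random() < flip_prob else v


def z_channel(v, flip_prob, rng):
    """Asymmetric noise: -1 is always transmitted correctly, while +1 is
    flipped with probability flip_prob (a convention chosen here)."""
    if v == 1 and rng.random() < flip_prob:
        return -1
    return v


rng = random.Random(3)
labels = [rng.choice((-1, 1)) for _ in range(10)]
channels = [
    ("erasure", erasure_channel, 0.3),
    ("binary symmetric", binary_symmetric_channel, 0.1),
    ("Z", z_channel, 0.1),
]
print("labels", labels)
for name, channel, param in channels:
    print(name, [channel(v, param, rng) for v in labels])
```

With the erasure channel every revealed label is correct, whereas the other two channels always output a label that may be wrong, which is the sense in which they model noisy labels.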
7 Supplementary material
7.1 Gaussian channel
We give here some easy computations for the Gaussian channel:
[TABLE]
where is independent of .
We first consider the case where . We define . Since we are dealing with Gaussian random variables, is simply the orthogonal projection of on :
[TABLE]
Hence, we have
[TABLE]
Thanks to the I-MMSE relation [18], we have . For , and are independent: , so that we get
[TABLE]
We now consider the case where . We define . Recall that
[TABLE]
And here, we have
[TABLE]
Hence, we have
[TABLE]
Thanks to the I-MMSE relation, we have:
[TABLE]
so that we have .
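The I-MMSE relation used in this section can be checked numerically. The sketch below assumes a scalar channel Y = sqrt(gamma) X + Z with Z standard Gaussian and independent of X, mutual information measured in nats, and two input laws: X standard Gaussian, and X uniform on {-1, +1}. It relies on the classical closed forms for these two cases (for the binary input, expectations over Z are computed by a simple quadrature) and compares a finite-difference derivative of the mutual information with half the MMSE. The function names and the value gamma = 1.5 are illustrative.

```python
import math


def gaussian_expectation(f, half_width=10.0, steps=4000):
    """E[f(Z)] for Z ~ N(0, 1), by the trapezoidal rule on a wide interval."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def info_gaussian(gamma):
    """Mutual information for a standard Gaussian input."""
    return 0.5 * math.log(1.0 + gamma)


def mmse_gaussian(gamma):
    """Minimal mean squared error for a standard Gaussian input."""
    return 1.0 / (1.0 + gamma)


def info_binary(gamma):
    """Mutual information for an input uniform on {-1, +1}."""
    root = math.sqrt(gamma)
    return gamma - gaussian_expectation(lambda z: math.log(math.cosh(gamma + root * z)))


def mmse_binary(gamma):
    """Minimal mean squared error for an input uniform on {-1, +1}."""
    root = math.sqrt(gamma)
    return 1.0 - gaussian_expectation(lambda z: math.tanh(gamma + root * z))


def derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


gamma = 1.5
cases = [
    ("gaussian input", info_gaussian, mmse_gaussian),
    ("binary input", info_binary, mmse_binary),
]
for name, info, mmse in cases:
    print(name, derivative(info, gamma), 0.5 * mmse(gamma))
```

For both input laws the two printed numbers agree up to discretization error, which is the content of the relation: the derivative of the mutual information with respect to the signal-to-noise ratio equals half the MMSE.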
7.2 Convergence of the mutual information
Theorem 3.
For all , ,
[TABLE]
Further, this minimum is achieved at a unique point and
[TABLE]
where is a sample from the posterior distribution of given , independently of everything else.
Proof.
The limit (13) was proved in [27] in the case . The proof can however be straightforwardly adapted to the case and leads to
[TABLE]
where . The supremum in can be easily computed, leading to:
[TABLE]
This proves (13). The fact that admits a unique minimizer comes from Proposition 1.
From the limit of the mutual information, one gets the limits of minimal mean squared errors (MMSE) using the “I-MMSE” relation [18]:
[TABLE]
Let be a sample from the posterior distribution of given , independently of everything else. Then we deduce
[TABLE]
In order to show that it remains to show that
[TABLE]
This can be done (as in [5]) by adding a small amount of additional side-information to the model of the form , where the entries of the tensor are i.i.d. standard Gaussian: . We then apply the I-MMSE relation with respect to to obtain (15). ∎
7.3 Technical lemmas
We now give the proof of Lemma 1.
Proof.
Notice that, by Bayes rule, we have . So we have by (14) . Now, by Jensen’s inequality:
[TABLE]
Since and , this leads to
[TABLE]
∎
We now give a proof of Lemma 2.
Proof.
In order to prove that , it suffices to show that . Let be i.i.d. samples from the posterior distribution of given , independently of everything else. Using Lemma 1, we compute:
[TABLE]
Integrating with respect to only, we get
[TABLE]
where the last limit follows from Lemma 1. Following the same steps we compute:
[TABLE]
and
[TABLE]
Since the three limits above are the same, the lemma is proved. ∎
Proposition 1.
For all and all , the function admits a unique critical point which is its unique minimizer on .
Proof.
Recall that , where . A computation gives . We define,
[TABLE]
where . Hence
[TABLE]
The critical points of are solutions of
[TABLE]
As proved in [13, Lemma 6.1], the function is concave. This gives that is concave. Since is upper-bounded by and , we get that admits a unique fixed point on . The function admits therefore a unique critical point on which is necessarily a minimum since and ∎
References
- [1] J. Baik, G. B. Arous, S. Péché, et al. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.
- [2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006.
- [3] J. Banks, C. Moore, R. Vershynin, N. Verzelen, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. IEEE Transactions on Information Theory, 64(7):4872–4894, 2018.
- [4] J. Barbier, M. Dia, N. Macris, F. Krzakala, T. Lesieur, L. Zdeborová, et al. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Advances in Neural Information Processing Systems, pages 424–432, 2016.
- [5] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019.
- [6] N. Barkai and H. Sompolinsky. Statistical mechanics of the maximum-likelihood density estimation. Physical Review E, 50(3):1766, 1994.
- [7] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
- [8] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel. MixMatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.
