Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models
Xinwei Zhang, Zhiqiang Tan

TL;DR
This paper introduces a semi-supervised logistic learning approach utilizing exponential tilt mixture models, enhancing classification accuracy by effectively leveraging both labeled and unlabeled data.
Contribution
It develops a novel semi-supervised logistic method based on exponential tilt models, with new objective functions, regularized estimation, and interpretable EM algorithms.
Findings
The proposed methods outperform existing semi-supervised classifiers, most clearly when class proportions differ between labeled and unlabeled data.
Theoretical properties such as Fisher consistency are established.
Numerical experiments demonstrate improved prediction accuracy.
Abstract
Consider semi-supervised learning for classification, where both labeled and unlabeled data are available for training. The goal is to exploit both datasets to achieve higher prediction accuracy than using labeled data alone. We develop a semi-supervised logistic learning method based on exponential tilt mixture models, by extending a statistical equivalence between logistic regression and exponential tilt modeling. We study maximum nonparametric likelihood estimation and derive novel objective functions which are shown to be Fisher consistent. We also propose regularized estimation and construct simple and highly interpretable EM algorithms. Finally, we present numerical results which demonstrate the advantage of the proposed methods compared with existing methods.

| Homo Prop | RLR | ER | pSLR | dSLR | SVM | TSVM |
|---|---|---|---|---|---|---|
| AUSTRA | 85.37 2.00 | 85.50 1.94 | 85.43 2.07 | 85.33 2.03 | 85.37 1.96 | 85.15 1.79 |
| BCW | 95.76 1.04 | 95.64 1.08 | 95.71 1.07 | 95.80 1.07 | 96.13 1.04 | 96.44 0.92 |
| GERMAN | 72.16 2.69 | 72.22 2.60 | 72.39 2.71 | 72.12 2.68 | 70.65 2.85 | 69.01 3.94 |
| HEART | 80.94 3.56 | 81.04 4.23 | 81.46 3.41 | 80.94 3.78 | 80.00 4.35 | 80.36 5.00 |
| IONO | 84.38 2.29 | 84.17 2.16 | 85.00 1.21 | 84.38 2.29 | 83.92 2.51 | 83.33 3.18 |
| LIVER | 65.57 4.37 | 65.83 4.20 | 66.35 4.65 | 66.22 4.22 | 67.87 3.22 | 64.83 5.00 |
| PIMA | 74.71 2.99 | 75.06 3.07 | 75.02 2.90 | 74.71 3.04 | 74.65 2.69 | 72.58 3.97 |
| SPAM | 87.98 2.70 | 88.04 2.95 | 87.90 2.49 | 87.94 2.70 | 85.74 3.90 | 87.04 5.37 |
| VEHICLE | 93.10 2.73 | 92.59 2.80 | 92.45 2.83 | 93.24 2.82 | 92.38 3.31 | 93.03 3.39 |
| VOTES | 93.66 2.55 | 93.59 2.59 | 93.66 2.32 | 93.45 2.59 | 94.17 2.57 | 94.03 2.92 |
| WDBC | 95.92 1.65 | 95.61 1.70 | 95.89 1.48 | 95.92 1.65 | 95.67 1.55 | 96.06 1.31 |
| BCI | 66.50 4.06 | 65.83 3.80 | 65.86 4.89 | 65.86 4.40 | 68.46 5.01 | 67.48 5.20 |
| COIL | 78.95 3.15 | 78.96 3.24 | 79.07 3.89 | 78.70 3.42 | 80.10 2.53 | 81.39 2.38 |
| DIGIT1 | 89.90 1.11 | 89.29 2.70 | 90.00 1.18 | 89.87 1.16 | 89.30 1.33 | 89.73 1.45 |
| USPS | 85.39 2.38 | 85.62 2.26 | 85.97 1.91 | 85.54 2.45 | 85.50 2.13 | 84.71 2.27 |
| Average accuracy | 83.35 | 83.27 | 83.48 | 83.33 | 83.34 | 83.01 |
| # within 1% of highest | 12/15 | 12/15 | 12/15 | 12/15 | 10/15 | 7/15 |

| No | Data | # of obs | # of positive | # of negative | % of positive | feature dim |
|---|---|---|---|---|---|---|
| 1 | AUSTRA | 690 | 383 | 307 | 55.51 | 14 |
| 2 | BCW | 683 | 444 | 239 | 65.01 | 9 |
| 3 | GERMAN | 1000 | 700 | 300 | 70.00 | 24 |
| 4 | HEART | 297 | 137 | 160 | 46.13 | 13 |
| 5 | IONO | 331 | 126 | 225 | 38.07 | 34 |
| 6 | LIVER | 345 | 145 | 200 | 42.03 | 6 |
| 7 | PIMA | 768 | 500 | 268 | 65.10 | 8 |
| 8 | SPAM | 4601 | 2788 | 1813 | 60.60 | 57 |
| 9 | VEHICLE | 435 | 218 | 217 | 50.11 | 18 |
| 10 | VOTES | 435 | 257 | 168 | 59.08 | 16 |
| 11 | WDBC | 569 | 357 | 212 | 62.74 | 31 |
| 12 | BCI | 400 | 200 | 200 | 50.00 | 117 |
| 13 | COIL | 1500 | 750 | 750 | 50.00 | 241 |
| 14 | DIGIT1 | 1500 | 766 | 734 | 51.07 | 241 |
| 15 | USPS | 1500 | 1200 | 300 | 80.00 | 241 |

| Homo Prop | RLRa | ERa | pSLRa | dSLRa | SVMa |
|---|---|---|---|---|---|
| AUSTRA | 85.48 1.81 | 85.74 1.92 | 85.39 1.89 | 85.50 1.79 | 85.22 2.05 |
| BCW | 96.31 1.08 | 96.09 1.07 | 96.22 1.16 | 96.29 1.07 | 96.62 1.00 |
| GERMAN | 66.40 3.09 | 66.73 3.02 | 67.15 2.84 | 66.23 3.12 | 63.89 7.30 |
| HEART | 80.26 4.20 | 80.73 4.39 | 80.94 3.48 | 80.21 4.17 | 79.69 4.37 |
| IONO | 82.08 2.74 | 83.79 3.00 | 81.88 3.22 | 82.08 2.74 | 80.67 3.73 |
| LIVER | 62.91 6.11 | 63.04 6.39 | 62.91 6.27 | 63.22 6.32 | 64.83 3.40 |
| PIMA | 72.97 3.72 | 73.01 3.51 | 72.93 3.71 | 72.77 3.85 | 72.40 2.55 |
| SPAM | 88.52 2.38 | 89.06 2.97 | 88.60 2.75 | 88.44 2.34 | 86.68 4.28 |
| VEHICLE | 93.10 2.73 | 92.59 2.80 | 92.45 2.83 | 93.24 2.82 | 92.45 3.60 |
| VOTES | 93.59 2.56 | 93.62 2.37 | 93.48 2.44 | 93.48 2.66 | 94.00 2.57 |
| WDBC | 95.28 1.73 | 95.58 1.70 | 95.58 1.48 | 95.19 1.78 | 95.75 2.11 |
| BCI | 66.50 4.06 | 65.83 3.80 | 65.86 4.89 | 65.86 4.40 | 69.10 5.10 |
| COIL | 78.95 3.15 | 78.96 3.24 | 79.07 3.89 | 78.70 3.42 | 80.22 2.62 |
| DIGIT1 | 89.77 1.11 | 89.32 2.67 | 90.07 1.16 | 89.88 1.10 | 89.47 1.40 |
| USPS | 81.39 4.31 | 81.91 3.99 | 81.06 3.95 | 81.68 4.31 | 80.63 3.88 |
| Average accuracy | 82.23 | 82.40 | 82.24 | 82.18 | 82.11 |
| # within 1% of highest | 11/15 | 12/15 | 11/15 | 11/15 | 10/15 |

| Homo Prop | RLR | ER | pSLR | dSLR | SVM |
|---|---|---|---|---|---|
| AUSTRA | 91.22 1.70 | 91.52 1.46 | 91.23 1.62 | 91.21 1.70 | 91.09 2.03 |
| BCW | 99.18 0.52 | 99.23 0.47 | 99.15 0.55 | 99.18 0.52 | 99.30 0.38 |
| GERMAN | 72.20 3.12 | 72.49 2.99 | 72.46 2.99 | 72.20 3.13 | 69.44 10.22 |
| HEART | 89.03 3.06 | 88.96 3.34 | 88.96 3.13 | 89.05 3.06 | 87.90 3.67 |
| IONO | 86.36 0.81 | 85.67 4.99 | 86.66 0.61 | 86.36 0.81 | 83.03 5.86 |
| LIVER | 67.98 6.07 | 67.90 6.11 | 68.24 6.07 | 68.37 6.05 | 70.91 2.58 |
| PIMA | 79.88 3.86 | 80.03 3.84 | 79.89 3.82 | 79.83 3.83 | 79.64 3.36 |
| SPAM | 93.83 2.11 | 94.26 2.07 | 93.58 2.24 | 93.79 2.14 | 92.36 3.21 |
| VEHICLE | 97.28 1.75 | 96.19 2.11 | 96.21 2.13 | 97.28 1.77 | 95.86 2.97 |
| VOTES | 98.10 1.13 | 98.03 1.08 | 98.07 1.14 | 98.05 1.17 | 97.97 1.25 |
| WDBC | 99.12 0.81 | 99.10 0.82 | 99.13 0.77 | 99.08 0.85 | 99.14 0.78 |
| BCI | 73.17 4.33 | 72.82 3.98 | 72.45 4.69 | 72.42 4.25 | 76.11 5.18 |
| COIL | 85.18 3.63 | 85.22 3.84 | 85.48 3.83 | 85.07 3.55 | 84.86 2.72 |
| DIGIT1 | 96.69 0.70 | 96.32 1.51 | 96.70 0.69 | 96.69 0.70 | 96.40 0.76 |
| USPS | 86.12 2.89 | 86.28 2.82 | 86.07 2.90 | 86.10 2.87 | 83.47 4.63 |
| Average AUC | 87.69 | 87.60 | 87.62 | 87.65 | 87.17 |
| # within 1% of highest | 13/15 | 12/15 | 12/15 | 13/15 | 9/15 |

| Homo Prop | RLR | ER | pSLR | dSLR | SVM | TSVM |
|---|---|---|---|---|---|---|
| AUSTRA | 82.37 2.47 | 81.80 2.56 | 81.76 3.05 | 81.35 3.62 | 80.57 3.87 | 80.46 4.84 |
| BCW | 94.84 2.08 | 94.76 2.72 | 95.82 1.65 | 94.62 2.11 | 94.20 2.22 | 96.67 0.81 |
| GERMAN | 69.50 2.22 | 69.76 2.36 | 69.83 2.10 | 69.47 2.40 | 68.03 4.02 | 63.89 5.09 |
| HEART | 79.48 3.90 | 79.01 4.20 | 78.80 3.74 | 79.27 3.97 | 78.54 4.44 | 76.72 4.22 |
| IONO | 77.21 5.97 | 75.33 6.75 | 76.04 6.77 | 76.38 6.12 | 77.04 6.51 | 76.71 6.98 |
| LIVER | 57.70 6.40 | 56.91 5.20 | 56.78 5.29 | 57.70 6.25 | 60.17 6.63 | 58.04 8.68 |
| PIMA | 68.09 3.61 | 67.66 3.67 | 67.03 3.73 | 67.83 3.68 | 67.60 3.62 | 65.25 4.95 |
| SPAM | 82.76 3.35 | 83.10 3.67 | 83.74 2.57 | 82.68 3.28 | 81.28 3.77 | 85.40 2.81 |
| VEHICLE | 73.93 6.71 | 73.90 7.06 | 76.00 7.20 | 74.31 6.32 | 79.21 5.01 | 76.28 8.24 |
| VOTES | 92.10 3.68 | 91.34 3.63 | 91.93 3.64 | 91.90 3.63 | 91.62 3.36 | 92.07 3.28 |
| WDBC | 92.17 3.21 | 89.94 7.91 | 92.28 3.83 | 91.83 3.37 | 91.39 2.95 | 92.56 2.51 |
| BCI | 56.88 4.98 | 54.96 4.97 | 55.45 5.32 | 56.32 5.08 | 56.02 5.26 | 54.85 4.67 |
| COIL | 60.62 5.47 | 58.63 6.85 | 57.18 7.21 | 61.29 5.25 | 63.65 6.64 | 64.78 6.83 |
| DIGIT1 | 82.04 3.83 | 82.09 3.83 | 83.05 4.13 | 82.08 3.78 | 80.39 4.82 | 84.74 3.29 |
| USPS | 81.15 2.67 | 81.13 2.60 | 80.89 2.66 | 81.34 2.29 | 81.62 2.96 | 77.57 4.15 |
| Average accuracy | 76.72 | 76.02 | 76.44 | 76.56 | 76.76 | 76.40 |
| # within 1% of highest | 9/15 | 6/15 | 7/15 | 8/15 | 8/15 | 7/15 |

| Homo Prop | RLRa | ERa | pSLRa | dSLRa | SVMa |
|---|---|---|---|---|---|
| AUSTRA | 81.98 2.37 | 81.80 2.58 | 81.54 2.91 | 81.22 4.09 | 80.87 3.38 |
| BCW | 95.53 1.81 | 95.24 2.37 | 95.96 1.43 | 95.36 1.90 | 95.96 1.39 |
| GERMAN | 59.94 5.16 | 58.50 7.01 | 57.16 6.41 | 61.91 4.19 | 47.36 13.28 |
| HEART | 79.64 4.12 | 78.85 3.99 | 79.17 3.87 | 79.53 4.17 | 79.06 4.39 |
| IONO | 78.67 5.44 | 77.08 6.41 | 75.79 5.73 | 77.29 6.08 | 69.17 17.61 |
| LIVER | 54.91 7.69 | 54.96 7.99 | 55.78 7.83 | 55.65 8.35 | 52.91 10.55 |
| PIMA | 65.51 4.41 | 64.79 5.17 | 65.43 5.06 | 66.04 4.35 | 56.05 16.86 |
| SPAM | 82.98 2.55 | 83.42 3.15 | 84.18 2.26 | 83.14 2.74 | 82.46 2.82 |
| VEHICLE | 74.00 6.49 | 73.97 6.56 | 76.10 6.69 | 74.86 5.84 | 76.55 11.76 |
| VOTES | 91.34 3.74 | 90.72 3.81 | 91.24 3.53 | 91.28 3.75 | 90.38 4.18 |
| WDBC | 92.78 3.05 | 89.92 8.05 | 92.58 4.26 | 92.19 3.37 | 91.83 2.97 |
| BCI | 56.99 4.25 | 55.49 4.89 | 55.79 5.39 | 56.32 4.63 | 50.68 8.15 |
| COIL | 60.39 5.80 | 59.83 6.48 | 58.36 7.09 | 60.49 5.98 | 58.28 12.80 |
| DIGIT1 | 82.26 3.95 | 82.23 3.82 | 83.07 4.24 | 82.08 3.86 | 80.97 3.71 |
| USPS | 75.46 7.80 | 74.05 9.72 | 70.75 8.30 | 76.22 6.98 | 58.90 22.80 |
| Average accuracy | 75.49 | 74.72 | 74.86 | 75.57 | 71.43 |
| # within 1% of highest | 12/15 | 8/15 | 10/15 | 12/15 | 5/15 |

| Homo Prop | RLR | ER | pSLR | dSLR | SVM |
|---|---|---|---|---|---|
| AUSTRA | 89.39 2.09 | 89.27 2.26 | 88.91 2.23 | 89.10 2.23 | 88.78 3.04 |
| BCW | 99.15 0.40 | 98.97 0.82 | 99.13 0.43 | 99.19 0.36 | 99.09 0.87 |
| GERMAN | 61.80 5.11 | 60.80 5.14 | 59.92 5.50 | 60.76 5.52 | 48.83 15.78 |
| HEART | 86.56 3.57 | 86.48 3.64 | 86.38 3.66 | 86.55 3.52 | 86.60 4.60 |
| IONO | 78.52 6.41 | 78.03 6.92 | 75.29 9.21 | 75.54 9.25 | 69.83 19.66 |
| LIVER | 58.47 9.90 | 57.64 10.62 | 59.06 10.52 | 59.35 10.42 | 55.93 14.49 |
| PIMA | 69.94 8.03 | 69.87 8.38 | 69.33 8.48 | 70.03 8.09 | 58.31 22.00 |
| SPAM | 90.96 2.14 | 91.01 2.77 | 90.31 2.73 | 90.89 2.18 | 89.99 2.84 |
| VEHICLE | 79.24 6.76 | 79.50 6.94 | 81.45 6.18 | 80.71 6.46 | 80.60 17.10 |
| VOTES | 96.88 2.07 | 97.09 1.51 | 96.96 1.89 | 96.91 2.03 | 97.03 1.79 |
| WDBC | 97.98 1.38 | 96.87 4.01 | 97.87 1.46 | 97.94 1.43 | 97.49 1.53 |
| BCI | 60.30 5.44 | 57.76 5.74 | 58.30 6.41 | 60.75 5.50 | 50.98 10.58 |
| COIL | 64.29 5.79 | 63.11 5.74 | 62.40 5.88 | 64.61 5.45 | 60.35 14.91 |
| DIGIT1 | 90.85 3.84 | 90.86 3.87 | 91.26 4.09 | 90.85 3.83 | 90.16 3.85 |
| USPS | 74.09 6.85 | 73.46 6.86 | 73.93 6.44 | 74.05 6.82 | 60.99 23.77 |
| Average AUC | 79.89 | 79.38 | 79.37 | 79.82 | 75.66 |
| # within 1% of highest | 13/15 | 9/15 | 10/15 | 12/15 | 6/15 |
Xinwei Zhang & Zhiqiang Tan
Department of Statistics, Rutgers University, USA
1 Introduction
Semi-supervised learning for classification involves exploiting a large amount of unlabeled data and a relatively small amount of labeled data to build better classifiers. This approach can potentially be used to achieve higher accuracy, with a limited budget for obtaining labeled data. Various methods have been proposed, including expectation-maximization (EM) algorithms, transductive support vector machines (SVMs), and regularized methods (e.g., Chapelle et al. 2006; Zhu 2008).
For supervised classification, there is a range of objective functions which are Fisher consistent in the following sense: optimization of the population, nonparametric version of a loss function leads to the true conditional probability function of labels given features as for the logistic loss, or to the Bayes classifier as for the hinge loss (Lin 2002; Bartlett et al. 2006). In contrast, a perplexing issue we notice for semi-supervised classification is that existing objective functions are in general not Fisher consistent, except in the degenerate case where unlabeled data are ignored and only labeled data are used. Examples include the objective functions in transductive SVMs (Vapnik 1998; Joachims 1999) and various regularized methods (Grandvalet & Bengio 2005; Mann & McCallum 2007; Krishnapuram et al. 2005). The lack of Fisher consistency may contribute to the unstable performance of existing semi-supervised classifiers (e.g., Li & Zhou 2015). Another restriction in existing methods is that the class proportions in labeled and unlabeled data are typically assumed to be the same.
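As a brief illustration of Fisher consistency for the logistic loss, the standard pointwise argument can be written out as follows. The notation is introduced here only for this sketch and is not taken from the paper: $\eta(x) = P(y = 1 \mid x)$ is the true conditional probability and $p(x)$ is any candidate probability function.

```latex
% Assumed notation for illustration: \eta(x) = P(y = 1 | x), p(x) a candidate probability function.
\begin{align*}
L(p) &= E\big[\, y \log p(x) + (1 - y) \log\{1 - p(x)\} \,\big] \\
     &= E\big[\, \eta(x) \log p(x) + \{1 - \eta(x)\} \log\{1 - p(x)\} \,\big], \\
\frac{\partial}{\partial p}\Big[ \eta \log p + (1 - \eta) \log(1 - p) \Big]
     &= \frac{\eta}{p} - \frac{1 - \eta}{1 - p} = 0
     \quad\Longleftrightarrow\quad p = \eta .
\end{align*}
```

Because the integrand is concave in $p$ for each fixed $x$, the population objective is maximized by taking $p(x) = \eta(x)$ pointwise, which is the sense of Fisher consistency described above.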
We develop a semi-supervised extension of logistic regression based on exponential tilt mixture models (Qin 1999; Zou et al. 2002; Tan 2009), without restricting the class proportions in the unlabeled data to be the same as in the labeled data. The development is motivated by a statistical equivalence between logistic regression for the conditional probability of a label given features and exponential tilt modeling for the density ratio between the feature distributions within different labels (Anderson 1972; Prentice & Pyke 1979). Our work involves two main contributions: (i) we derive novel objective functions which are shown not only to be Fisher consistent but also to lead to asymptotically more efficient estimation than that based on labeled data only, and (ii) we propose regularized estimation and construct computationally and conceptually desirable EM algorithms. In numerical experiments, our methods achieve a substantial advantage over existing methods when the class proportions in unlabeled data differ from those in labeled data. A possible explanation is that while the class proportions in unlabeled data are estimated as unknown parameters in our methods, they are implicitly assumed to be the same as in labeled data for existing methods including transductive SVMs (Joachims 1999) and entropy regularization (Grandvalet & Bengio 2005).
A simple, informative example is provided in the Supplement (Section II) to highlight the comparison between the new and existing methods mentioned above.
2 Background: logistic regression and exponential tilt model
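For supervised classification, the training data consist of a sample $\{(y_i, x_i): i = 1, \ldots, n\}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{0, 1\}$ represent a feature vector and an associated label respectively. Consider a logistic regression model

$$P(y = 1 \mid x) = \frac{\exp(\beta_0^c + \beta^{\mathrm{T}} x)}{1 + \exp(\beta_0^c + \beta^{\mathrm{T}} x)}, \tag{1}$$

where $\beta$ is a coefficient vector associated with $x$, and $\beta_0^c$ is an intercept, with superscript c indicating classification or conditional probability of $y$ given $x$. The maximum likelihood estimator (MLE) of $(\beta_0^c, \beta)$ is defined as a maximizer of the log (conditional) likelihood:

$$\sum_{i=1}^{n} \Big[ y_i (\beta_0^c + \beta^{\mathrm{T}} x_i) - \log\{1 + \exp(\beta_0^c + \beta^{\mathrm{T}} x_i)\} \Big]. \tag{2}$$

In general, nonlinear functions of $x$ can be used in place of $x$, and a penalty term can be incorporated into the log-likelihood such as the ridge penalty $\|\beta\|_2^2$ or the squared norm of a reproducing kernel Hilbert space of functions of $x$. We discuss these issues later in Sections 3.3 and 6.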
Interestingly, logistic regression on can be made equivalent to an exponential tilt model on (Anderson 1972; Prentice & Pyke 1979; Qin 1998). Denote by or the conditional distribution or respectively, and . By the Bayes rule, model (1) is equivalent to the exponential tilt model
[TABLE]
where denotes the density ratio between and with respect to a dominating measure, and . Model (3) is explicitly a semi-parametric model, where is an infinite-dimensional parameter and are finite-dimensional parameters. In fact, logistic model (1) is also semi-parametric, where the marginal distribution of is an infinite-dimensional parameter, and are finite-dimensional parameters. Furthermore, the MLE in model (1) can be related to the following estimator in model (3) by the method of nonparametric likelihood (Kiefer & Wolfowitz 1956) or empirical likelihood (Owen 2001). Formally, are defined as a maximizer of the log-likelihood,
[TABLE]
over all possible such that is a probability measure supported on the pooled data with . Analytically, it can be shown that , , where . See Qin (1998) and references therein.
By the foregoing discussion, we see that there are two statistically distinct but equivalent approaches for supervised classification: logistic regression and exponential tilt modeling. It is this relationship that we aim to exploit in developing a new method for semi-supervised classification.
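The Bayes-rule step behind this equivalence can be sketched as a short worked derivation. The symbols below are assumptions made for illustration and may differ from the paper's own notation: $\pi = P(y = 1)$ is the marginal label probability, $g_0$ and $g_1$ are the densities of the features given $y = 0$ and $y = 1$, $\beta_0^c$ is the logistic intercept, and $\beta$ is the coefficient vector.

```latex
% Assumed notation for illustration: \pi = P(y = 1); g_0, g_1 are densities of x given y = 0, 1.
\begin{align*}
\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}
  &= \frac{\pi\, g_1(x)}{(1 - \pi)\, g_0(x)}
   = \exp(\beta_0^c + \beta^{\mathrm{T}} x)
  && \text{(Bayes rule and the logistic model)} \\
\frac{g_1(x)}{g_0(x)}
  &= \exp\!\Big( \beta_0^c - \log\frac{\pi}{1 - \pi} + \beta^{\mathrm{T}} x \Big)
  && \text{(exponential tilt form)}
\end{align*}
```

Under these assumptions the slope vector is shared by the two formulations, while the intercepts differ by the log odds of the marginal label probability.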
3 Theory and methods
For semi-supervised classification, the training data consist of a labeled sample and an unlabeled sample , for which the associated labels are unobserved. Typically for existing methods including transductive SVMs, the two samples and are assumed to be from a common population of . However, we allow that and may be drawn from different populations, with the same conditional distribution , but possibly different marginal probabilities and .
3.1 Exponential tilt mixture model
Although it seems difficult at first glance to extend logistic model (1) for semi-supervised learning, we realize that both the labeled sample and the unlabeled sample can be taken into account by a natural extension of the exponential tilt model (3), called an exponential tilt mixture (ETM) model (Qin 1999; Zou et al. 2002; Tan 2009). Denote
[TABLE]
An exponential tilt mixture model for the three samples postulates that
[TABLE]
where or represents the conditional distribution of given or respectively in both the labeled and unlabeled data such that
[TABLE]
and is the proportion of underlying the unlabeled data. While Eqs (5)–(6) merely give definitions of and , Eq (7) says that the feature distribution in the unlabeled sample is a mixture of and , which follows from the structural assumption that the conditional distribution is invariant between the labeled and unlabeled samples. Eq (8) imposes a functional restriction on the density ratio between and , similarly as in (3).
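To make the structure of the ETM model concrete, the following is a minimal simulation sketch, not taken from the paper. It assumes two Gaussian class distributions with a common identity covariance, for which the log density ratio is exactly linear in the features, and draws the unlabeled sample as a mixture whose proportion (`rho=0.8`) differs from the 1:1 labeled sample. The means `mu0`, `mu1`, the sample sizes, and the helper names are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])  # hypothetical class means

def log_normal_pdf(x, mu):
    # Log density of N(mu, I) evaluated at each row of x.
    d = x - mu
    return -0.5 * np.sum(d * d, axis=1) - 0.5 * x.shape[1] * np.log(2 * np.pi)

# For N(mu0, I) versus N(mu1, I), the log density ratio is beta0 + x @ beta.
beta = mu1 - mu0
beta0 = -0.5 * (mu1 @ mu1 - mu0 @ mu0)

def sample_etm(n0, n1, m, rho):
    # Labeled samples from each class, plus an unlabeled mixture sample.
    x0 = rng.normal(size=(n0, 2)) + mu0
    x1 = rng.normal(size=(n1, 2)) + mu1
    latent = rng.random(m) < rho  # unobserved labels of the unlabeled sample
    xu = rng.normal(size=(m, 2)) + np.where(latent[:, None], mu1, mu0)
    return x0, x1, xu, latent

x0, x1, xu, latent = sample_etm(50, 50, 2000, rho=0.8)

# Largest deviation between the exact log density ratio and the linear tilt.
tilt_gap = np.max(np.abs(
    log_normal_pdf(xu, mu1) - log_normal_pdf(xu, mu0) - (beta0 + xu @ beta)
))
```

In this toy setting the labeled data are balanced while roughly 80% of the unlabeled points come from the second component, which is the kind of mismatch the mixture proportion is meant to absorb.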
The ETM model, defined by (5)–(8), is a semi-parametric model, with an infinite-dimensional parameter and finite-dimensional parameters and . We briefly summarize maximum nonparametric likelihood estimation previously studied (Qin 1999; Zou et al. 2002; Tan 2009). For notational convenience, rewrite the sample as , where , , and . Eqs (5)–(7) can be expressed as
[TABLE]
where , , and . For any fixed , the average profile log-likelihood of is defined as with
[TABLE]
over all possible which is a probability measure supported on the pooled data with . Denote
[TABLE]
which can be easily shown to be concave in and convex in . Then Proposition 1 in Tan (2009) leads to the following result.
Lemma 1.
The average profile log-likelihood of can be determined as , where is a minimizer of over , satisfying the stationary condition (free of )
[TABLE]
The maximum likelihood estimator of is then defined by maximizing the profile log-likelihood, that is, . From Lemma 1, we notice that the estimators jointly solve the saddle-point problem:
[TABLE]
Large sample theory of has been studied in Qin (1999) under standard regularity conditions as and with some constant for . The theory shows the existence of a local maximizer of , which is consistent and asymptotically normal provided the ETM model (5)–(8) is correctly specified. However, there remain subtle questions. It seems unclear whether the population version of the average profile log-likelihood attains a global maximum at the true values of under a correctly specified ETM model. Moreover, what property can be deduced for under a misspecified ETM model?
3.2 Semi-supervised logistic regression
We derive a new classification model with parameters for the three samples such that an MLE of in the new model coincides with an MLE in the ETM model, and vice versa. Let if and if . Consider a conditional probability model for predicting the label from :
[TABLE]
where , which ensures that . The model, defined by (12)–(14), will be called a semi-supervised logistic regression (SLR) model. The average log-likelihood function of with the data in model (12)–(14) can be written, up to an additive constant free of , as
[TABLE]
Proposition 1.
If and only if is a local (or respectively global) maximizer of the average log-likelihood in SLR model (12)–(14), then it is a local (or global) maximizer of the average profile log-likelihood in ETM model (5)–(8).
Proposition 1 shows an equivalence between maximum nonparametric likelihood estimation in ETM model (5)–(8) and usual maximum likelihood estimation in SLR model (12)–(14), even though the objective functions and are not equivalent. This differs from the equivalence between logistic regression (1) and exponential tilt model (3) with labeled data only, where the log-likelihood (2) and the profile log-likelihood from (4) are equivalent (Prentice & Pyke 1979). From another angle, this result says that saddle-point problem (11) can be equivalently solved by directly maximizing . This transformation is nontrivial, because a saddle-point problem in general cannot be converted into optimization with a closed-form objective.
By the identification of as a usual log-likelihood function, we show that the objective functions and , with the linear predictor replaced by an arbitrary function , are Fisher consistent nonparametrically, i.e., maximization of their population versions leads to the true values. This seems to be the first time Fisher consistency of a loss function is established for semi-supervised classification. By some abuse of notation, denote
[TABLE]
Proposition 2.
Suppose that is drawn from in (5)–(7) for , with and for some fixed value and function . Denote . For any and function , we have
[TABLE]
where both equalities hold if and . Hence the population objective functions and are maximized at the true value and function .
Proposition 2 fills existing gaps in understanding maximum likelihood estimation in ETM model (5)–(8), through its equivalence with that in SLR model (12)–(14). If the ETM model is correctly specified, then the population version of has a global maximum at the true values of , and hence a global maximizer is consistent under suitable regularity conditions. If the ETM model is misspecified, then by theory of estimation with misspecified models (Manski 1988; White 1982), the MLE converges in probability to a limit value which minimizes the difference between with and . This difference as shown in the Supplement (Section IV.2) is the expected Kullback–Leibler divergence
[TABLE]
where is the conditional probability (12)–(14) for , is the Kullback–Leibler divergence between two probability vectors and , and denotes the expectation with respect to .
Finally, we point out another interesting property of SLR model (12)–(14). If is fixed as , the proportion of in the labeled sample, then . In this case, the conditional probability (14) reduces to a constant, and the objective function can be easily shown to be equivalent to the profile log-likelihood of derived from (4) in the exponential tilt model based on the labeled data only or equivalently the log-likelihood of as (2) from logistic regression based on the labeled data only, after the intercept shift . We show that the MLE from ETM model (5)–(8) or equivalently SLR model (12)–(14) is asymptotically more efficient than that from logistic regression based on the labeled data only.
Proposition 3.
Denote by the estimator of obtained by maximizing or equivalently by logistic regression based on the labeled data only. Then the asymptotic variance matrix of the MLE from ETM model (5)–(8) is no greater (in the usual order on positive-definite matrices) than that of under standard regularity conditions.
3.3 Regularized estimation and EM algorithm
The results in Section 3.2 provide theoretical support for the use of the objective functions and . In real applications, the MLE may not behave satisfactorily as predicted by standard asymptotic theory for various reasons. The labeled sample size may not be sufficiently large. The dimension of the feature vector or the complexity of functions of features may be too high, compared with the labeled and unlabeled data sizes. Therefore, we propose regularized estimation by adding suitable penalties to the objective functions.
For the coefficient vector , we employ a ridge penalty , although alternative penalties can also be allowed including a Lasso penalty. For the mixture proportion , we use a penalty in the form of the log density of a Beta distribution, , where and for a “center” and a “scale” . This choice is motivated by conceptual and computational simplicity in the EM algorithm to be discussed. Combining these penalties with gives the following penalized objective function
[TABLE]
Similarly, the penalized objective function based on is
[TABLE]
Maximization of (15) or (16) will be called profile or direct SLR respectively. The two methods in general lead to different estimates of when , although they can be shown to be equivalent similarly as in Proposition 1 when . In fact, as (i.e., is fixed as ), the estimator of from profile SLR is known to be asymptotically more efficient than that from direct SLR (Tan 2009).
We construct EM algorithms (Dempster et al. 1977) to numerically maximize (15) and (16). Of particular interest is that these algorithms shed light on the effect of the regularization introduced. Various other optimization techniques can also be exploited, because is directly of a closed form, and is defined only after univariate minimization in .
We describe some details about the EM algorithm for profile SLR. See the Supplement (Section III) for the corresponding algorithm for direct SLR. We return to the nonparametric log-likelihood (9) and introduce the following data augmentation. For , let such that and . Recall that and and hence and are fixed. Denote the penalty term in (15) or (16) as .
E-step. The expectation of the augmented objective given the current estimates is
[TABLE]
where .
M-step. The next estimates are obtained as a maximizer of the expected objective (17) with profiled out, that is, over all possible which is a probability measure supported on the pooled data with . In correspondence to , denote
[TABLE]
Instead of maximizing directly, we find a simple scheme for computing .
Proposition 4.
Let
[TABLE]
If and only if is a local (or global) maximizer of , then is a local (or respectively global) maximizer of .
Proposition 4 is useful both computationally and conceptually. First, is of a closed form, as a weighted average, with the weight depending on the scale , between the prior center and the empirical estimate , which would be obtained with or respectively. Moreover, can be equivalently computed by maximizing the objective function
[TABLE]
which is concave in and of a similar form to the log-likelihood (2) with a ridge penalty for logistic regression. Each imputed probability serves as a pseudo response.
In our implementation, the prior center is fixed as , the proportion of in the labeled sample, and the scales are treated as tuning parameters, to be selected by cross validation. Numerically, this procedure allows an adaptive interpolation between the two extremes: a fixed choice or an empirical estimate by maximum likelihood. For direct SLR (but not profile SLR), our adaptive procedure reduces to and hence accommodates logistic regression with labeled data only at one extreme with . See the Supplement (Section III) for further discussion.
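The interpolation between a fixed prior center and an empirical estimate of the mixture proportion can be illustrated with a deliberately simplified sketch. This is not the authors' algorithm: it holds the density ratio fixed (assumed known) and runs a generic MAP-EM for the mixture proportion only, under an assumed Beta prior with parameters `1 + scale * center` and `1 + scale * (1 - center)`. That parametrization, the function name, and the toy data are hypothetical choices made so that the update becomes a weighted average of the center and the mean imputed probability.

```python
import numpy as np

def em_mixture_proportion(log_ratio, center=0.5, scale=0.0, rho_init=0.5, n_iter=200):
    """MAP-EM for a two-component mixture proportion with a fixed density ratio.

    log_ratio: log of the component-1 over component-0 density at each unlabeled point.
    center, scale: hypothetical Beta-prior center and strength (scale=0 gives plain ML).
    """
    w = np.exp(log_ratio)
    rho = rho_init
    for _ in range(n_iter):
        # E-step: imputed probability that each unlabeled point is from component 1.
        tau = rho * w / (1.0 - rho + rho * w)
        # M-step: weighted average of the prior center and the mean imputed probability.
        rho = (tau.sum() + scale * center) / (len(w) + scale)
    return rho

# Toy unlabeled data: N(0, 1) versus N(2, 1), true mixture proportion 0.8.
rng = np.random.default_rng(2)
m, rho_true = 5000, 0.8
latent = rng.random(m) < rho_true
x_unl = rng.normal(size=m) + 2.0 * latent
log_ratio = 2.0 * x_unl - 2.0  # log N(x; 2, 1) - log N(x; 0, 1)

rho_mle = em_mixture_proportion(log_ratio, scale=0.0)                 # empirical extreme
rho_shrunk = em_mixture_proportion(log_ratio, center=0.3, scale=1e9)  # nearly fixed at center
```

With `scale=0` the update is the plain maximum likelihood EM step, while a very large `scale` pins the proportion at `center`. Intermediate values move between the two, which mirrors the role described above for the scale as a tuning parameter.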
4 Related work
There is a vast literature on semi-supervised learning. See, for example, Chapelle et al. (2006) and Zhu (2008). Due to space limitations, we discuss only the work most directly related to ours.
Generative models and EM. A generative model can be postulated for jointly such that , where denotes the label proportion and denotes the parameters associated with the feature distributions given labels (e.g., Nigam et al. 2000). In our notation, a generative model corresponds to Eqs (5)–(7), but with both and parametrically specified. For training by EM algorithms, the expected objective in the E-step is similar to in (17), except that is replaced by for or 1. The performance of generative modeling can be sensitive to whether the model assumptions are correct or not (Cozman et al. 2003). In this regard, our approach based on ETM models is attractive in only specifying a parametric form (8) for the density ratio between and while leaving the distribution nonparametric.
Logistic regression and EM. There have been various efforts to extend logistic regression in an EM style for semi-supervised learning. Notably, Amini & Gallinari (2002) proposed a classification EM algorithm using logistic regression (1), which can be described as follows:
- E-step: Compute . Fix and .
- C-step: Let if and 0 otherwise. Fix and .
- M-step: Compute by maximizing the objective .
Although convergence of classification EM was studied for clustering (Celeux & Govaert 1992), it seems unclear what objective function is optimized by the preceding algorithm. A worrisome phenomenon we notice is that if soft classification is used instead of hard classification, then the algorithm merely optimizes the log-likelihood of logistic regression with the labeled data only. By comparing (19) and (20), this modified algorithm can be shown to reduce to our EM algorithm with and clamped at , the proportion of in the labeled sample.
Proposition 5.
If the objective in the M-step is modified with replaced by as
[TABLE]
then converges as to the MLE of logistic regression based on the labeled data only.
We notice that the conclusion also holds if (20) is replaced by the cost function proposed in Wang et al. (2009), Eq (2), when the logistic loss is used as the cost function on labeled data.
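The phenomenon behind Proposition 5 can be checked numerically in a small sketch. The code below is an illustration under assumptions, not the authors' experiment: it fits logistic regression on simulated labeled data by Newton's method, imputes soft labels for unlabeled points from that fit, and refits on the combined data with the soft labels as pseudo-responses. It verifies only the fixed-point property (the labeled-only fit is reproduced), not convergence from arbitrary starting values. The coefficient values, sample sizes, and the tiny ridge term added for numerical stability are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Newton iterations for logistic regression; y may hold fractional pseudo-responses."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ theta)
        grad = X.T @ (y - mu) - ridge * theta
        hess = (X * (mu * (1.0 - mu))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(1)
theta_true = np.array([-0.5, 1.0, -1.0])  # hypothetical values; first entry is the intercept

def draw_features(n):
    # Design matrix with an intercept column and two Gaussian features.
    return np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

X_lab = draw_features(200)
y_lab = (rng.random(200) < sigmoid(X_lab @ theta_true)).astype(float)
X_unl = draw_features(1000)

theta_lab = fit_logistic(X_lab, y_lab)        # labeled data only
soft = sigmoid(X_unl @ theta_lab)             # E-step with soft classification
theta_next = fit_logistic(                    # M-step on labeled plus soft-labeled data
    np.vstack([X_lab, X_unl]), np.concatenate([y_lab, soft])
)
max_change = np.max(np.abs(theta_next - theta_lab))
```

The refit leaves the coefficients unchanged up to numerical error, because each soft-labeled point contributes a score of zero at the parameter value used to impute it. In this sketch the unlabeled data therefore carry no information under soft classification.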
Regularized methods. Various methods have been proposed by introducing a regularizer depending on unlabeled data to the log-likelihood of logistic regression with labeled data. Examples include entropy regularization (Grandvalet & Bengio 2005), expectation regularization (Mann & McCallum 2007), and graph-based priors (Krishnapuram et al. 2005). An important difference from our methods is that these penalized objective functions seem to be Fisher consistent only when they reduce to the log-likelihood of logistic regression with labeled data alone, regardless of unlabeled data. As another difference, the class proportions in unlabeled data are implicitly assumed to be the same as in labeled data in entropy regularization, and need to be explicitly estimated from labeled data or external knowledge in the case of label regularization (Mann & McCallum 2007).
5 Numerical experiments
We report experiments on 15 benchmark datasets including 11 UCI datasets and 4 SSL benchmark datasets. We compare our methods, profile SLR (pSLR) and direct SLR (dSLR), with 2 supervised methods, ridge logistic regression (RLR) and SVM, and 2 semi-supervised methods, entropy regularization (ER) (Grandvalet & Bengio 2005) and transductive SVM (TSVM) (Joachims 1999). For each method, only linear predictors are studied. All tuning parameters are selected by 5-fold cross validation. See the Supplement (Section V) for details about the datasets and implementations.
For each dataset except SPAM, a training set is obtained as follows: labeled data are sampled for a certain size (25 or 100) and fixed class proportions and then unlabeled data are sampled such that the labeled and unlabeled data combined are 2/3 of the original dataset. The remaining 1/3 of the dataset is used as a test set. For SPAM, the preceding procedure is applied to a subsample of size 750 from the original dataset. To allow different class proportions between labeled and unlabeled data, we consider two schemes: the class proportions in the labeled data are close to those of the original dataset (“Homo Prop”), or larger (or smaller) than the latter by an odds ratio of 4 (“Flip Prop”) if the odds of positive versus negative labels is (or respectively ) in the original dataset. Hence the class balance constraint as used in TSVM is misspecified in the second scheme.
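As a small worked example of shifting class proportions by an odds ratio, the arithmetic can be sketched as follows. The function name is hypothetical, and the direction of the shift (multiplying or dividing the odds by 4) is left to the caller, since the exact rule depends on thresholds not reproduced in the text above.

```python
def shift_proportion(p, odds_ratio):
    """Return the class proportion whose odds equal odds_ratio times the odds of p."""
    odds = p / (1.0 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# A balanced dataset shifted upward by an odds ratio of 4.
p_up = shift_proportion(0.5, 4.0)

# A dataset with 70% positives shifted downward by the same odds ratio.
p_down = shift_proportion(0.7, 1.0 / 4.0)
```

For instance, odds of 1 multiplied by 4 give a positive proportion of 0.8, and applying the inverse ratio to 0.8 recovers 0.5.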
Care is needed to define classifiers on test data. In the Homo Prop scheme, the 4 existing methods are applied as usual, and accordingly the classifiers from our methods are the sign of , where are the class sizes in the labeled training data. In the Flip Prop scheme, the classifiers from RLR, ER, and SVM are the sign of , and those from our methods are the sign of . Hence the intercepts of linear predictors are adjusted by assuming 1:1 class proportions in the test data. This assumption is often invalid in our experiments, but seems neutral when the actual class proportions in test data are unknown. The “linear predictor” is converted by logit from class probabilities for SVM, but this is currently unavailable for TSVM. Alternatively, class weights can be used in SVM, but this technique has not been developed for TSVM.
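The idea of intercept adjustment can be illustrated with the standard prior-shift correction for a logistic linear predictor. This is a sketch under that assumption and may differ in detail from the expressions used above; the function names and the `target_odds` argument are hypothetical.

```python
import math


def adjust_intercept(h, n_neg, n_pos, target_odds=1.0):
    """Shift a logistic linear predictor h(x), fitted on labeled data with
    class sizes (n_neg, n_pos), to a population with the given target odds
    of positive versus negative labels (assumed standard correction)."""
    return h - math.log(n_pos / n_neg) + math.log(target_odds)


def classify(h):
    """Classifier defined as the sign of the (adjusted) linear predictor."""
    return 1 if h >= 0 else -1
```

With labeled class sizes 20 (positive) and 80 (negative), a predictor value of 0 is shifted upward by log 4 under the 1:1 assumption, so a point on the unadjusted decision boundary is classified as positive after adjustment.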
Table 1 presents the results with labeled data size 100. See the Supplement for those with labeled data size 25 and AUC results. In the Homo Prop scheme, the logistic-type methods, RLR, ER, pSLR, and dSLR, perform similarly to each other, and noticeably better than SVM and TSVM in terms of accuracy achieved within 1% of the highest (in bold). While unstable performances of SVM and TSVM have been previously noticed (e.g., Li & Zhou 2015), such good performances of RLR and ER on these benchmark datasets appear not to have been reported before. In the Flip Prop scheme, our methods, dSLR and pSLR, achieve the best two performances, sometimes with considerable margins of improvement over other methods. In this case, all methods except TSVM are applied with intercept adjustment as described above. Because which proportion scheme holds may be unknown in practice, the results with intercept adjustment in the Homo Prop scheme are reported in the Supplement. Our methods continue to achieve close to the best performance among the methods studied.
6 Conclusion
We develop an extension of logistic regression for semi-supervised learning, with strong support from statistical theory, algorithms, and numerical results. There are various questions of interest for future work. Our approach can be readily extended by employing nonlinear predictors such as kernel representations or neural networks. Further experiments with such extensions are desired, as well as applications to more complex text and image classification.
References
Amini, M.R. & Gallinari, P. (2002) Semi-supervised logistic regression. Proceedings of the 15th European Conference on Artificial Intelligence, 390–394.
Bartlett, P., Jordan, M., & McAuliffe, J. (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138–156.
Celeux, G. & Govaert, G. (1992) A classification EM algorithm and two stochastic versions. Computational Statistics and Data Analysis, 14, 315–332.
Chapelle, O., Zien, A. & Schölkopf, B. (2006) Semi-Supervised Learning. MIT Press.
Cozman, F., Cohen, I. & Cirelo, M. (2003) Semi-supervised learning of mixture models. Proceedings of the 20th International Conference on Machine Learning, 99–106.
Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–22.
Grandvalet, Y., & Bengio, Y. (2005) Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems 17, 529–536.
Joachims, T. (1999) Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, 200–209.
Kiefer, J. & Wolfowitz, J. (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887–906.
Krishnapuram, B., Williams, D., Xue, Y., Carin, L., Figueiredo, M. & Hartemink, A.J. (2005) On semi-supervised classification. Advances in Neural Information Processing Systems 17, 721–728.
Li, Y.-F. & Zhou, Z.-H. (2015) Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 175–188.
Lin, Y. (2002) Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259–275.
Mann, G.S. & McCallum, A. (2007) Simple, robust, scalable semi-supervised learning via expectation regularization. Proceedings of the 24th International Conference on Machine Learning, 593–600.
Manski, C.F. (1988) Analog Estimation Methods in Econometrics. Chapman & Hall.
Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T. (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.
Owen, A.B. (2001) Empirical Likelihood. Chapman & Hall/CRC.
Prentice, R.L. & Pyke, R. (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
Qin, J. (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–630.
Qin, J. (1999) Empirical likelihood ratio based confidence intervals for mixture proportions. Annals of Statistics, 27, 1368–1384.
Tan, Z. (2009) A note on profile likelihood for exponential tilt mixture models. Biometrika, 96, 229–236.
Wang, J., Shen, X. & Pan, W. (2009) On efficient large margin semisupervised learning: Method and theory. Journal of Machine Learning Research, 10, 719–742.
Vapnik, V. (1998) Statistical Learning Theory. Wiley-Interscience.
White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Zhu, X.J. (2008) Semi-supervised learning literature survey. Technical Report, University of Wisconsin-Madison, Department of Computer Sciences.
Zou, F., Fine, J.P. & Yandell, B.S. (2002) On empirical likelihood for a semiparametric mixture model. Biometrika, 89, 61–75.
**Supplementary Material for “Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models”**
[TABLE]
I Introduction
We provide additional material to support the content of the paper. All equation and proposition numbers referred to are from the paper, except S1, S2, etc.
II Illustration
We provide a simple example to highlight comparison between new and existing methods. A labeled sample of size 100 is drawn, where 20 are from bivariate Gaussian, , with mean and diagonal variance matrix , and 80 are from bivariate Gaussian, , with mean and diagonal variance matrix . An unlabeled sample of size 1000 is drawn, where are from and from and then the labels are removed. This is similar to the Flip Prop scheme in numerical experiments in Section 5, where the class proportions in unlabeled data differ from those in labeled data. The training set including both labeled and unlabeled data is then rescaled such that the root mean square of each feature is 1, as shown in Figure S1.
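The rescaling step can be sketched as follows, assuming the training features are stored as a list of rows. The helper names are hypothetical, and the same scale factors can later be applied to any independent sample that must be transformed by the same scale as the training set.

```python
import math


def rms_rescale(rows):
    """Rescale each feature so that its root mean square over the pooled
    (labeled and unlabeled) training rows equals 1. Returns the rescaled
    rows and the per-feature scale factors."""
    n = len(rows)
    d = len(rows[0])
    rms = [math.sqrt(sum(r[j] ** 2 for r in rows) / n) for j in range(d)]
    scaled = [[r[j] / rms[j] for j in range(d)] for r in rows]
    return scaled, rms


def apply_scale(rows, rms):
    """Transform another sample by the scale factors of the training set."""
    return [[r[j] / rms[j] for j in range(len(rms))] for r in rows]
```

Note that this rescaling does not center the features; it only fixes the root mean square of each feature.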
Figure S2 (rows 1 to 5) shows the decision lines from, respectively, ridge logistic regression (RLR), entropy regularization (ER), SVM, TSVM, and direct SLR (dSLR). In the left column are the decision lines without intercept adjustment (corresponding to an assumption of 1:4 class proportions in test data as in labeled training data), and in the right column are those with intercept adjustment (corresponding to an assumption of 1:1 class proportions in test data as in unlabeled training data), as described in Section 5. In practice, the class proportions in test data may be unknown and hence some assumption is needed. (Footnote 1: The assumption of 1:1 class proportions in test data is used to define classifiers in the Flip Prop scheme in Section 5, even though this assumption is violated for a majority of datasets studied; see Table S1.) For ease of comparison, the intercept adjustment is directly applied to for SVM and TSVM, instead of the “linear predictor” converted by logit from class probabilities (if available), which would yield a nonlinear decision boundary. Alternatively, class weights can be used in SVM to account for differences in class proportions between training and test data. But this technique has not been developed for TSVM.
For each method, eight decision lines (black or blue) are plotted, by using 8 values of a tuning parameter. Some of the lines may fall outside the plot region. The blue lines correspond to the least amount of penalization used, that is, smallest , and and largest . See Section V later for a description of the tuning parameters involved. For RLR, is varied uniformly from . For ER, is varied uniformly from from 1, while is fixed at 0, to isolate the effect of entropy regularization. For SVM and TSVM, is varied uniformly from . For TSVM, the parameter is automatically tuned when using SVMlight (Joachims 1999). For dSLR, is varied uniformly from , while is fixed at 0.
Two oracle lines are drawn in each plot. The red line is computed by logistic regression and the purple line is computed by SVM with , from an independent labeled sample of size 4000 with 1:4 class proportions (left column) or 1:1 class proportions (right column), which is transformed by the same scale as the original training set. The red and purple oracle lines differ only slightly in the left column, but are virtually identical in the right column. It should be noted that these oracle lines are not the optimal, Bayes decision boundary, because the log density ratio between the classes is linear in but nonlinear in due to the different variances of .
From these plots, we make the following comparisons. First, the least penalized line (blue) from our method dSLR is much closer to the oracle lines (red and purple) than those from the other methods, whether or not intercept adjustment is applied. This gives numerical support for Fisher consistency of our method, given that the labeled size 100 and unlabeled size 1000 are reasonably large compared with the feature dimension 2. On the other hand, in spite of the relatively large labeled size, the lines from non-penalized logistic regression and SVM based on labeled data alone still differ noticeably from the oracle lines. Hence this also shows that our method can exploit unlabeled data together with labeled data to actually achieve a better approximation to the oracle lines.
Second, with suitable choices of tuning parameters, some of the decision lines from existing methods can be reasonably close to the oracle lines. In fact, such cases of good approximation can be found from the supervised methods RLR and SVM, but not from the semi-supervised methods ER and TSVM. This indicates potentially unstable performances of ER and TSVM, particularly in the current setting where the class proportions in unlabeled data differ from those in labeled data. Moreover, SVM seems to perform noticeably worse in the right column, possibly due to intercept adjustment, than in the left column, where the class proportions in test data underlying the oracle lines are identical to those in labeled training data (hence a more favorable setting).
III EM algorithm for direct SLR
We present an EM algorithm to numerically maximize (16) for direct SLR, based on the SLR model defined by (12)–(14). We introduce the following data augmentation. Given the pooled data , let
[TABLE]
Equivalently, can be denoted as , such that for . Similarly as in Section 3.3, and fixed, because and .
E-step. The expectation of the average penalized log-likelihood from the augmented data, given the current estimates is, up to an additive constant free of ,
[TABLE]
where .
M-step. The next estimates are obtained as a maximizer of the expected objective . Recall that defined in Section 3.3 is
[TABLE]
It directly follows that up to an additive constant. Therefore, the expected objective is related to in the profile method, in a similar manner as the average log-likelihood in the SLR model is related to the average profile log-likelihood in the ETM model before data augmentation.
Unfortunately, when is penalized with , there is no simple, closed-form expression for computing as in Proposition 4. Nevertheless, we show that can be obtained as a solution to a simple equation, independently of .
Proposition S1.
The estimate satisfies
[TABLE]
where because for any as shown in the proof of Proposition 1.
The formula (S2) shows that implicitly remains a weighted average of the prior center and the empirical estimate , with the weight depending on . If , then reduces to and hence the EM iterations coincide with those for profile SLR in Section 3.3. If , then becomes fixed at and then converges to a maximizer of , the ridge estimator of in the SLR model (12)–(14) with fixed. When is set to , this estimator of is identical to that from ridge logistic regression with labeled data only, except for an intercept shift.
In contrast, if is fixed in the EM algorithm for profile SLR, then converges to a maximizer of , the ridge estimator of in the ETM model (5)–(8).
IV Technical details
IV.1 Proof of Proposition 1
By some abuse of notation, denote as . Let be a fixed open set of . First, suppose that is a maximizer of over . Then for any . Denote . To prove that is a maximizer of over , we show that is a minimizer of , which then implies that achieves a maximum value at . Because is convex in , it suffices to show , where
[TABLE]
Because is a maximizer of , the stationary condition in yields
[TABLE]
where , , and . Eq (S4) is equivalent to
[TABLE]
Summing Eq (S3) multiplied by and Eq (S5) gives
[TABLE]
or equivalently
[TABLE]
Because and is concave in , Jensen’s inequality implies that . The inequality holds strictly, , because . Hence .
Next suppose that is a maximizer of over . Denote . We show that . Because is a solution to the saddle-point problem (11), the stationary condition in or equivalently Eq (10) gives
[TABLE]
The stationary condition in yields
[TABLE]
where , , and . Eq (S7) implies
[TABLE]
Eq (S8) is equivalent to
[TABLE]
Combining Eq (S6) multiplied by and summing Eq (S9) over and Eq (S10) shows , that is, . Let be a maximizer of over . The preceding proof shows that achieves the same maximum value as does over . Hence as the maximum value of is also the maximum value of over . Because and , this shows that is a maximizer of over .
IV.2 Proof of Proposition 2
Denote
[TABLE]
First, we show for any and . By direct calculation, notice that up to an additive constant,
[TABLE]
where , , and . Hence
[TABLE]
where is the Kullback–Leibler (KL) divergence between two probability vectors and .
Next we show that , that is, for any . By direct calculation, we obtain
[TABLE]
where the left hand side is the KL divergence between two probability distributions and .
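Both steps of the proof rest on the nonnegativity of the KL divergence. For reference, the standard argument via Jensen's inequality is sketched below for two generic probability vectors $p$ and $q$; these symbols are placeholders for illustration and are not the specific vectors appearing in the displays above.

```latex
\mathrm{KL}(p \,\|\, q)
  = \sum_{i:\,p_i>0} p_i \log\frac{p_i}{q_i}
  = -\sum_{i:\,p_i>0} p_i \log\frac{q_i}{p_i}
  \;\ge\; -\log\Big(\sum_{i:\,p_i>0} q_i\Big)
  \;\ge\; 0 .
```

The first inequality is Jensen's inequality applied to the concave function $\log$, and the second holds because the $q_i$ sum to at most 1 over the support of $p$. Equality holds if and only if $p = q$.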
IV.3 Proof of Proposition 3
By definition, is a maximizer of . By abuse of notation, rewrite as and hence as . Denote the log-likelihood, after rescaling, for logistic regression based on labeled data only as
[TABLE]
where and as before. The goal is to compare the asymptotic efficiency of and . We use Lemmas S1–S3 presented later in the subsection.
For notational simplicity, assume that is fixed as a constant as . The results can also be extended to the case where tends to a constant , as in previous asymptotic analysis (Qin 1999). Unless otherwise stated, are evaluated at the true values , and is evaluated at , where , , and .
By Lemma S2, it suffices to show
[TABLE]
where , , and are from Lemma S2. For in Lemma S2, the inequality
[TABLE]
implies
[TABLE]
Substituting the result of Lemma S3 into Eq (S12) yields Eq (S11).
In the following, we present the three lemmas used above. See Sections IV.4–IV.6 for proofs. Denote by the function . As above, are evaluated at .
Lemma S1.
(i) As , we have
[TABLE]
in probability, where
[TABLE]
(ii) Denote . As , converges to multivariate normal with mean 0 and variance matrix
[TABLE]
Lemma S2.
(i) Under standard regularity conditions, converges in distribution to , with , where
[TABLE]
(ii) Under standard regularity conditions, converges in distribution to , with
[TABLE]
where
[TABLE]
Lemma S3.
The inner product of and equals , i.e.,
[TABLE]
IV.4 Proof of Lemma S1
(i) We give the calculation of as an example. The remaining elements in can be calculated in a similar way. First, direct calculation yields
[TABLE]
Because are independent and identically drawn from
[TABLE]
we obtain
[TABLE]
where the simplification in the second term on the right hand side uses
[TABLE]
(ii) For a vector , denote . We show the derivations of and as examples and the remaining elements in can be derived similarly. First, we calculate as
[TABLE]
where
[TABLE]
and
[TABLE]
Hence . Second, we calculate as
[TABLE]
where
[TABLE]
and
[TABLE]
Hence .
IV.5 Proof of Lemma S2
(i) Note that with and satisfying and . By implicit differentiation, the gradient and Hessian of are
[TABLE]
where is treated as with and .
We use similar arguments as in the proof of Proposition 2 in Tan (2009). Write as a block matrix
[TABLE]
where is the right-bottom diagonal matrix with diagonal elements and . By the asymptotic theory of M-estimators, the equation admits a solution with . More specifically,
[TABLE]
By a Taylor expansion of ( in Eq (S13), with around , we find
[TABLE]
By the law of large numbers, and converge in probability to and respectively as . With , we have . Then, as , converges to multivariate normal with mean zero and variance matrix by Lemma S1(ii),
[TABLE]
The simplification follows because and
[TABLE]
Moreover, by Lemma S1(i) and Eq (S14), converges in probability as to , which is identical to . Hence converges in distribution to .
(ii) The result follows from the sandwich variance for M-estimation and direct calculation.
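For readers less familiar with the sandwich variance, its generic form is recalled here in placeholder notation that is not taken from the paper: $\hat\theta$ maximizes $n^{-1}\sum_{i=1}^n m(Z_i;\theta)$ for independent and identically distributed $Z_i$, and $\theta^*$ denotes the maximizer of the population objective (see White 1982).

```latex
\sqrt{n}\,(\hat\theta - \theta^*) \;\to\; N\big(0,\; H^{-1} G H^{-1}\big)
\quad \text{in distribution},
\qquad
H = -E\big[\nabla_\theta^2\, m(Z;\theta^*)\big],
\quad
G = \mathrm{Var}\big[\nabla_\theta\, m(Z;\theta^*)\big].
```

When $m$ is the log-density of a correctly specified model, $G = H$ and the variance reduces to $H^{-1}$. The specific matrices in Lemma S2(ii) would be obtained by evaluating the analogues of $H$ and $G$ for the objective at hand.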
IV.6 Proof of Lemma S3
By Lemma S2, we have
[TABLE]
where the second equality holds because is based on labeled data and only and hence independent of
[TABLE]
It suffices to show that the two inner products on the right-hand side of Eq (S15) are
[TABLE]
The calculation proceeds in a similar way as in the proof of Lemma S1. Because , , and all have means [math] and are independent and identically drawn from , we have
[TABLE]
where
[TABLE]
For the first inner product, we calculate
[TABLE]
For the second inner product, we calculate
[TABLE]
Putting the foregoing results together, we obtain Eqs (S16) and (S17).
IV.7 Proof of Proposition 4
Similarly as in Proposition 1 in Tan (2009) or Lemma 1, it can be shown by Jensen’s inequality that
[TABLE]
where is a minimizer of over , satisfying Eq (10). Then is a maximizer of over , independently of , by direct calculation of the gradient. Hence it suffices to show that if and only if is a local (or global) maximizer of , then it is a local (or respectively global) maximizer of .
By some abuse of notation, denote as . Let be a fixed open set of . Suppose that is a maximizer of over . Then for any . To prove is a maximizer of over , we show that is a minimizer of , which then implies that achieves a maximum value at . Because is a maximizer of , the stationary condition in yields
[TABLE]
Combined with the definition of in (18), this shows that satisfies
[TABLE]
which is the stationary condition for minimization of , convex in .
Next suppose that is a maximizer of over . Then is a solution to the saddle-point problem, . The stationary condition in gives
[TABLE]
The stationary condition in yields
[TABLE]
These two equations together imply that . Then the stationary condition for to be a saddle point of gives
[TABLE]
Because is concave in as mentioned in Section 3.3, this implies that is a maximizer of over .
IV.8 Proof of Proposition 5
By construction, the estimate satisfies the stationary condition for maximization of (20):
[TABLE]
Let be the limit of the sequence as . Then satisfies
[TABLE]
because for , or if . This is precisely the score equation for the MLE of in logistic regression based on the labeled data only.
IV.9 Proof of Proposition S1
Rewrite as and as . Denote and
[TABLE]
Suppose is the maximizer of . The stationary conditions in give
[TABLE]
where and . Taking a difference between Eq (S18) multiplied by and Eq (S19) yields
[TABLE]
In addition, Eq (S19) is equivalent to
[TABLE]
Combining Eq (S20), Eq (S21) and the definitions of and leads to Eq (S2).
V Experiment details
The 11 UCI datasets are available from https://archive.ics.uci.edu/ml/datasets.php and the 4 SSL benchmark datasets are from http://olivier.chapelle.cc/ssl-book/benchmarks.html. Table S1 gives the statistics of the datasets.
Each dataset is randomly divided into training and test data as described in Section 5. For the training set including labeled and unlabeled data, each feature is standardized to have mean 0 and variance 1. No further standardization is performed during cross validation.
The methods RLR, ER, SVM, and TSVM are implemented using the following computer packages respectively:
- glmnet, https://cran.r-project.org/web/packages/glmnet/index.html,
- RSSL, https://cran.r-project.org/web/packages/RSSL/index.html,
- libsvm, https://www.csie.ntu.edu.tw/~cjlin/libsvm/, and
- SVMlight, http://svmlight.joachims.org/.
Our methods, pSLR and dSLR, are implemented using R. The codes are available from the authors upon request.
For each method, the tuning parameters are selected by 5-fold cross validation over 8 possible values as follows. The search range for each tuning parameter is determined from exploratory experiments.
- RLR: The objective function for RLR is , where is the negative likelihood function for logistic regression on labeled data. Possible values for the log ridge parameter are fixed uniformly from for UCI datasets and from for SSL benchmark datasets.
- ER: The objective function for ER is where is the same as in RLR and is the entropy regularizer on the unlabeled data. Possible values for are fixed in the same manner as in RLR, and values for the entropy parameter are fixed uniformly from for all datasets.
- pSLR and dSLR: Recall that the penalty function is Possible values for the ridge parameter are fixed in the same manner as in RLR and ER, and values for are fixed uniformly from for all datasets.
- SVM: SVM solves the following optimization problem
[TABLE]
where for . Possible values for are fixed uniformly from for all datasets.
- TSVM: TSVM with the class balance constraint solves the following optimization problem
[TABLE]
where is the predicted label for . Possible values for are fixed uniformly from for all datasets. The parameter is automatically tuned in the implementation of SVMlight.
Logistic-type methods, RLR, ER, pSLR, and dSLR, are cross validated over the binomial deviance based on the labeled data, and SVM-type methods, SVM and TSVM, are cross validated over the accuracy. For our methods, the binomial deviance is computed on the CV test set, using the coefficient vector , where are the class sizes of labeled data in the CV training set. Because the performance measures (binomial deviance and accuracy) are based on labeled data only, the entire set of unlabeled data is used without split in training during CV for semi-supervised methods. In the case of a tie, the smaller or will be selected for RLR, SVM and TSVM. For ER, pSLR, and dSLR, the smaller and the larger or will be selected.
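The selection procedure can be sketched as follows. This is a minimal illustration, not the R implementation used for the experiments: the function names are hypothetical, the deviance is scaled as twice the average negative log-likelihood (an assumed convention), labels are coded as 0 or 1, and the folds are formed without shuffling. Only the labeled indices are split into folds; the unlabeled data would be passed whole to each training call.

```python
import math


def binomial_deviance(labels, linear_predictors):
    """Average binomial deviance of linear predictors on labels in {0, 1}."""
    total = 0.0
    for y, h in zip(labels, linear_predictors):
        z = h if y == 1 else -h
        # Numerically stable evaluation of log(1 + exp(-z)).
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return 2.0 * total / len(labels)


def labeled_folds(n_labeled, n_folds=5):
    """Assign labeled indices to CV folds in round-robin order."""
    return [list(range(k, n_labeled, n_folds)) for k in range(n_folds)]


def select_smallest_on_tie(grid, cv_scores, lower_is_better=True, tol=0.0):
    """Pick the tuning value with the best CV score; among values tied
    (within tol) for the best score, return the smallest one."""
    best = min(cv_scores) if lower_is_better else max(cv_scores)
    tied = [g for g, s in zip(grid, cv_scores) if abs(s - best) <= tol]
    return min(tied)
```

Ties are more likely under accuracy than under deviance, because accuracy takes only finitely many values on a small labeled fold.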
VI Additional experiment results
For labeled training data size 100, Table S2 presents the accuracy results where intercept adjustment is applied in the Homo Prop scheme, but not applied in the Flip Prop scheme. Table S3 presents the AUC results, which are not affected by whether intercept adjustment is applied. Comparison between Tables 1 and S2 shows that, with intercept adjustment versus no adjustment, the accuracies of the methods, RLR, ER, pSLR, dSLR, and SVM, are decreased only slightly in the Homo Prop scheme, but become substantially improved in the Flip Prop scheme.
For labeled training data size 25, Table S4 presents the accuracy results similarly as in Table 1, where intercept adjustment is not applied in the Homo Prop scheme, but applied in the Flip Prop scheme. Table S5 presents the accuracy results where intercept adjustment is applied in the Homo Prop scheme, but not applied in the Flip Prop scheme. Table S6 presents the AUC results, which are not affected by whether intercept adjustment is applied.
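The invariance of AUC to intercept adjustment follows because AUC depends only on the ranking of the linear predictors, which a constant shift does not change, whereas accuracy depends on the sign. A small self-contained check is sketched below; the labels and scores are made-up toy values for illustration only, not results from the experiments.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores):
    """Accuracy of the classifier that predicts 1 when the score is >= 0."""
    preds = [1 if s >= 0 else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy example (hypothetical values): shift every score by the same constant.
labels = [1, 1, 0, 0, 0]
scores = [0.5, -0.25, -0.5, 0.25, -1.0]
shifted = [s - 0.5 for s in scores]
```

In this toy example the AUC is 5/6 both before and after the shift, while the accuracy changes from 0.6 to 0.8.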
