Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models

Xinwei Zhang; Zhiqiang Tan

arXiv:1906.07882 · stat.ML · June 20, 2019

TL;DR

This paper introduces a semi-supervised logistic learning approach utilizing exponential tilt mixture models, enhancing classification accuracy by effectively leveraging both labeled and unlabeled data.

Contribution

It develops a novel semi-supervised logistic method based on exponential tilt models, with new objective functions, regularized estimation, and interpretable EM algorithms.

Findings

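1. In numerical experiments, the proposed methods achieve a substantial advantage over existing semi-supervised classifiers when the class proportions in the unlabeled data differ from those in the labeled data.

2. The new objective functions are shown to be Fisher consistent.

3. Regularized estimation is carried out with simple, interpretable EM algorithms.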

Abstract

Consider semi-supervised learning for classification, where both labeled and unlabeled data are available for training. The goal is to exploit both datasets to achieve higher prediction accuracy than just using labeled data alone. We develop a semi-supervised logistic learning method based on exponential tilt mixture models, by extending a statistical equivalence between logistic regression and exponential tilt modeling. We study maximum nonparametric likelihood estimation and derive novel objective functions which are shown to be Fisher consistent. We also propose regularized estimation and construct simple and highly interpretable EM algorithms. Finally, we present numerical results which demonstrate the advantage of the proposed methods compared with existing methods.

Tables

Table 1: Classification accuracy in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 100. Subscript a indicates that intercept adjustment is applied (see the text).

Homo Prop	RLR	ER	pSLR	dSLR	SVM	TSVM
AUSTRA	85.37 $\pm$ 2.00	85.50 $\pm$ 1.94	85.43 $\pm$ 2.07	85.33 $\pm$ 2.03	85.37 $\pm$ 1.96	85.15 $\pm$ 1.79
BCW	95.76 $\pm$ 1.04	95.64 $\pm$ 1.08	95.71 $\pm$ 1.07	95.80 $\pm$ 1.07	96.13 $\pm$ 1.04	96.44 $\pm$ 0.92
GERMAN	72.16 $\pm$ 2.69	72.22 $\pm$ 2.60	72.39 $\pm$ 2.71	72.12 $\pm$ 2.68	70.65 $\pm$ 2.85	69.01 $\pm$ 3.94
HEART	80.94 $\pm$ 3.56	81.04 $\pm$ 4.23	81.46 $\pm$ 3.41	80.94 $\pm$ 3.78	80.00 $\pm$ 4.35	80.36 $\pm$ 5.00
IONO	84.38 $\pm$ 2.29	84.17 $\pm$ 2.16	85.00 $\pm$ 1.21	84.38 $\pm$ 2.29	83.92 $\pm$ 2.51	83.33 $\pm$ 3.18
LIVER	65.57 $\pm$ 4.37	65.83 $\pm$ 4.20	66.35 $\pm$ 4.65	66.22 $\pm$ 4.22	67.87 $\pm$ 3.22	64.83 $\pm$ 5.00
PIMA	74.71 $\pm$ 2.99	75.06 $\pm$ 3.07	75.02 $\pm$ 2.90	74.71 $\pm$ 3.04	74.65 $\pm$ 2.69	72.58 $\pm$ 3.97
SPAM	87.98 $\pm$ 2.70	88.04 $\pm$ 2.95	87.90 $\pm$ 2.49	87.94 $\pm$ 2.70	85.74 $\pm$ 3.90	87.04 $\pm$ 5.37
VEHICLE	93.10 $\pm$ 2.73	92.59 $\pm$ 2.80	92.45 $\pm$ 2.83	93.24 $\pm$ 2.82	92.38 $\pm$ 3.31	93.03 $\pm$ 3.39
VOTES	93.66 $\pm$ 2.55	93.59 $\pm$ 2.59	93.66 $\pm$ 2.32	93.45 $\pm$ 2.59	94.17 $\pm$ 2.57	94.03 $\pm$ 2.92
WDBC	95.92 $\pm$ 1.65	95.61 $\pm$ 1.70	95.89 $\pm$ 1.48	95.92 $\pm$ 1.65	95.67 $\pm$ 1.55	96.06 $\pm$ 1.31
BCI	66.50 $\pm$ 4.06	65.83 $\pm$ 3.80	65.86 $\pm$ 4.89	65.86 $\pm$ 4.40	68.46 $\pm$ 5.01	67.48 $\pm$ 5.20
COIL	78.95 $\pm$ 3.15	78.96 $\pm$ 3.24	79.07 $\pm$ 3.89	78.70 $\pm$ 3.42	80.10 $\pm$ 2.53	81.39 $\pm$ 2.38
DIGIT1	89.90 $\pm$ 1.11	89.29 $\pm$ 2.70	90.00 $\pm$ 1.18	89.87 $\pm$ 1.16	89.30 $\pm$ 1.33	89.73 $\pm$ 1.45
USPS	85.39 $\pm$ 2.38	85.62 $\pm$ 2.26	85.97 $\pm$ 1.91	85.54 $\pm$ 2.45	85.50 $\pm$ 2.13	84.71 $\pm$ 2.27
Average accuracy	83.35	83.27	83.48	83.33	83.34	83.01
# within 1% of highest	12/15	12/15	12/15	12/15	10/15	7/15

Table 2. Table S1: Statistics for data sets in numerical experiments

No	Data	# of obs	# of positive	# of negative	% of positive	feature dim
1	AUSTRA	690	383	307	55.51	14
2	BCW	683	444	239	65.01	9
3	GERMAN	1000	700	300	70.00	24
4	HEART	297	137	160	46.13	13
5	IONO	331	126	225	38.07	34
6	LIVER	345	145	200	42.03	6
7	PIMA	768	500	268	65.10	8
8	SPAM	4601	2788	1813	60.60	57
9	VEHICLE	435	218	217	50.11	18
10	VOTES	435	257	168	59.08	16
11	WDBC	569	357	212	62.74	31
12	BCI	400	200	200	50.00	117
13	COIL	1500	750	750	50.00	241
14	DIGIT1	1500	766	734	51.07	241
15	USPS	1500	1200	300	80.00	241

Table 3. Table S2: Classification accuracy in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 100. Subscript a indicates that intercept adjustment is applied. Compared with Table 1, intercept adjustment is applied in the Homo Prop scheme, but not in the Flip Prop scheme.

Homo Prop	RLR_a	ER_a	pSLR_a	dSLR_a	SVM_a
AUSTRA	85.48 $\pm$ 1.81	85.74 $\pm$ 1.92	85.39 $\pm$ 1.89	85.50 $\pm$ 1.79	85.22 $\pm$ 2.05
BCW	96.31 $\pm$ 1.08	96.09 $\pm$ 1.07	96.22 $\pm$ 1.16	96.29 $\pm$ 1.07	96.62 $\pm$ 1.00
GERMAN	66.40 $\pm$ 3.09	66.73 $\pm$ 3.02	67.15 $\pm$ 2.84	66.23 $\pm$ 3.12	63.89 $\pm$ 7.30
HEART	80.26 $\pm$ 4.20	80.73 $\pm$ 4.39	80.94 $\pm$ 3.48	80.21 $\pm$ 4.17	79.69 $\pm$ 4.37
IONO	82.08 $\pm$ 2.74	83.79 $\pm$ 3.00	81.88 $\pm$ 3.22	82.08 $\pm$ 2.74	80.67 $\pm$ 3.73
LIVER	62.91 $\pm$ 6.11	63.04 $\pm$ 6.39	62.91 $\pm$ 6.27	63.22 $\pm$ 6.32	64.83 $\pm$ 3.40
PIMA	72.97 $\pm$ 3.72	73.01 $\pm$ 3.51	72.93 $\pm$ 3.71	72.77 $\pm$ 3.85	72.40 $\pm$ 2.55
SPAM	88.52 $\pm$ 2.38	89.06 $\pm$ 2.97	88.60 $\pm$ 2.75	88.44 $\pm$ 2.34	86.68 $\pm$ 4.28
VEHICLE	93.10 $\pm$ 2.73	92.59 $\pm$ 2.80	92.45 $\pm$ 2.83	93.24 $\pm$ 2.82	92.45 $\pm$ 3.60
VOTES	93.59 $\pm$ 2.56	93.62 $\pm$ 2.37	93.48 $\pm$ 2.44	93.48 $\pm$ 2.66	94.00 $\pm$ 2.57
WDBC	95.28 $\pm$ 1.73	95.58 $\pm$ 1.70	95.58 $\pm$ 1.48	95.19 $\pm$ 1.78	95.75 $\pm$ 2.11
BCI	66.50 $\pm$ 4.06	65.83 $\pm$ 3.80	65.86 $\pm$ 4.89	65.86 $\pm$ 4.40	69.10 $\pm$ 5.10
COIL	78.95 $\pm$ 3.15	78.96 $\pm$ 3.24	79.07 $\pm$ 3.89	78.70 $\pm$ 3.42	80.22 $\pm$ 2.62
DIGIT1	89.77 $\pm$ 1.11	89.32 $\pm$ 2.67	90.07 $\pm$ 1.16	89.88 $\pm$ 1.10	89.47 $\pm$ 1.40
USPS	81.39 $\pm$ 4.31	81.91 $\pm$ 3.99	81.06 $\pm$ 3.95	81.68 $\pm$ 4.31	80.63 $\pm$ 3.88
Average accuracy	82.23	82.40	82.24	82.18	82.11
# within 1% of highest	11/15	12/15	11/15	11/15	10/15

Table 4. Table S3: Classification AUC in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 100. The AUC is not affected by whether intercept adjustment is applied.

Homo Prop	RLR	ER	pSLR	dSLR	SVM
AUSTRA	91.22 $\pm$ 1.70	91.52 $\pm$ 1.46	91.23 $\pm$ 1.62	91.21 $\pm$ 1.70	91.09 $\pm$ 2.03
BCW	99.18 $\pm$ 0.52	99.23 $\pm$ 0.47	99.15 $\pm$ 0.55	99.18 $\pm$ 0.52	99.30 $\pm$ 0.38
GERMAN	72.20 $\pm$ 3.12	72.49 $\pm$ 2.99	72.46 $\pm$ 2.99	72.20 $\pm$ 3.13	69.44 $\pm$ 10.22
HEART	89.03 $\pm$ 3.06	88.96 $\pm$ 3.34	88.96 $\pm$ 3.13	89.05 $\pm$ 3.06	87.90 $\pm$ 3.67
IONO	86.36 $\pm$ 0.81	85.67 $\pm$ 4.99	86.66 $\pm$ 0.61	86.36 $\pm$ 0.81	83.03 $\pm$ 5.86
LIVER	67.98 $\pm$ 6.07	67.90 $\pm$ 6.11	68.24 $\pm$ 6.07	68.37 $\pm$ 6.05	70.91 $\pm$ 2.58
PIMA	79.88 $\pm$ 3.86	80.03 $\pm$ 3.84	79.89 $\pm$ 3.82	79.83 $\pm$ 3.83	79.64 $\pm$ 3.36
SPAM	93.83 $\pm$ 2.11	94.26 $\pm$ 2.07	93.58 $\pm$ 2.24	93.79 $\pm$ 2.14	92.36 $\pm$ 3.21
VEHICLE	97.28 $\pm$ 1.75	96.19 $\pm$ 2.11	96.21 $\pm$ 2.13	97.28 $\pm$ 1.77	95.86 $\pm$ 2.97
VOTES	98.10 $\pm$ 1.13	98.03 $\pm$ 1.08	98.07 $\pm$ 1.14	98.05 $\pm$ 1.17	97.97 $\pm$ 1.25
WDBC	99.12 $\pm$ 0.81	99.10 $\pm$ 0.82	99.13 $\pm$ 0.77	99.08 $\pm$ 0.85	99.14 $\pm$ 0.78
BCI	73.17 $\pm$ 4.33	72.82 $\pm$ 3.98	72.45 $\pm$ 4.69	72.42 $\pm$ 4.25	76.11 $\pm$ 5.18
COIL	85.18 $\pm$ 3.63	85.22 $\pm$ 3.84	85.48 $\pm$ 3.83	85.07 $\pm$ 3.55	84.86 $\pm$ 2.72
DIGIT1	96.69 $\pm$ 0.70	96.32 $\pm$ 1.51	96.70 $\pm$ 0.69	96.69 $\pm$ 0.70	96.40 $\pm$ 0.76
USPS	86.12 $\pm$ 2.89	86.28 $\pm$ 2.82	86.07 $\pm$ 2.90	86.10 $\pm$ 2.87	83.47 $\pm$ 4.63
Average AUC	87.69	87.60	87.62	87.65	87.17
# within 1% of highest	13/15	12/15	12/15	13/15	9/15

Table 5. Table S4: Classification accuracy in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 25. Subscript a indicates that intercept adjustment is applied, as in Table 1.

Homo Prop	RLR	ER	pSLR	dSLR	SVM	TSVM
AUSTRA	82.37 $\pm$ 2.47	81.80 $\pm$ 2.56	81.76 $\pm$ 3.05	81.35 $\pm$ 3.62	80.57 $\pm$ 3.87	80.46 $\pm$ 4.84
BCW	94.84 $\pm$ 2.08	94.76 $\pm$ 2.72	95.82 $\pm$ 1.65	94.62 $\pm$ 2.11	94.20 $\pm$ 2.22	96.67 $\pm$ 0.81
GERMAN	69.50 $\pm$ 2.22	69.76 $\pm$ 2.36	69.83 $\pm$ 2.10	69.47 $\pm$ 2.40	68.03 $\pm$ 4.02	63.89 $\pm$ 5.09
HEART	79.48 $\pm$ 3.90	79.01 $\pm$ 4.20	78.80 $\pm$ 3.74	79.27 $\pm$ 3.97	78.54 $\pm$ 4.44	76.72 $\pm$ 4.22
IONO	77.21 $\pm$ 5.97	75.33 $\pm$ 6.75	76.04 $\pm$ 6.77	76.38 $\pm$ 6.12	77.04 $\pm$ 6.51	76.71 $\pm$ 6.98
LIVER	57.70 $\pm$ 6.40	56.91 $\pm$ 5.20	56.78 $\pm$ 5.29	57.70 $\pm$ 6.25	60.17 $\pm$ 6.63	58.04 $\pm$ 8.68
PIMA	68.09 $\pm$ 3.61	67.66 $\pm$ 3.67	67.03 $\pm$ 3.73	67.83 $\pm$ 3.68	67.60 $\pm$ 3.62	65.25 $\pm$ 4.95
SPAM	82.76 $\pm$ 3.35	83.10 $\pm$ 3.67	83.74 $\pm$ 2.57	82.68 $\pm$ 3.28	81.28 $\pm$ 3.77	85.40 $\pm$ 2.81
VEHICLE	73.93 $\pm$ 6.71	73.90 $\pm$ 7.06	76.00 $\pm$ 7.20	74.31 $\pm$ 6.32	79.21 $\pm$ 5.01	76.28 $\pm$ 8.24
VOTES	92.10 $\pm$ 3.68	91.34 $\pm$ 3.63	91.93 $\pm$ 3.64	91.90 $\pm$ 3.63	91.62 $\pm$ 3.36	92.07 $\pm$ 3.28
WDBC	92.17 $\pm$ 3.21	89.94 $\pm$ 7.91	92.28 $\pm$ 3.83	91.83 $\pm$ 3.37	91.39 $\pm$ 2.95	92.56 $\pm$ 2.51
BCI	56.88 $\pm$ 4.98	54.96 $\pm$ 4.97	55.45 $\pm$ 5.32	56.32 $\pm$ 5.08	56.02 $\pm$ 5.26	54.85 $\pm$ 4.67
COIL	60.62 $\pm$ 5.47	58.63 $\pm$ 6.85	57.18 $\pm$ 7.21	61.29 $\pm$ 5.25	63.65 $\pm$ 6.64	64.78 $\pm$ 6.83
DIGIT1	82.04 $\pm$ 3.83	82.09 $\pm$ 3.83	83.05 $\pm$ 4.13	82.08 $\pm$ 3.78	80.39 $\pm$ 4.82	84.74 $\pm$ 3.29
USPS	81.15 $\pm$ 2.67	81.13 $\pm$ 2.60	80.89 $\pm$ 2.66	81.34 $\pm$ 2.29	81.62 $\pm$ 2.96	77.57 $\pm$ 4.15
Average accuracy	76.72	76.02	76.44	76.56	76.76	76.40
# within 1% of highest	9/15	6/15	7/15	8/15	8/15	7/15

Table 6. Table S5: Classification accuracy in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 25. Subscript a indicates that intercept adjustment is applied. Compared with Table S4, intercept adjustment is applied in the Homo Prop scheme, but not in the Flip Prop scheme.

Homo Prop	RLR_a	ER_a	pSLR_a	dSLR_a	SVM_a
AUSTRA	81.98 $\pm$ 2.37	81.80 $\pm$ 2.58	81.54 $\pm$ 2.91	81.22 $\pm$ 4.09	80.87 $\pm$ 3.38
BCW	95.53 $\pm$ 1.81	95.24 $\pm$ 2.37	95.96 $\pm$ 1.43	95.36 $\pm$ 1.90	95.96 $\pm$ 1.39
GERMAN	59.94 $\pm$ 5.16	58.50 $\pm$ 7.01	57.16 $\pm$ 6.41	61.91 $\pm$ 4.19	47.36 $\pm$ 13.28
HEART	79.64 $\pm$ 4.12	78.85 $\pm$ 3.99	79.17 $\pm$ 3.87	79.53 $\pm$ 4.17	79.06 $\pm$ 4.39
IONO	78.67 $\pm$ 5.44	77.08 $\pm$ 6.41	75.79 $\pm$ 5.73	77.29 $\pm$ 6.08	69.17 $\pm$ 17.61
LIVER	54.91 $\pm$ 7.69	54.96 $\pm$ 7.99	55.78 $\pm$ 7.83	55.65 $\pm$ 8.35	52.91 $\pm$ 10.55
PIMA	65.51 $\pm$ 4.41	64.79 $\pm$ 5.17	65.43 $\pm$ 5.06	66.04 $\pm$ 4.35	56.05 $\pm$ 16.86
SPAM	82.98 $\pm$ 2.55	83.42 $\pm$ 3.15	84.18 $\pm$ 2.26	83.14 $\pm$ 2.74	82.46 $\pm$ 2.82
VEHICLE	74.00 $\pm$ 6.49	73.97 $\pm$ 6.56	76.10 $\pm$ 6.69	74.86 $\pm$ 5.84	76.55 $\pm$ 11.76
VOTES	91.34 $\pm$ 3.74	90.72 $\pm$ 3.81	91.24 $\pm$ 3.53	91.28 $\pm$ 3.75	90.38 $\pm$ 4.18
WDBC	92.78 $\pm$ 3.05	89.92 $\pm$ 8.05	92.58 $\pm$ 4.26	92.19 $\pm$ 3.37	91.83 $\pm$ 2.97
BCI	56.99 $\pm$ 4.25	55.49 $\pm$ 4.89	55.79 $\pm$ 5.39	56.32 $\pm$ 4.63	50.68 $\pm$ 8.15
COIL	60.39 $\pm$ 5.80	59.83 $\pm$ 6.48	58.36 $\pm$ 7.09	60.49 $\pm$ 5.98	58.28 $\pm$ 12.80
DIGIT1	82.26 $\pm$ 3.95	82.23 $\pm$ 3.82	83.07 $\pm$ 4.24	82.08 $\pm$ 3.86	80.97 $\pm$ 3.71
USPS	75.46 $\pm$ 7.80	74.05 $\pm$ 9.72	70.75 $\pm$ 8.30	76.22 $\pm$ 6.98	58.90 $\pm$ 22.80
Average accuracy	75.49	74.72	74.86	75.57	71.43
# within 1% of highest	12/15	8/15	10/15	12/15	5/15

Table 7. Table S6: Classification AUC in % (mean ± sd) on test data over 20 repeated runs, with labeled training data size 25. The AUC is not affected by whether intercept adjustment is applied.

Homo Prop	RLR	ER	pSLR	dSLR	SVM
AUSTRA	89.39 $\pm$ 2.09	89.27 $\pm$ 2.26	88.91 $\pm$ 2.23	89.10 $\pm$ 2.23	88.78 $\pm$ 3.04
BCW	99.15 $\pm$ 0.40	98.97 $\pm$ 0.82	99.13 $\pm$ 0.43	99.19 $\pm$ 0.36	99.09 $\pm$ 0.87
GERMAN	61.80 $\pm$ 5.11	60.80 $\pm$ 5.14	59.92 $\pm$ 5.50	60.76 $\pm$ 5.52	48.83 $\pm$ 15.78
HEART	86.56 $\pm$ 3.57	86.48 $\pm$ 3.64	86.38 $\pm$ 3.66	86.55 $\pm$ 3.52	86.60 $\pm$ 4.60
IONO	78.52 $\pm$ 6.41	78.03 $\pm$ 6.92	75.29 $\pm$ 9.21	75.54 $\pm$ 9.25	69.83 $\pm$ 19.66
LIVER	58.47 $\pm$ 9.90	57.64 $\pm$ 10.62	59.06 $\pm$ 10.52	59.35 $\pm$ 10.42	55.93 $\pm$ 14.49
PIMA	69.94 $\pm$ 8.03	69.87 $\pm$ 8.38	69.33 $\pm$ 8.48	70.03 $\pm$ 8.09	58.31 $\pm$ 22.00
SPAM	90.96 $\pm$ 2.14	91.01 $\pm$ 2.77	90.31 $\pm$ 2.73	90.89 $\pm$ 2.18	89.99 $\pm$ 2.84
VEHICLE	79.24 $\pm$ 6.76	79.50 $\pm$ 6.94	81.45 $\pm$ 6.18	80.71 $\pm$ 6.46	80.60 $\pm$ 17.10
VOTES	96.88 $\pm$ 2.07	97.09 $\pm$ 1.51	96.96 $\pm$ 1.89	96.91 $\pm$ 2.03	97.03 $\pm$ 1.79
WDBC	97.98 $\pm$ 1.38	96.87 $\pm$ 4.01	97.87 $\pm$ 1.46	97.94 $\pm$ 1.43	97.49 $\pm$ 1.53
BCI	60.30 $\pm$ 5.44	57.76 $\pm$ 5.74	58.30 $\pm$ 6.41	60.75 $\pm$ 5.50	50.98 $\pm$ 10.58
COIL	64.29 $\pm$ 5.79	63.11 $\pm$ 5.74	62.40 $\pm$ 5.88	64.61 $\pm$ 5.45	60.35 $\pm$ 14.91
DIGIT1	90.85 $\pm$ 3.84	90.86 $\pm$ 3.87	91.26 $\pm$ 4.09	90.85 $\pm$ 3.83	90.16 $\pm$ 3.85
USPS	74.09 $\pm$ 6.85	73.46 $\pm$ 6.86	73.93 $\pm$ 6.44	74.05 $\pm$ 6.82	60.99 $\pm$ 23.77
Average AUC	79.89	79.38	79.37	79.82	75.66
# within 1% of highest	13/15	9/15	10/15	12/15	6/15

Equations

P(y=1|x)=\frac{\exp(\beta_{0}^{c}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)}{1+\exp(\beta_{0}^{c}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)},

\sum_{i=1}^{n}\Big[y_{i}(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})-\log\{1+\exp(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})\}\Big].

\frac{\text{d}P(x|y=1)}{\text{d}P(x|y=0)}=\frac{\text{d}G_{1}}{\text{d}G_{0}}=\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x),

\sum_{i=1}^{n}\Big[(1-y_{i})+y_{i}(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})+\log G_{0}(\{x_{i}\})\Big],

\mathcal{S}_{1}=\{x_{i}:y_{i}=0,\ i=1,\ldots,n\}\ \mbox{drawn from}\ P_{1}(x)=P(x|y=0),

\mathcal{S}_{2}=\{x_{i}:y_{i}=1,\ i=1,\ldots,n\}\ \mbox{drawn from}\ P_{2}(x)=P(x|y=1),

\mathcal{S}_{3}=\{x_{i}:i=n+1,\ldots,N\}\ \mbox{drawn from}\ P_{3}(x)=P^{u}(x).

\text{d}P_{1}(x)=\text{d}G_{0}(x),

\text{d}P_{2}(x)=\text{d}G_{1}(x),

\text{d}P_{3}(x)=(1-\rho)\,\text{d}G_{0}(x)+\rho\,\text{d}G_{1}(x),

\frac{\text{d}G_{1}}{\text{d}G_{0}}=\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x),

\text{d}P_{j}=(1-\rho_{j})\,\text{d}G_{0}+\rho_{j}\,\text{d}G_{1},\quad j=1,2,3,

l(\rho,\beta,G_{0})=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\Big[\log\{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})\}+\log G_{0}(\{x_{ji}\})\Big],

\kappa(\rho,\beta,\alpha)=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\log\left\{\frac{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}{1-\alpha+\alpha\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}\right\}-\log N,

1=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\frac{1}{1-\alpha+\alpha\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}.

\max_{(\rho,\beta)}\,\min_{\alpha}\,\kappa(\rho,\beta,\alpha).

P(z=1|x)=\frac{n_{1}}{N}\,\frac{1}{1-\tilde{\alpha}(\rho)+\tilde{\alpha}(\rho)\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)},

P(z=2|x)=\frac{n_{2}}{N}\,\frac{\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)}{1-\tilde{\alpha}(\rho)+\tilde{\alpha}(\rho)\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)},

P(z=3|x)=\frac{n_{3}}{N}\,\frac{1-\rho+\rho\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)}{1-\tilde{\alpha}(\rho)+\tilde{\alpha}(\rho)\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)},

\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\log\left\{\frac{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}{1-\tilde{\alpha}(\rho)+\tilde{\alpha}(\rho)\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}\right\}-\log N.

\kappa(\rho,h,\alpha)=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\log\left\{\frac{1-\rho_{j}+\rho_{j}\exp(h(x_{ji}))}{1-\alpha+\alpha\exp(h(x_{ji}))}\right\}-\log N.

\min_{\alpha\in(0,1)}\kappa^{*}(\rho,h,\alpha)\leq\kappa^{*}\{\rho,h,\tilde{\alpha}(\rho)\}\leq\kappa^{*}\{\rho^{*},h^{*},\tilde{\alpha}(\rho^{*})\},

E\left\{\mbox{KL}\Big(w(\cdot,x;\rho^{*},h^{*})\,\|\,w(\cdot,x;\rho,h)\Big)\right\},

\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}-\lambda\|\beta_{1}\|_{2}^{2}+\gamma(1-\rho^{0})(n_{3}/N)\log(1-\rho)+\gamma\rho^{0}(n_{3}/N)\log\rho.

\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}-\lambda\|\beta_{1}\|_{2}^{2}+\gamma(1-\rho^{0})(n_{3}/N)\log(1-\rho)+\gamma\rho^{0}(n_{3}/N)\log\rho.

Q^{(t)}(\rho,\beta,G_{0})

\quad+\text{E}^{(t)}u_{ji}\log\{\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})G_{0}(\{x_{ji}\})\}\Big]+\mbox{pen}(\rho,\beta),

\kappa_{Q}^{(t)}(\rho,\beta,\alpha)

\quad+\text{E}^{(t)}u_{ji}\log\left\{\frac{\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}{1-\alpha+\alpha\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}\right\}\Big]-\log N+\mbox{pen}(\rho,\beta).

\rho^{(t+1)}=\frac{n_{3}^{-1}\sum_{i=1}^{n_{3}}\text{E}^{(t)}u_{ji}+\gamma\rho^{0}}{1+\gamma},\quad\alpha^{(t+1)}=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\text{E}^{(t)}u_{ji}.

\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\Big[\text{E}^{(t)}u_{ji}(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})-\log\{1-\alpha^{(t+1)}+\alpha^{(t+1)}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})\}\Big]-\lambda\|\beta_{1}\|_{2}^{2},

\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\Big[\text{E}^{(t)}u_{ji}(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})-\log\{1+\exp(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})\}\Big].

u_{i}\,|\,(z_{i}=j,x_{i})\sim\mbox{Bernoulli}\left\{\frac{\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})}{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})}\right\}.

\tilde{Q}^{(t)}(\rho,\beta)=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}

Taxonomy

Topics: Bayesian Methods and Mixture Models · Genetic and phenotypic traits in livestock · Statistical Methods and Inference

Methods: Logistic Regression

Full text

Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models

Xinwei Zhang & Zhiqiang Tan

Department of Statistics, Rutgers University, USA

1 Introduction

Semi-supervised learning for classification involves exploiting a large amount of unlabeled data and a relatively small amount of labeled data to build better classifiers. This approach can potentially be used to achieve higher accuracy, with a limited budget for obtaining labeled data. Various methods have been proposed, including expectation-maximization (EM) algorithms, transductive support vector machines (SVMs), and regularized methods (e.g., Chapelle et al. 2006; Zhu 2008).

For supervised classification, there is a range of objective functions which are Fisher consistent in the following sense: optimization of the population, nonparametric version of a loss function leads to the true conditional probability function of labels given features, as for the logistic loss, or to the Bayes classifier, as for the hinge loss (Lin 2002; Bartlett et al. 2006). In contrast, a perplexing issue we notice for semi-supervised classification is that existing objective functions are in general not Fisher consistent, except in the degenerate case where unlabeled data are ignored and only labeled data are used. Examples include the objective functions in transductive SVMs (Vapnik 1998; Joachims 1999) and various regularized methods (Grandvalet & Bengio 2005; Mann & McCallum 2007; Krishnapuram et al. 2005). The lack of Fisher consistency may contribute to the unstable performance of existing semi-supervised classifiers (e.g., Li & Zhou 2015). Another restriction in existing methods is that the class proportions in labeled and unlabeled data are typically assumed to be the same.

We develop a semi-supervised extension of logistic regression based on exponential tilt mixture models (Qin 1999; Zou et al. 2002; Tan 2009), without restricting the class proportions in the unlabeled data to be the same as in the labeled data. The development is motivated by a statistical equivalence between logistic regression for the conditional probability of a label given features and exponential tilt modeling for the density ratio between the feature distributions within different labels (Anderson 1972; Prentice & Pyke 1979). Our work involves two main contributions: (i) we derive novel objective functions which are shown not only to be Fisher consistent but also to lead to asymptotically more efficient estimation than that based on labeled data only, and (ii) we propose regularized estimation and construct computationally and conceptually desirable EM algorithms. In numerical experiments, our methods achieve a substantial advantage over existing methods when the class proportions in unlabeled data differ from those in labeled data. A possible explanation is that while the class proportions in unlabeled data are estimated as unknown parameters in our methods, they are implicitly assumed to be the same as in labeled data for existing methods including transductive SVMs (Joachims 1999) and entropy regularization (Grandvalet & Bengio 2005).

A simple, informative example is provided in the Supplement (Section II) to highlight the comparison between the new and existing methods mentioned above.

2 Background: logistic regression and exponential tilt model

For supervised classification, the training data consist of a sample $\{(y_{i},x_{i}):i=1,\ldots,n\}$ of $(y,x)$, where $x\in\mathbb{R}^{p}$ and $y\in\{0,1\}$ represent a feature vector and an associated label respectively. Consider a logistic regression model

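$$P(y=1|x)=\frac{\exp(\beta_{0}^{c}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)}{1+\exp(\beta_{0}^{c}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)},\qquad(1)$$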

where $\beta_{1}$ is a coefficient vector associated with $x$ , and $\beta_{0}^{c}$ is an intercept, with superscript c indicating classification or conditional probability of $y=1$ given $x$ . The maximum likelihood estimator (MLE) $(\tilde{\beta}_{0}^{c},\tilde{\beta}_{1})$ is defined as a maximizer of the log (conditional) likelihood:

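$$\sum_{i=1}^{n}\Big[y_{i}(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})-\log\{1+\exp(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})\}\Big].\qquad(2)$$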

In general, nonlinear functions of $x$ can be used in place of $\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ , and a penalty term can be incorporated into the log-likelihood such as the ridge penalty $\|\beta_{1}\|_{2}^{2}$ or the squared norm of a reproducing kernel Hilbert space of functions of $x$ . We discuss these issues later in Sections 3.3 and 6.

Interestingly, logistic regression on $P(y|x)$ can be made equivalent to an exponential tilt model on $P(x|y)$ (Anderson 1972; Prentice & Pyke 1979; Qin 1998). Denote by $G_{0}$ and $G_{1}$ the conditional distributions $P(x|y=0)$ and $P(x|y=1)$ respectively, and let $\pi=P(y=1)$. By Bayes' rule, model (1) is equivalent to the exponential tilt model

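$$\frac{\text{d}P(x|y=1)}{\text{d}P(x|y=0)}=\frac{\text{d}G_{1}}{\text{d}G_{0}}=\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x),\qquad(3)$$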

where $\text{d}G_{1}/\text{d}G_{0}$ denotes the density ratio between $G_{1}$ and $G_{0}$ with respect to a dominating measure, and $\beta_{0}=\beta_{0}^{c}+\log\{(1-\pi)/\pi\}$. Model (3) is explicitly a semi-parametric model, where $G_{0}$ is an infinite-dimensional parameter and $(\beta_{0},\beta_{1})$ are finite-dimensional parameters. In fact, logistic model (1) is also semi-parametric, where the marginal distribution of $x$ is an infinite-dimensional parameter, and $(\beta_{0}^{c},\beta_{1})$ are finite-dimensional parameters. Furthermore, the MLE $(\tilde{\beta}_{0}^{c},\tilde{\beta}_{1})$ in model (1) can be related to the following estimator $(\hat{\beta}_{0},\hat{\beta}_{1})$ in model (3) by the method of nonparametric likelihood (Kiefer & Wolfowitz 1956) or empirical likelihood (Owen 2001). Formally, $(\hat{\beta}_{0},\hat{\beta}_{1},\hat{G}_{0})$ are defined as a maximizer of the log-likelihood,

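$$\sum_{i=1}^{n}\Big[(1-y_{i})+y_{i}(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{i})+\log G_{0}(\{x_{i}\})\Big],$$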

over all possible $(\beta_{0},\beta_{1},G_{0})$ such that $G_{0}$ is a probability measure supported on the pooled data $\{x_{i}:i=1,\ldots,n\}$ with $\int\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)\text{d}G_{0}=1$ . Analytically, it can be shown that $\tilde{\beta}_{1}=\hat{\beta}_{1}$ , $\tilde{\beta}_{0}^{c}=\hat{\beta}_{0}+\log\{\hat{\pi}/(1-\hat{\pi})\}$ , where $\hat{\pi}=\sum_{i=1}^{n}y_{i}/n$ . See Qin (1998) and references therein.
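The intercept relationship between the two parameterizations can be checked numerically. The following is a minimal sketch, assuming a scalar feature and hypothetical parameter values chosen only for illustration; the function names `tilt_intercept`, `logistic_intercept`, and `posterior` are not from the paper. It verifies that the posterior odds divided by the prior odds equal the exponential tilt $\exp(\beta_{0}+\beta_{1}x)$ with $\beta_{0}=\beta_{0}^{c}+\log\{(1-\pi)/\pi\}$.

```python
import math

def tilt_intercept(beta0_c, pi):
    """Exponential tilt intercept: beta_0 = beta_0^c + log{(1 - pi) / pi}."""
    return beta0_c + math.log((1.0 - pi) / pi)

def logistic_intercept(beta0, pi):
    """Inverse map: beta_0^c = beta_0 + log{pi / (1 - pi)}."""
    return beta0 + math.log(pi / (1.0 - pi))

def posterior(beta0_c, beta1, x):
    """P(y = 1 | x) under logistic model (1), scalar feature for simplicity."""
    return 1.0 / (1.0 + math.exp(-(beta0_c + beta1 * x)))

# Hypothetical values, assumed only for this illustration.
beta0_c, beta1, pi, x = -0.3, 1.2, 0.7, 0.5
beta0 = tilt_intercept(beta0_c, pi)

# Bayes' rule: density ratio dG1/dG0 = posterior odds / prior odds.
p = posterior(beta0_c, beta1, x)
ratio_from_bayes = (p / (1.0 - p)) * ((1.0 - pi) / pi)
ratio_from_tilt = math.exp(beta0 + beta1 * x)
```

The two ratios agree up to floating-point error, and applying `logistic_intercept` to `beta0` recovers `beta0_c`.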

By the foregoing discussion, we see that there are two statistically distinct but equivalent approaches for supervised classification: logistic regression or exponential tilt models. It is such a relationship that we aim to exploit in developing a new method for semi-supervised classification.

3 Theory and methods

For semi-supervised classification, the training data consist of a labeled sample $\mathcal{S}^{\ell}=\{(y_{i},x_{i}):i=1,\ldots,n\}$ and an unlabeled sample $\mathcal{S}^{u}=\{x_{i}:i=n+1,\ldots,N\}$ , for which the associated labels $\{y_{i}:i=n+1,\ldots,N\}$ are unobserved. Typically for existing methods including transductive SVMs, the two samples $\mathcal{S}^{\ell}$ and $\mathcal{S}^{u}$ are assumed to be from a common population of $(y,x)$ . However, we allow that $\mathcal{S}^{\ell}$ and $\mathcal{S}^{u}$ may be drawn from different populations, with the same conditional distribution $P(x|y)$ , but possibly different marginal probabilities $P^{\ell}(y=1)$ and $P^{u}(y=1)$ .

3.1 Exponential tilt mixture model

Although it seems difficult at first glance to extend logistic model (1) to semi-supervised learning, both the labeled sample $\mathcal{S}^{\ell}$ and the unlabeled sample $\mathcal{S}^{u}$ can be accounted for by a natural extension of the exponential tilt model (3), called an exponential tilt mixture (ETM) model (Qin 1999; Zou et al. 2002; Tan 2009). Denote

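$$\mathcal{S}_{1}=\{x_{i}:y_{i}=0,\ i=1,\ldots,n\}\ \mbox{drawn from}\ P_{1}(x)=P(x|y=0),$$
$$\mathcal{S}_{2}=\{x_{i}:y_{i}=1,\ i=1,\ldots,n\}\ \mbox{drawn from}\ P_{2}(x)=P(x|y=1),$$
$$\mathcal{S}_{3}=\{x_{i}:i=n+1,\ldots,N\}\ \mbox{drawn from}\ P_{3}(x)=P^{u}(x).$$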

An exponential tilt mixture model for the three samples $(\mathcal{S}_{1},\mathcal{S}_{2},\mathcal{S}_{3})$ postulates that

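$$\text{d}P_{1}(x)=\text{d}G_{0}(x),\qquad(5)$$
$$\text{d}P_{2}(x)=\text{d}G_{1}(x),\qquad(6)$$
$$\text{d}P_{3}(x)=(1-\rho)\,\text{d}G_{0}(x)+\rho\,\text{d}G_{1}(x),\qquad(7)$$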

where $G_{0}$ or $G_{1}$ represents the conditional distribution of $x$ given $y=0$ or $y=1$ respectively in both the labeled and unlabeled data such that

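$$\frac{\text{d}G_{1}}{\text{d}G_{0}}=\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x),\qquad(8)$$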

and $\rho=P^{u}(y=1)$ is the proportion of $y=1$ underlying the unlabeled data. While Eqs (5)–(6) merely give definitions of $G_{0}$ and $G_{1}$, Eq (7) says that the feature distribution in the unlabeled sample is a mixture of $G_{0}$ and $G_{1}$, which follows from the structural assumption that the conditional distribution $P(x|y)$ is invariant between the labeled and unlabeled samples. Eq (8) imposes a functional restriction on the density ratio between $G_{0}$ and $G_{1}$, as in (3).
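As a concrete instance of the ETM structure, two unit-variance normal distributions satisfy the exponential tilt relation exactly. The sketch below is an illustration under that assumption, not an example from the paper: the means `MU0` and `MU1`, the sample sizes, and the helper `draw_samples` are all hypothetical. It draws $\mathcal{S}_{1}$ from $G_{0}$, $\mathcal{S}_{2}$ from $G_{1}$, and $\mathcal{S}_{3}$ from the mixture with proportion $\rho$, which may differ from the labeled class proportion.

```python
import math
import random

def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

# Hypothetical component means (assumptions for this sketch).
MU0, MU1 = 0.0, 2.0
# For unit-variance normals, dG1/dG0 = exp(BETA0 + BETA1 * x) holds with:
BETA1 = MU1 - MU0
BETA0 = -(MU1 ** 2 - MU0 ** 2) / 2.0

def draw_samples(n1, n2, n3, rho, seed=0):
    """Draw S1 from G0, S2 from G1, and S3 from (1 - rho) G0 + rho G1."""
    rng = random.Random(seed)
    s1 = [rng.gauss(MU0, 1.0) for _ in range(n1)]
    s2 = [rng.gauss(MU1, 1.0) for _ in range(n2)]
    s3 = [rng.gauss(MU1 if rng.random() < rho else MU0, 1.0) for _ in range(n3)]
    return s1, s2, s3

# Labeled data are balanced here, while the unlabeled data have rho = 0.8.
s1, s2, s3 = draw_samples(n1=50, n2=50, n3=400, rho=0.8)
```

In this setup the labeled class proportion is 0.5 while $\rho=0.8$, which mirrors the situation where the two samples share $P(x|y)$ but not the marginal probability of $y=1$.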

The ETM model, defined by (5)–(8), is a semi-parametric model, with an infinite-dimensional parameter $G_{0}$ and finite-dimensional parameters $\rho$ and $\beta=(\beta_{0},\beta_{1}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$. We briefly summarize maximum nonparametric likelihood estimation as previously studied (Qin 1999; Zou et al. 2002; Tan 2009). For notational convenience, rewrite the sample $\mathcal{S}_{j}$ as $\{x_{ji}:i=1,\ldots,n_{j}\}$, where $n_{1}=n-n_{2}$, $n_{2}=\sum_{i=1}^{n}y_{i}$, and $n_{3}=N-n$. Eqs (5)–(7) can be expressed as

$$\text{d}P_{j}=(1-\rho_{j})\,\text{d}G_{0}+\rho_{j}\,\text{d}G_{1},\quad j=1,2,3,$$

where $\rho_{1}=0$ , $\rho_{2}=1$ , and $\rho_{3}=\rho$ . For any fixed $(\rho,\beta)$ , the average profile log-likelihood of $(\rho,\beta)$ is defined as $\text{pl}(\rho,\beta)=\max_{G_{0}}l(\rho,\beta,G_{0})$ with

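$$l(\rho,\beta,G_{0})=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\Big[\log\{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})\}+\log G_{0}(\{x_{ji}\})\Big],$$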

over all possible $G_{0}$ which is a probability measure supported on the pooled data $\{x_{ji}:i=1,\ldots,n_{j},j=1,2,3\}$ with $\int\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)\text{d}G_{0}=1$ . Denote

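$$\kappa(\rho,\beta,\alpha)=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\log\left\{\frac{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}{1-\alpha+\alpha\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}\right\}-\log N,$$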

which can be easily shown to be concave in $\rho\in(0,1)$ and convex in $\alpha\in(0,1)$ . Then Proposition 1 in Tan (2009) leads to the following result.

Lemma 1.

The average profile log-likelihood of $(\rho,\beta)$ can be determined as $\text{pl}(\rho,\beta)=\min_{\alpha\in(0,1)}\kappa(\rho,\beta,\alpha)=\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ , where $\hat{\alpha}(\beta)$ is a minimizer of $\kappa(\rho,\beta,\alpha)$ over $\alpha$ , satisfying the stationary condition (free of $\rho$ )

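$$1=\frac{1}{N}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\frac{1}{1-\alpha+\alpha\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})}.$$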
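The quantities in Lemma 1 can be computed directly. The sketch below is one possible illustration, assuming a scalar feature and a hypothetical toy data set; the names `tilt`, `kappa`, and `alpha_hat` are not from the paper, and bisection is a choice made here rather than the authors' method. It evaluates $\kappa(\rho,\beta,\alpha)$ and finds $\hat{\alpha}(\beta)$ as the root of minus the derivative of $\kappa$ in $\alpha$, which for $\alpha>0$ is equivalent to requiring that the average of $1/\{1-\alpha+\alpha\exp(\beta_{0}+\beta_{1}x)\}$ over the pooled data equals 1.

```python
import math

def tilt(beta0, beta1, x):
    """exp(beta0 + beta1 * x) for a scalar feature x (p = 1 for simplicity)."""
    return math.exp(beta0 + beta1 * x)

def kappa(rho, beta0, beta1, alpha, samples):
    """kappa(rho, beta, alpha), with samples = (S1, S2, S3) as lists of floats."""
    rhos = (0.0, 1.0, rho)  # rho_1 = 0, rho_2 = 1, rho_3 = rho
    n_total = sum(len(s) for s in samples)
    total = 0.0
    for rho_j, sample in zip(rhos, samples):
        for x in sample:
            e = tilt(beta0, beta1, x)
            total += math.log((1.0 - rho_j + rho_j * e) / (1.0 - alpha + alpha * e))
    return total / n_total - math.log(n_total)

def alpha_hat(beta0, beta1, samples, tol=1e-12):
    """Solve the stationary condition for alpha by bisection on (0, 1)."""
    tilts = [tilt(beta0, beta1, x) for s in samples for x in s]

    def score(alpha):
        # Minus the derivative of kappa in alpha; decreasing in alpha.
        return sum((e - 1.0) / (1.0 - alpha + alpha * e) for e in tilts) / len(tilts)

    if not (score(0.0) > 0.0 > score(1.0)):
        raise ValueError("no interior solution in (0, 1) for this beta")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical toy data, symmetric about zero.
samples = ([-1.0, -0.5], [0.5, 1.0], [-2.0, 2.0])
a_hat = alpha_hat(0.0, 1.0, samples)
```

Because $\kappa$ is convex in $\alpha$, the value `kappa(rho, 0.0, 1.0, a_hat, samples)` is no larger than `kappa` evaluated at any other $\alpha\in(0,1)$ for the same $(\rho,\beta)$, and the solution does not depend on $\rho$.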

The maximum likelihood estimator of $(\rho,\beta)$ is then defined by maximizing the profile log-likelihood, that is, $(\hat{\rho},\hat{\beta})=\mbox{argmax}_{(\rho,\beta)}\text{pl}(\rho,\beta)$ . From Lemma 1, we notice that the estimators $\{\hat{\rho},\hat{\beta},\hat{\alpha}(\hat{\beta})\}$ jointly solve the saddle-point problem:

[TABLE]

Large sample theory of $(\hat{\rho},\hat{\beta})$ has been studied in Qin (1999) under standard regularity conditions as $N\to\infty$ and $n_{j}/N\to\eta_{j}$ with some constant $\eta_{j}>0$ for $j=1,2,3$ . The theory shows the existence of a local maximizer of $\text{pl}(\rho,\beta)$ , which is consistent and asymptotically normal provided the ETM model (5)–(8) is correctly specified. However, there remain subtle questions. It seems unclear whether the population version of the average profile log-likelihood $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ attains a global maximum at the true values of $(\rho,\beta)$ under a correctly specified ETM model. Moreover, what property can be deduced for $(\hat{\rho},\hat{\beta})$ under a misspecified ETM model?

3.2 Semi-supervised logistic regression

We derive a new classification model with parameters $(\rho,\beta)$ for the three samples $(\mathcal{S}_{1},\mathcal{S}_{2},\mathcal{S}_{3})$ such that an MLE of $(\rho,\beta)$ in the new model coincides with an MLE $(\hat{\rho},\hat{\beta})$ in the ETM model, and vice versa. Let $z_{i}=1+y_{i}$ if $i=1,\ldots,n$ and $z_{i}=3$ if $i=n+1,\ldots,N$ . Consider a conditional probability model for predicting the label $z_{i}$ from $x_{i}$ :

[TABLE]

where $\tilde{\alpha}(\rho)=\sum_{j=1}^{3}(n_{j}/N)\rho_{j}=(n_{2}+n_{3}\rho)/N$ , which ensures that $\sum_{j=1}^{3}P(z=j|x)\equiv 1$ . The model, defined by (12)–(14), will be called a semi-supervised logistic regression (SLR) model. The average log-likelihood function of $(\rho,\beta)$ with the data $\{(z_{i},x_{i}):i=1,\ldots,N\}$ in model (12)–(14) can be written, up to an additive constant free of $(\rho,\beta)$ , as

[TABLE]
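As a minimal sketch of the SLR model, the class probabilities can be coded as below. The displays (12)–(14) are not reproduced in this text, so the form $P(z=j|x)=(n_{j}/N)\{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)\}/\{1-\tilde{\alpha}(\rho)+\tilde{\alpha}(\rho)\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)\}$ is an assumption, chosen because it is the form for which $\tilde{\alpha}(\rho)$ makes the three probabilities sum to one. The function names `slr_probs` and `slr_avg_loglik` are hypothetical.

```python
import numpy as np

def slr_probs(x, beta0, beta1, rho, n1, n2, n3):
    """P(z=j|x) for j=1,2,3 under the ASSUMED SLR form described in the lead-in."""
    N = n1 + n2 + n3
    rho_j = np.array([0.0, 1.0, rho])            # rho_1 = 0, rho_2 = 1, rho_3 = rho
    w_j = np.array([n1, n2, n3], dtype=float) / N
    alpha = np.sum(w_j * rho_j)                  # alpha_tilde(rho) = (n2 + n3*rho)/N
    tilt = np.exp(beta0 + np.atleast_2d(x) @ beta1)   # exp(beta0 + beta1^T x), shape (m,)
    num = w_j[None, :] * (1.0 - rho_j[None, :] + rho_j[None, :] * tilt[:, None])
    den = 1.0 - alpha + alpha * tilt
    return num / den[:, None]                    # shape (m, 3), rows sum to 1

def slr_avg_loglik(x, z, beta0, beta1, rho, n1, n2, n3):
    """Average log-likelihood of labels z in {1,2,3}; additive constants are not dropped."""
    P = slr_probs(x, beta0, beta1, rho, n1, n2, n3)
    return np.mean(np.log(P[np.arange(len(z)), np.asarray(z) - 1]))
```

With $\rho$ set to $n_{2}/n$, the third column returned by `slr_probs` is the constant $n_{3}/N$, whatever the value of $x$.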

Proposition 1.

A pair $(\hat{\rho},\hat{\beta})$ is a local (or respectively global) maximizer of the average log-likelihood $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ in SLR model (12)–(14) if and only if it is a local (or respectively global) maximizer of the average profile log-likelihood $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ in ETM model (5)–(8).

Proposition 1 shows an equivalence between maximum nonparametric likelihood estimation in ETM model (5)–(8) and usual maximum likelihood estimation in SLR model (12)–(14), even though the objective functions $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ and $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ are not equivalent. This differs from the equivalence between logistic regression (1) and exponential tilt model (3) with labeled data only, where the log-likelihood (2) and the profile log-likelihood from (4) are equivalent (Prentice & Pyke 1979). From another angle, this result says that saddle-point problem (11) can be equivalently solved by directly maximizing $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ . This transformation is nontrivial, because a saddle-point problem in general cannot be converted into optimization with a closed-form objective.

By the identification of $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ as a usual log-likelihood function, we show that the objective functions $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ and $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ , with the linear predictor $\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ replaced by an arbitrary function $h(x)$ , are Fisher consistent nonparametrically, i.e., maximization of their population versions leads to the true values. This seems to be the first time Fisher consistency of a loss function is established for semi-supervised classification. By some abuse of notation, denote

[TABLE]

Proposition 2.

Suppose that $\mathcal{S}_{j}=\{x_{ji}:i=1,\ldots,n_{j}\}$ is drawn from $P_{j}$ in (5)–(7) for $j=1,2,3$ , with $\rho=\rho^{*}$ and $\text{d}G_{1}/\text{d}G_{0}=\exp(h^{*}(x))$ for some fixed value $\rho^{*}\in(0,1)$ and function $h^{*}(x)$ . Denote $\kappa^{*}(\rho,h,\alpha)=E\{\kappa(\rho,h,\alpha)\}$ . For any $\rho\in(0,1)$ and function $h(x)$ , we have

[TABLE]

where both equalities hold if $\rho=\rho^{*}$ and $h=h^{*}$ . Hence the population objective functions $\kappa^{*}\{\rho,h,\tilde{\alpha}(\rho)\}$ and $\min_{\alpha\in(0,1)}\kappa^{*}(\rho,h,\alpha)$ are maximized at the true value $\rho^{*}$ and function $h^{*}(x)$ .

Proposition 2 fills existing gaps in understanding maximum likelihood estimation in ETM model (5)–(8), through its equivalence with that in SLR model (12)–(14). If the ETM model is correctly specified, then the population version of $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ has a global maximum at the true values of $(\rho,\beta)$ , and hence a global maximizer $(\hat{\rho},\hat{\beta})$ is consistent under suitable regularity conditions. If the ETM model is misspecified, then by theory of estimation with misspecified models (Manski 1988; White 1982), the MLE $(\hat{\rho},\hat{\beta})$ converges in probability to a limit value which minimizes the difference between $\kappa^{*}\{\rho,h,\tilde{\alpha}(\rho)\}$ with $h=\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ and $\kappa^{*}\{\rho^{*},h^{*},\tilde{\alpha}(\rho^{*})\}$ . This difference as shown in the Supplement (Section IV.2) is the expected Kullback–Leibler divergence

[TABLE]

where $w(j,x;\rho,h)$ is the conditional probability (12)–(14) for $j=1,2,3$ , $\mbox{KL}(q^{*}\|q)$ is the Kullback–Leibler divergence between two probability vectors $(q^{*}_{j})$ and $(q_{j})$ , and $E(\cdot)$ denotes the expectation with respect to $(1-\tilde{\alpha}(\rho^{*}))\text{d}G_{0}+\tilde{\alpha}(\rho^{*})\text{d}G_{1}$ .

Finally, we point out another interesting property of SLR model (12)–(14). If $\rho$ is fixed as $\rho^{\ell}=n_{2}/n$ , the proportion of $y=1$ in the labeled sample, then $\tilde{\alpha}(\rho^{\ell})=(n_{2}+n_{3}\rho^{\ell})/N=n_{2}/n$ . In this case, the conditional probability (14) reduces to a constant, and the objective function $\kappa\{\rho^{\ell},\beta,\tilde{\alpha}(\rho^{\ell})\}$ can be easily shown to be equivalent to the profile log-likelihood of $\beta$ derived from (4) in the exponential tilt model based on the labeled data only or equivalently the log-likelihood of $(\beta_{0}^{c},\beta_{1})$ as (2) from logistic regression based on the labeled data only, after the intercept shift $\beta_{0}^{c}=\beta_{0}+\log(n_{2}/n_{1})$ . We show that the MLE $\hat{\beta}$ from ETM model (5)–(8) or equivalently SLR model (12)–(14) is asymptotically more efficient than that from logistic regression based on the labeled data only.
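The identity $\tilde{\alpha}(\rho^{\ell})=n_{2}/n$ used above follows in one line from $N=n+n_{3}$; the step is spelled out here for convenience:

$$\tilde{\alpha}(\rho^{\ell})=\frac{n_{2}+n_{3}(n_{2}/n)}{N}=\frac{n_{2}(n+n_{3})}{nN}=\frac{n_{2}}{n}=\rho^{\ell}.$$

For instance, with $n_{1}=20$ , $n_{2}=80$ , and $n_{3}=1000$ (the sample sizes of the illustration in the Supplement), $\rho^{\ell}=0.8$ and $\tilde{\alpha}(\rho^{\ell})=(80+800)/1100=0.8$ .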

Proposition 3.

Denote by $\hat{\beta}^{\ell}$ the estimator of $\beta$ obtained by maximizing $\kappa\{\rho^{\ell},\beta,\tilde{\alpha}(\rho^{\ell})\}$ or equivalently by logistic regression based on the labeled data only. Then the asymptotic variance matrix of the MLE $\hat{\beta}$ from ETM model (5)–(8) is no greater (in the usual order on positive-definite matrices) than that of $\hat{\beta}^{\ell}$ under standard regularity conditions.

3.3 Regularized estimation and EM algorithm

The results in Section 3.2 provide theoretical support for the use of the objective functions $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ and $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ . In real applications, the MLE $(\hat{\rho},\hat{\beta})$ may not behave as satisfactorily as predicted by standard asymptotic theory, for various reasons. The labeled sample size may not be sufficiently large. The dimension of the feature vector or the complexity of functions of features may be too high, compared with the labeled and unlabeled data sizes. Therefore, we propose regularized estimation by adding suitable penalties to the objective functions.

For the coefficient vector $\beta_{1}$ , we employ a ridge penalty $\lambda\|\beta_{1}\|_{2}^{2}$ , although alternative penalties can also be allowed including a Lasso penalty. For the mixture proportion $\rho$ , we use a penalty in the form of the log density of a Beta distribution, $\tau_{1}\log(1-\rho)+\tau_{2}\log\rho$ , where $\tau_{1}=\gamma(1-\rho^{0})n_{3}/N$ and $\tau_{2}=\gamma\rho^{0}n_{3}/N$ for a “center” $\rho^{0}\in(0,1)$ and a “scale” $\gamma\geq 0$ . This choice is motivated by conceptual and computational simplicity in the EM algorithm to be discussed. Combining these penalties with $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ gives the following penalized objective function

[TABLE]

Similarly, the penalized objective function based on $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ is

[TABLE]

Maximization of (15) or (16) will be called profile or direct SLR respectively. The two methods in general lead to different estimates of $(\rho,\beta)$ when $\gamma>0$ , although they can be shown to be equivalent similarly as in Proposition 1 when $\gamma=0$ . In fact, as $\gamma\to\infty$ (i.e., $\rho$ is fixed as $\rho^{0}$ ), the estimator of $\beta$ from profile SLR is known to be asymptotically more efficient than that from direct SLR (Tan 2009).

We construct EM algorithms (Dempster et al. 1977) to numerically maximize (15) and (16). Of particular interest is that these algorithms shed light on the effect of the regularization introduced. Various other optimization techniques can also be exploited, because $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ is directly of a closed form, and $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ is defined only after univariate minimization in $\alpha$ .

We describe some details about the EM algorithm for profile SLR. See the Supplement (Section III) for the corresponding algorithm for direct SLR. We return to the nonparametric log-likelihood (9) and introduce the following data augmentation. For $j=1,2,3$ , let $u_{ji}\sim\mbox{Bernoulli}\,(\rho_{j})$ such that $(x_{ji}|u_{ji}=0)\sim G_{0}$ and $(x_{ji}|u_{ji}=1)\sim G_{1}$ . Recall that $\rho_{1}=0$ and $\rho_{2}=1$ , and hence $u_{1i}=0$ and $u_{2i}=1$ are fixed. Denote the penalty term in (15) or (16) as $\mbox{pen}(\rho,\beta)$ .

E-step. The expectation of the augmented objective given the current estimates $(\rho^{(t)},\beta^{(t)})$ is

[TABLE]

where $\text{E}^{(t)}u_{ji}=\rho^{(t)}_{j}\exp(\beta^{(t)}_{0}+\beta_{1}^{(t){\mathrm{\scriptscriptstyle T}}}x_{ji})/\{1-\rho^{(t)}_{j}+\rho^{(t)}_{j}\exp(\beta^{(t)}_{0}+\beta_{1}^{(t){\mathrm{\scriptscriptstyle T}}}x_{ji})\}$ .
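The imputed probabilities in the E-step can be computed directly from the formula above. The following is a minimal sketch; the function name `e_step` and the convention of passing the three samples as a list of arrays are assumptions made for illustration.

```python
import numpy as np

def e_step(X_list, rho, beta0, beta1):
    """Return [E u_1i, E u_2i, E u_3i] given current estimates (rho, beta0, beta1).

    X_list holds the feature matrices of S_1, S_2, S_3, each of shape (n_j, d).
    """
    rho_j = [0.0, 1.0, rho]                      # rho_1 = 0, rho_2 = 1, rho_3 = rho
    out = []
    for X, r in zip(X_list, rho_j):
        tilt = np.exp(beta0 + X @ beta1)         # exp(beta0 + beta1^T x_ji)
        out.append(r * tilt / (1.0 - r + r * tilt))
    return out
```

For the labeled samples the formula returns exactly 0 and 1, so only the unlabeled sample receives fractional pseudo responses.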

M-step. The next estimates $(\rho^{(t+1)},\beta^{(t+1)})$ are obtained as a maximizer of the expected objective (17) with $G_{0}$ profiled out, that is, $\text{pQ}^{(t)}(\rho,\beta)=\max_{G_{0}}Q^{(t)}(\rho,\beta,G_{0})$ over all possible $G_{0}$ which is a probability measure supported on the pooled data $\{x_{ji}:i=1,\ldots,n_{j},j=1,2,3\}$ with $\int\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x)\text{d}G_{0}=1$ . In correspondence to $\kappa(\rho,\beta,\alpha)$ , denote

[TABLE]

Instead of maximizing $\text{pQ}^{(t)}(\rho,\beta)$ directly, we find a simple scheme for computing $(\rho^{(t+1)},\beta^{(t+1)})$ .

Proposition 4.

Let

[TABLE]

The pair $(\rho^{(t+1)},\beta^{(t+1)})$ is a local (or respectively global) maximizer of $\text{pQ}^{(t)}(\rho,\beta)$ if and only if $\beta^{(t+1)}$ is a local (or respectively global) maximizer of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})$ .

Proposition 4 is useful both computationally and conceptually. First, $\rho^{(t+1)}$ is of a closed form, as a weighted average, with the weight depending on the scale $\gamma$ , between the prior center $\rho^{0}$ and the empirical estimate $n_{3}^{-1}\sum_{i=1}^{n_{3}}\text{E}^{(t)}u_{3i}$ , which would be obtained with $\gamma\to\infty$ or $\gamma=0$ respectively. Moreover, $\beta^{(t+1)}$ can be equivalently computed by maximizing the objective function

[TABLE]

which is concave in $\beta$ and of a similar form to the log-likelihood (2) with a ridge penalty for logistic regression. Each imputed probability $\text{E}^{(t)}u_{ji}$ serves as a pseudo response.
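The two parts of the M-step can be sketched as follows, under stated assumptions. The displays for $\rho^{(t+1)}$ and for objective (19) are not reproduced in this text. The update `update_rho` uses the weights $1/(1+\gamma)$ and $\gamma/(1+\gamma)$, which is what results from differentiating the complete-data Bernoulli terms plus the Beta-type penalty with $\tau_{1},\tau_{2}$ as defined above. The function `fit_soft_logistic` assumes an objective of the same form as a ridge-penalized logistic log-likelihood with fractional responses, and does not handle the intercept shift between $\beta_{0}$ and $\beta_{0}^{c}$. Both function names are hypothetical.

```python
import numpy as np

def update_rho(Eu3, rho0, gamma):
    """Weighted average of the prior center rho0 and the mean imputed probability."""
    return (np.mean(Eu3) + gamma * rho0) / (1.0 + gamma)

def fit_soft_logistic(X, u, lam, n_iter=50):
    """Newton's method for ridge logistic regression with pseudo responses u in [0, 1].

    Maximizes mean[u*(b0 + b1^T x) - log(1 + exp(b0 + b1^T x))] - lam*||b1||^2;
    the intercept is not penalized.
    """
    N, d = X.shape
    Z = np.hstack([np.ones((N, 1)), X])
    D = np.diag([0.0] + [1.0] * d)               # penalize slopes only
    b = np.zeros(d + 1)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        grad = Z.T @ (u - p) / N - 2.0 * lam * (D @ b)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z / N + 2.0 * lam * D
        step = np.linalg.solve(hess, grad)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b[0], b[1:]
```

Setting `gamma=0` in `update_rho` returns the empirical mean of the imputed probabilities, while a large `gamma` pulls the update toward `rho0`, matching the two extremes described above.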

In our implementation, the prior center $\rho^{0}$ is fixed as $\rho^{\ell}=n_{2}/n$ , the proportion of $y=1$ in the labeled sample, and the scales $(\lambda,\gamma)$ are treated as tuning parameters, to be selected by cross validation. Numerically, this procedure allows an adaptive interpolation between the two extremes: a fixed choice $\rho^{\ell}$ or an empirical estimate by maximum likelihood. For direct SLR (but not profile SLR), our adaptive procedure reduces to and hence accommodates logistic regression with labeled data only at one extreme with $\gamma\to\infty$ . See the Supplement (Section III) for further discussion.

4 Related work

There is a vast literature on semi-supervised learning. See, for example, Chapelle et al. (2006) and Zhu (2008). Owing to space limitations, we discuss only work directly related to ours.

Generative models and EM. A generative model can be postulated for $(y,x)$ jointly such that $p(y,x;\rho,\theta)=p(y;\rho)p(x|y;\theta)$ , where $\rho$ denotes the label proportion and $\theta$ denotes the parameters associated with the feature distributions given labels (e.g., Nigam et al. 2000). In our notation, a generative model corresponds to Eqs (5)–(7), but with both $G_{0}$ and $G_{1}$ parametrically specified. For training by EM algorithms, the expected objective in the E-step is similar to $Q^{(t)}(\rho,\beta,G_{0})$ in (17), except that $G_{k}(\{x_{ji}\})$ is replaced by $p(x_{ji}|y_{ji}=k;\theta)$ for $k=0$ or 1. The performance of generative modeling can be sensitive to whether the model assumptions are correct or not (Cozman et al. 2003). In this regard, our approach based on ETM models is attractive in only specifying a parametric form (8) for the density ratio between $G_{0}$ and $G_{1}$ while leaving the distribution $G_{0}$ nonparametric.

Logistic regression and EM. There are various efforts to extend logistic regression in an EM-style for semi-supervised learning. Notably, Amini & Gallinari (2002) proposed a classification EM algorithm using logistic regression (1), which can be described as follows:

• E-step: Compute $\text{E}^{(t)}u_{3i}=\{1+\exp(-\beta^{c(t)}_{0}-\beta_{1}^{(t){\mathrm{\scriptscriptstyle T}}}x_{3i})\}^{-1}$ . Fix $\text{E}^{(t)}u_{1i}=0$ and $\text{E}^{(t)}u_{2i}=1$ .

• C-step: Let $u^{(t)}_{3i}=1$ if $\text{E}^{(t)}u_{3i}\geq.5$ and 0 otherwise. Fix $u^{(t)}_{1i}=0$ and $u^{(t)}_{2i}=1$ .

• M-step: Compute $(\beta^{c(t+1)}_{0},\beta_{1}^{(t+1)})$ by maximizing the objective $\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}[u^{(t)}_{ji}(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})-\log\{1+\exp(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})\}]$ .

Although convergence of classification EM was studied for clustering (Celeux & Govaert 1992), it seems unclear what objective function is optimized by the preceding algorithm. A worrisome phenomenon we notice is that if soft classification is used instead of hard classification, then the algorithm merely optimizes the log-likelihood of logistic regression with the labeled data only. By comparing (19) and (20), this modified algorithm can be shown to reduce to our EM algorithm with $\rho^{(t)}$ and $\alpha^{(t)}$ clamped at $\rho^{\ell}=n_{2}/n$ , the proportion of $y=1$ in the labeled sample.

Proposition 5.

If the objective in the M-step is modified with $u^{(t)}_{ji}$ replaced by $\text{E}^{(t)}u_{ji}$ as

[TABLE]

then $(\beta^{c(t)}_{0},\beta_{1}^{(t)})$ converges as $t\to\infty$ to the MLE of logistic regression based on the labeled data only.

We notice that the conclusion also holds if (20) is replaced by the cost function proposed in Wang et al. (2009), Eq (2), when the logistic loss is used as the cost function on labeled data.
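The classification EM iteration described above, together with the soft variant considered in Proposition 5, can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are hypothetical, the M-step is solved by plain Newton iterations without step-size control, and the data are assumed not to be linearly separable so that the maximizer exists.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, u, n_iter=100):
    """Newton's method for logistic regression with responses u in [0, 1]."""
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Z @ b)
        grad = Z.T @ (u - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        step = np.linalg.solve(hess, grad)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b                                      # b = (beta0^c, beta1) stacked

def em_iteration(b, X1, X2, X3, hard=True):
    """One E/C/M pass over labeled samples X1 (y=0), X2 (y=1) and unlabeled X3."""
    Eu3 = sigmoid(b[0] + X3 @ b[1:])              # E-step on the unlabeled sample
    u3 = (Eu3 >= 0.5).astype(float) if hard else Eu3   # C-step, or soft responses
    X = np.vstack([X1, X2, X3])
    u = np.concatenate([np.zeros(len(X1)), np.ones(len(X2)), u3])
    return fit_logistic(X, u)                     # M-step on the pooled data
```

With `hard=False`, starting the iteration at the labeled-only MLE returns the same coefficients: the unlabeled score terms $\sum_{i}\{\text{E}^{(t)}u_{3i}-\text{expit}(\beta^{c}_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{3i})\}x_{3i}$ vanish there, so the labeled-only MLE is a fixed point, in line with Proposition 5.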

Regularized methods. Various methods have been proposed by introducing a regularizer depending on unlabeled data to the log-likelihood of logistic regression with labeled data. Examples include entropy regularization (Grandvalet & Bengio 2005), expectation regularization (Mann & McCallum 2007), and graph-based priors (Krishnapuram et al. 2005). An important difference from our methods is that these penalized objective functions seem to be Fisher consistent only when they reduce to the log-likelihood of logistic regression with labeled data alone, regardless of unlabeled data. For another difference, the class proportions in unlabeled data are implicitly assumed to be the same as in labeled data in entropy regularization, and need to be explicitly estimated from labeled data or external knowledge in the case of label regularization (Mann & McCallum 2007).

5 Numerical experiments

We report experiments on 15 benchmark datasets including 11 UCI datasets and 4 SSL benchmark datasets. We compare our methods, profile SLR (pSLR) and direct SLR (dSLR), with 2 supervised methods, ridge logistic regression (RLR) and SVM, and 2 semi-supervised methods, entropy regularization (ER) (Grandvalet & Bengio 2005) and transductive SVM (TSVM) (Joachims 1999). For each method, only linear predictors are studied. All tuning parameters are selected by 5-fold cross validation. See the Supplement (Section V) for details about the datasets and implementations.

For each dataset except SPAM, a training set is obtained as follows: labeled data are sampled for a certain size (25 or 100) and fixed class proportions and then unlabeled data are sampled such that the labeled and unlabeled data combined are 2/3 of the original dataset. The remaining 1/3 of the dataset is used as a test set. For SPAM, the preceding procedure is applied to a subsample of size 750 from the original dataset. To allow different class proportions between labeled and unlabeled data, we consider two schemes: the class proportions in the labeled data are close to those of the original dataset (“Homo Prop”), or larger (or smaller) than the latter by an odds ratio of 4 (“Flip Prop”) if the odds of positive versus negative labels is $\leq 1$ (or respectively $>1$ ) in the original dataset. Hence the class balance constraint as used in TSVM is misspecified in the second scheme.
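The two labeled-data sampling schemes can be illustrated by computing the target proportion of positive labels in the labeled sample. This is a sketch of our reading of the description above; the function name `labeled_positive_proportion` and the argument names are hypothetical.

```python
def labeled_positive_proportion(p_orig, scheme="homo", odds_ratio=4.0):
    """Target P(y=1) in the labeled sample, given P(y=1) = p_orig in the original dataset.

    scheme="homo": keep the original proportion.
    scheme="flip": multiply the odds by odds_ratio if odds <= 1, else divide by it.
    """
    if scheme == "homo":
        return p_orig
    odds = p_orig / (1.0 - p_orig)
    new_odds = odds * odds_ratio if odds <= 1.0 else odds / odds_ratio
    return new_odds / (1.0 + new_odds)
```

For example, an original dataset with 20% positive labels has odds 1:4, so under the Flip Prop scheme the labeled sample is drawn with odds 1:1, that is, 50% positive labels.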

Care is needed to define classifiers on test data. In the Homo Prop scheme, the 4 existing methods are applied as usual, and accordingly the classifiers from our methods are the sign of $\log(n_{2}/n_{1})+\hat{\beta}_{0}+\hat{\beta}_{1}^{\mathrm{\scriptscriptstyle T}}x$ , where $(n_{1},n_{2})$ are the class sizes in the labeled training data. In the Flip Prop scheme, the classifiers from RLR, ER, and SVM are the sign of $-\log(n_{2}/n_{1})+\tilde{\beta}^{c}_{0}+\tilde{\beta}_{1}^{\mathrm{\scriptscriptstyle T}}x$ , and those from our methods are the sign of $\hat{\beta}_{0}+\hat{\beta}_{1}^{\mathrm{\scriptscriptstyle T}}x$ . Hence the intercepts of linear predictors are adjusted by assuming 1:1 class proportions in the test data. This assumption is often invalid in our experiments, but seems neutral when the actual class proportions in test data are unknown. The “linear predictor” is converted by logit from class probabilities for SVM, but this is currently unavailable for TSVM. Alternatively, class weights can be used in SVM, but this technique has not been developed for TSVM.
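The classification rules in the two schemes can be written compactly as below. This is a minimal sketch; the function names and the `scheme` argument are hypothetical, and the logistic-type rule is shown for a fitted intercept on the $\beta_{0}^{c}$ scale.

```python
import numpy as np

def slr_classify(X, beta0, beta1, n1, n2, scheme):
    """Sign rule for pSLR/dSLR; (n1, n2) are class sizes in the labeled training data."""
    shift = np.log(n2 / n1) if scheme == "homo" else 0.0
    return np.sign(shift + beta0 + X @ beta1)

def lr_classify(X, beta0_c, beta1, n1, n2, scheme):
    """Sign rule for a logistic-type fit, with intercept adjustment in the flip scheme."""
    shift = 0.0 if scheme == "homo" else -np.log(n2 / n1)
    return np.sign(shift + beta0_c + X @ beta1)
```

The two rules coincide whenever $\beta_{0}^{c}=\beta_{0}+\log(n_{2}/n_{1})$ and the slopes agree, which is the intercept shift noted in Section 3.2.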

Table 1 presents the results with labeled data size 100. See the Supplement for those with labeled data size 25 and AUC results. In the Homo Prop scheme, the logistic-type methods, RLR, ER, pSLR, and dSLR, perform similarly to each other, and noticeably better than SVM and TSVM in terms of accuracy achieved within 1% of the highest (in bold). While unstable performances of SVM and TSVM have been previously noticed (e.g., Li & Zhou 2015), such good performances of RLR and ER on these benchmark datasets appear not to have been reported before. In the Flip Prop scheme, our methods, dSLR and pSLR, achieve the best two performances, sometimes with considerable margins of improvement over other methods. In this case, all methods except TSVM are applied with intercept adjustment as described above. Because which proportion scheme holds may be unknown in practice, the results with intercept adjustment in the Homo Prop scheme are reported in the Supplement. Our methods still achieve close to the best performance among the methods studied.

6 Conclusion

We develop an extension of logistic regression for semi-supervised learning, with strong support from statistical theory, algorithms, and numerical results. There are various questions of interest for future work. Our approach can be readily extended by employing nonlinear predictors such as kernel representations or neural networks. Further experiments with such extensions are desired, as well as applications to more complex text and image classification.

References

Amini, M.R. & Gallinari, P. (2002) Semi-supervised logistic regression. Proceedings of the 15th European Conference on Artificial Intelligence, 390–394.

Bartlett, P., Jordan, M. & McAuliffe, J. (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138–156.

Celeux, G. & Govaert, G. (1992) A classification EM algorithm and two stochastic versions. Computational Statistics and Data Analysis, 14, 315–332.

Chapelle, O., Zien, A. & Schölkopf, B. (2006) Semi-Supervised Learning. MIT Press.

Cozman, F., Cohen, I. & Cirelo, M. (2003) Semi-supervised learning of mixture models. Proceedings of the 20th International Conference on Machine Learning, 99–106.

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–22.

Grandvalet, Y. & Bengio, Y. (2005) Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems 17, 529–536.

Joachims, T. (1999) Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, 200–209.

Kiefer, J. & Wolfowitz, J. (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887–906.

Krishnapuram, B., Williams, D., Xue, Y., Carin, L., Figueiredo, M. & Hartemink, A.J. (2005) On semi-supervised classification. Advances in Neural Information Processing Systems 17, 721–728.

Li, Y.-F. & Zhou, Z.-H. (2015) Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 175–188.

Lin, Y. (2002) Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259–275.

Mann, G.S. & McCallum, A. (2007) Simple, robust, scalable semi-supervised learning via expectation regularization. Proceedings of the 24th International Conference on Machine Learning, 593–600.

Manski, C.F. (1988) Analog Estimation Methods in Econometrics, Chapman & Hall.

Nigam, K., McCallum, A.K., Thrun, S. & Mitchell, T. (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103–134.

Owen, A.B. (2001) Empirical Likelihood. Chapman & Hall/CRC.

Prentice, R.L. & Pyke, R. (1979) Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.

Qin, J. (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–630.

Qin, J. (1999) Empirical likelihood ratio based confidence intervals for mixture proportions. Annals of Statistics, 27, 1368–1384.

Tan, Z. (2009) A note on profile likelihood for exponential tilt mixture models. Biometrika, 96, 229–236.

Wang, J., Shen, X. & Pan, W. (2009) On efficient large margin semisupervised learning: Method and theory. Journal of Machine Learning Research, 10, 719–742.

Vapnik, V. (1998) Statistical Learning Theory. Wiley-Interscience.

White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.

Zhu, X.J. (2008) Semi-supervised learning literature survey. Technical Report, University of Wisconsin-Madison, Department of Computer Sciences.

Zou, F., Fine, J.P. & Yandell, B.S. (2002) On empirical likelihood for a semiparametric mixture model. Biometrika, 89, 61–75.

**Supplementary Material for “Semi-supervised Logistic Learning Based on Exponential Tilt Mixture Models”**

[TABLE]

I Introduction

We provide additional material to support the content of the paper. All equation and proposition numbers referred to are from the paper, except S1, S2, etc.

II Illustration

We provide a simple example to highlight comparison between new and existing methods. A labeled sample of size 100 is drawn, where 20 are from bivariate Gaussian, $G_{0}$ , with mean $(-6,-6)$ and diagonal variance matrix $(5^{2},15^{2})$ , and 80 are from bivariate Gaussian, $G_{1}$ , with mean $(6,6)$ and diagonal variance matrix $(5^{2},10^{2})$ . An unlabeled sample of size 1000 is drawn, where $500$ are from $G_{0}$ and $500$ from $G_{1}$ and then the labels are removed. This is similar to the Flip Prop scheme in numerical experiments in Section 5, where the class proportions in unlabeled data differ from those in labeled data. The training set including both labeled and unlabeled data is then rescaled such that the root mean square of each feature is 1, as shown in Figure S1.
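The simulation setup above can be reproduced along the following lines. This is a minimal sketch: the function name `simulate` and the seed are arbitrary choices, so the draws will not match those behind Figure S1.

```python
import numpy as np

def simulate(seed=0):
    """Draw the labeled (20 + 80) and unlabeled (500 + 500) samples and rescale features."""
    rng = np.random.default_rng(seed)

    def g0(m):   # class y=0: mean (-6, -6), standard deviations (5, 15)
        return rng.normal([-6.0, -6.0], [5.0, 15.0], size=(m, 2))

    def g1(m):   # class y=1: mean (6, 6), standard deviations (5, 10)
        return rng.normal([6.0, 6.0], [5.0, 10.0], size=(m, 2))

    X_lab = np.vstack([g0(20), g1(80)])
    y_lab = np.concatenate([np.zeros(20), np.ones(80)])
    X_unl = np.vstack([g0(500), g1(500)])          # labels discarded
    X_all = np.vstack([X_lab, X_unl])
    scale = np.sqrt(np.mean(X_all ** 2, axis=0))   # root mean square of each feature
    return X_lab / scale, y_lab, X_unl / scale, scale
```

The returned `scale` is kept so that an independent sample, such as the one used for the oracle lines, can be transformed by the same factors.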

Figure S2 (rows 1 to 5) shows the decision lines from, respectively, ridge logistic regression (RLR), entropy regularization (ER), SVM, TSVM, and direct SLR (dSLR). In the left column are the decision lines without intercept adjustment (corresponding to an assumption of 1:4 class proportions in test data as in labeled training data), and in the right column are those with intercept adjustment (corresponding to an assumption of 1:1 class proportions in test data as in unlabeled training data), as described in Section 5. In practice, the class proportions in test data may be unknown and hence some assumption is needed. The assumption of 1:1 class proportions in test data is used to define classifiers in the Flip Prop scheme in Section 5, even though this assumption is violated for a majority of datasets studied (see Table S1). For ease of comparison, the intercept adjustment is directly applied to $w^{\mathrm{\scriptscriptstyle T}}x+b$ for SVM and TSVM, instead of the “linear predictor” converted by logit from class probabilities (if available), which would yield a nonlinear decision boundary. Alternatively, class weights can be used in SVM to account for differences in class proportions between training and test data. But this technique has not been developed for TSVM.

For each method, eight decision lines (black or blue) are plotted, by using 8 values of a tuning parameter. Some of the lines may fall outside the plot region. The blue lines correspond to the least amount of penalization used, that is, smallest $\lambda$ , $\lambda_{e}$ and $\gamma$ and largest $C$ . See Section V later for a description of the tuning parameters involved. For RLR, $\log_{10}(\lambda)$ is varied uniformly over $[-5,-1]$ . For ER, $\lambda_{e}$ is varied uniformly from $0.01$ to 1, while $\lambda$ is fixed at 0, to isolate the effect of entropy regularization. For SVM and TSVM, $\log_{10}(C)$ is varied uniformly over $[-2,2]$ . For TSVM, the parameter $C^{\ast}$ is automatically tuned when using SVM$^{\text{light}}$ (Joachims 1999). For dSLR, $\log_{10}(\gamma)$ is varied uniformly over $[-4,0]$ , while $\lambda$ is fixed at 0.

Two oracle lines are drawn in each plot. The red line is computed by logistic regression and the purple line is computed by SVM with $C=1000$ , from an independent labeled sample of size 4000 with 1:4 class proportions (left column) or 1:1 class proportions (right column), which is transformed by the same scale as the original training set. The red and purple oracle lines differ only slightly in the left column, but are virtually identical in the right column. It should be noted that these oracle lines are not the optimal, Bayes decision boundary, because the log density ratio between the classes is linear in $x_{1}$ but nonlinear in $x_{2}$ due to the different variances of $x_{2}$ .
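To make the last point concrete, the log density ratio can be worked out from the two Gaussian densities specified above, on the original feature scale before rescaling. This is a direct calculation added for illustration:

$$\log\frac{\mathrm{d}G_{1}}{\mathrm{d}G_{0}}(x)=\log\frac{15}{10}+\frac{(x_{1}+6)^{2}-(x_{1}-6)^{2}}{2\cdot 5^{2}}+\frac{(x_{2}+6)^{2}}{2\cdot 15^{2}}-\frac{(x_{2}-6)^{2}}{2\cdot 10^{2}}=\log 1.5-0.1+0.48\,x_{1}+\frac{13}{150}\,x_{2}-\frac{1}{360}\,x_{2}^{2}.$$

The quadratic terms in $x_{1}$ cancel because both classes have variance $5^{2}$ in $x_{1}$ , whereas the term $-x_{2}^{2}/360$ remains because the variances of $x_{2}$ are $15^{2}$ and $10^{2}$ . Rescaling each feature by a constant changes the coefficients but not this structure.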

From these plots, we see the following comparison. First, the least penalized line (blue) from our method dSLR is much closer to the oracle lines (red and purple) than those from the other methods, whether or not intercept adjustment is applied. This shows numerical support for Fisher consistency of our method, given that the labeled size 100 and unlabeled size 1000 are reasonably large compared with the feature dimension 2. On the other hand, in spite of the relatively large labeled size, the lines from non-penalized logistic regression and SVM based on labeled data alone still differ noticeably from the oracle lines. Hence this also shows that our method can exploit unlabeled data together with labeled data to actually achieve a better approximation to the oracle lines.

Second, with suitable choices of tuning parameters, some of the decision lines from existing methods can be reasonably close to the oracle lines. In fact, such cases of good approximation can be found from the supervised methods RLR and SVM, but not from the semi-supervised methods ER and TSVM. This indicates potentially unstable performances of ER and TSVM, particularly in the current setting where the class proportions in unlabeled data differ from those in labeled data. Moreover, SVM seems to perform noticeably worse in the right column, possibly due to intercept adjustment, than in the left column, where the class proportions in test data underlying the oracle lines are identical to those in labeled training data (hence a more favorable setting).

III EM algorithm for direct SLR

We present an EM algorithm to numerically maximize (16) for direct SLR, based on the SLR model defined by (12)–(14). We introduce the following data augmentation. Given the pooled data $\{(z_{i},x_{i}):i=1,\dots,N\}$ , let

[TABLE]

Equivalently, $\{u_{i}:i=1,\ldots,N\}$ can be denoted as $\{u_{ji}:i=1,\ldots,n_{j},j=1,2,3\}$ , such that $u_{ji}|x_{ji}\sim\mbox{Bernoulli}\,[\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})/\{1-\rho_{j}+\rho_{j}\exp(\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{ji})\}]$ for $j=1,2,3$ . Similarly as in Section 3.3, $u_{1i}=0$ and $u_{2i}=1$ are fixed, because $\rho_{1}=0$ and $\rho_{2}=1$ .

E-step. The expectation of the average penalized log-likelihood from the augmented data, given the current estimates $(\rho^{(t)},\beta^{(t)})$ is, up to an additive constant free of $(\rho,\beta)$ ,

[TABLE]

where $\text{E}^{(t)}u_{ji}=\rho^{(t)}_{j}\exp({\beta_{0}}^{(t)}+\beta_{1}^{(t){\mathrm{\scriptscriptstyle T}}}x_{ji})/[1-\rho^{(t)}_{j}+\rho^{(t)}_{j}\exp({\beta_{0}}^{(t)}+\beta_{1}^{(t){\mathrm{\scriptscriptstyle T}}}x_{ji})]$ .

M-step. The next estimates $(\rho^{(t+1)},\beta^{(t+1)})$ are obtained as a maximizer of the expected objective $\tilde{Q}^{(t)}(\rho,\beta)$ . Recall that $\kappa_{Q}^{(t)}(\rho,\beta,\alpha)$ defined in Section 3.3 is

[TABLE]

It directly follows that $\tilde{Q}^{(t)}(\rho,\beta)=\kappa_{Q}^{(t)}\{\rho,\beta,\tilde{\alpha}(\rho)\}$ up to an additive constant. Therefore, the expected objective $\tilde{Q}^{(t)}(\rho,\beta)$ is related to $\text{pQ}^{(t)}(\rho,\beta)=\kappa_{Q}^{(t)}\{\rho,\beta,\hat{\alpha}(\beta)\}$ in the profile method, in a similar manner as the average log-likelihood $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ in the SLR model is related to the average profile log-likelihood $\text{pl}(\rho,\beta)$ in the ETM model before data augmentation.

Unfortunately, when $\rho$ is penalized with $\gamma>0$ , there is no simple, closed-form expression for computing $\rho^{(t+1)}$ as in Proposition 4. Nevertheless, we show that $\rho^{(t+1)}$ can be obtained as a solution to a simple equation, independently of $\beta^{(t+1)}$ .

Proposition S1.

The estimate $\tilde{\rho}=\rho^{(t+1)}$ satisfies

[TABLE]

where $\psi(\tilde{\rho})=1-n_{3}\tilde{\rho}(1-\tilde{\rho})/\{N\tilde{\alpha}(\tilde{\rho})(1-\tilde{\alpha}(\tilde{\rho}))\}\in(0,1)$ because $\tilde{\alpha}(\tilde{\rho})(1-\tilde{\alpha}(\tilde{\rho}))>(n_{3}/N)\tilde{\rho}(1-\tilde{\rho})$ for any $\tilde{\rho}\in(0,1)$ as shown in the proof of Proposition 1.

The formula (S2) shows that $\tilde{\rho}=\rho^{(t+1)}$ implicitly remains a weighted average of the prior center $\rho^{0}$ and the empirical estimate $n_{3}^{-1}\sum_{i=1}^{n_{3}}\text{E}^{(t)}u_{3i}$ , with the weight depending on $\gamma$ . If $\gamma=0$ , then $\rho^{(t+1)}$ reduces to $n_{3}^{-1}\sum_{i=1}^{n_{3}}\text{E}^{(t)}u_{3i}$ and hence the EM iterations $(\rho^{(t)},\beta^{(t)})$ coincide with those for profile SLR in Section 3.3. If $\gamma\to\infty$ , then $\rho^{(t+1)}$ becomes fixed at $\rho^{0}$ and then $\beta^{(t)}$ converges to a maximizer of $\kappa\{\rho^{0},\beta,\tilde{\alpha}(\rho^{0})\}-\lambda\|\beta_{1}\|^{2}_{2}$ , the ridge estimator of $\beta$ in the SLR model (12)–(14) with $\rho=\rho^{0}$ fixed. When $\rho^{0}$ is set to $\rho^{\ell}=n_{2}/n$ , this estimator of $\beta$ is identical to that from ridge logistic regression with labeled data only, except for an intercept shift.

In contrast, if $\rho=\rho^{0}$ is fixed in the EM algorithm for profile SLR, then $\beta^{(t)}$ converges to a maximizer of $\kappa\{\rho^{0},\beta,\hat{\alpha}(\beta)\}-\lambda\|\beta_{1}\|^{2}_{2}$ , the ridge estimator of $\beta$ in the ETM model (5)–(8).
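For the case $\gamma=0$ discussed above, the update $\rho^{(t+1)}=n_{3}^{-1}\sum_{i}\text{E}^{(t)}u_{3i}$ can be sketched as follows. The E-step formula used here, $\text{E}^{(t)}u_{3i}=\rho e^{h_{i}}/(1-\rho+\rho e^{h_{i}})$ with $h_{i}=\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x_{3i}$, is an assumption based on the two-component exponential tilt mixture form and is not restated in this section; all function names are hypothetical.

```python
import numpy as np

def e_step_unlabeled(X3, rho, beta0, beta1):
    """Assumed posterior weights for unlabeled points:
    rho*exp(h) / (1 - rho + rho*exp(h)) = sigmoid(h + logit(rho))."""
    h = beta0 + X3 @ beta1
    z = h + np.log(rho) - np.log1p(-rho)
    return 1.0 / (1.0 + np.exp(-z))

def rho_update_gamma0(X3, rho, beta0, beta1):
    """With gamma = 0, the new rho is the average of the E-step weights."""
    return e_step_unlabeled(X3, rho, beta0, beta1).mean()

rng = np.random.default_rng(0)
X3 = rng.normal(size=(200, 3))           # toy unlabeled features
rho_next = rho_update_gamma0(X3, 0.3, 0.1, np.array([0.5, -0.2, 0.0]))
print(rho_next)
```

With $\gamma>0$ the update is instead defined implicitly by Eq (S2) and would need a one-dimensional root finder.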

IV Technical details

IV.1 Proof of Proposition 1

By some abuse of notation, denote $\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ as $\beta^{\mathrm{\scriptscriptstyle T}}x$ . Let $\mathcal{R}$ be a fixed open set of $(\rho,\beta)$ . First, suppose that $(\tilde{\rho},\tilde{\beta})$ is a maximizer of $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ over $\mathcal{R}$ . Then $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}\leq\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}\leq\kappa(\tilde{\rho},\tilde{\beta},\tilde{\alpha}(\tilde{\rho}))$ for any $(\rho,\beta)\in\mathcal{R}$ . Denote $\tilde{\alpha}=\tilde{\alpha}(\tilde{\rho})$ . To prove that $(\tilde{\rho},\tilde{\beta})$ is a maximizer of $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ over $\mathcal{R}$ , we show that $\tilde{\alpha}$ is a minimizer of $\kappa(\tilde{\rho},\tilde{\beta},\alpha)$ , which then implies that $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ achieves a maximum value $\kappa(\tilde{\rho},\tilde{\beta},\tilde{\alpha})$ at $(\tilde{\rho},\tilde{\beta})$ . Because $\kappa(\tilde{\rho},\tilde{\beta},\alpha)$ is convex in $\alpha$ , it suffices to show $A=0$ , where

[TABLE]

Because $(\tilde{\rho},\tilde{\beta})$ is a maximizer of $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ , the stationary condition in $(\rho,\beta_{0})$ yields

[TABLE]

where $\tilde{\rho}_{1}=0$ , $\tilde{\rho}_{2}=1$ , and $\tilde{\rho}_{3}=\tilde{\rho}$ . Eq (S4) is equivalent to

[TABLE]

Summing Eq (S3) multiplied by $\tilde{\rho}(1-\tilde{\rho})$ and Eq (S5) gives

[TABLE]

or equivalently

[TABLE]

Because $\tilde{\alpha}=\sum_{j=1}^{3}(n_{j}/N)\tilde{\rho}_{j}$ and $t(1-t)$ is concave in $t$ , Jensen’s inequality implies that $\tilde{\alpha}(1-\tilde{\alpha})\geq\sum_{j=1}^{3}(n_{j}/N)\tilde{\rho}_{j}(1-\tilde{\rho}_{j})=(n_{3}/N)\tilde{\rho}(1-\tilde{\rho})$ . The inequality holds strictly, $\tilde{\alpha}(1-\tilde{\alpha})>(n_{3}/N)\tilde{\rho}(1-\tilde{\rho})$ , because $\tilde{\rho}_{1}=0\not=\tilde{\rho}_{2}=1$ . Hence $A=0$ .
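The gap in the Jensen step can be made explicit. Writing $w_{j}=n_{j}/N$ (a shorthand introduced only for this remark) and using $\sum_{j}w_{j}=1$, a direct expansion gives

$$
\tilde{\alpha}(1-\tilde{\alpha})-\sum_{j=1}^{3}w_{j}\tilde{\rho}_{j}(1-\tilde{\rho}_{j})
=\sum_{j=1}^{3}w_{j}\tilde{\rho}_{j}^{2}-\tilde{\alpha}^{2}
=\sum_{j=1}^{3}w_{j}(\tilde{\rho}_{j}-\tilde{\alpha})^{2}.
$$

The right-hand side is a weighted variance of $(\tilde{\rho}_{1},\tilde{\rho}_{2},\tilde{\rho}_{3})=(0,1,\tilde{\rho})$, which is strictly positive whenever $n_{1},n_{2}>0$ because the first two values differ. It has the same form as the quantity $\delta$ in Lemma S1(ii), which is evaluated there at the true values.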

Next suppose that $(\hat{\rho},\hat{\beta})$ is a maximizer of $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ over $\mathcal{R}$ . Denote $\hat{\alpha}=\hat{\alpha}(\hat{\beta})$ . We show that $\hat{\alpha}=\tilde{\alpha}(\hat{\rho})=\sum_{j=1}^{3}(n_{j}/N)\hat{\rho}_{j}$ . Because $(\hat{\rho},\hat{\beta},\hat{\alpha})$ is a solution to the saddle-point problem (11), the stationary condition in $\alpha$ or equivalently Eq (10) gives

[TABLE]

The stationary condition in $(\rho,\beta_{0})$ yields

[TABLE]

where $\hat{\rho}_{1}=0$ , $\hat{\rho}_{2}=1$ , and $\hat{\rho}_{3}=\hat{\rho}$ . Eq (S7) implies

[TABLE]

Eq (S8) is equivalent to

[TABLE]

Combining Eq (S6) multiplied by $1-\hat{\alpha}$ and summing Eq (S9) over $j=1,2,3$ and Eq (S10) shows $1-\hat{\alpha}=\sum_{j=1}^{3}(n_{j}/N)(1-\hat{\rho}_{j})$ , that is, $\hat{\alpha}=\sum_{j=1}^{3}(n_{j}/N)\hat{\rho}_{j}$ . Let $(\tilde{\rho},\tilde{\beta})$ be a maximizer of $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ over $\mathcal{R}$ . The preceding proof shows that $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ achieves the same maximum value $\kappa(\tilde{\rho},\tilde{\beta},\tilde{\alpha}(\tilde{\rho}))$ as does $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ over $\mathcal{R}$ . Hence $\kappa(\hat{\rho},\hat{\beta},\hat{\alpha})$ as the maximum value of $\kappa\{\rho,\beta,\hat{\alpha}(\beta)\}$ is also the maximum value of $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ over $\mathcal{R}$ . Because $\hat{\alpha}=\tilde{\alpha}(\hat{\rho})$ and $\kappa(\hat{\rho},\hat{\beta},\hat{\alpha})=\kappa(\hat{\rho},\hat{\beta},\tilde{\alpha}(\hat{\rho}))$ , this shows that $(\hat{\rho},\hat{\beta})$ is a maximizer of $\kappa\{\rho,\beta,\tilde{\alpha}(\rho)\}$ over $\mathcal{R}$ .

IV.2 Proof of Proposition 2

Denote

[TABLE]

First, we show $\kappa^{*}\{\rho,h,\tilde{\alpha}(\rho)\}\leq\kappa^{*}\{\rho^{*},h^{*},\tilde{\alpha}(\rho^{*})\}$ for any $\rho\in(0,1)$ and $h(x)$ . By direct calculation, notice that up to an additive constant,

[TABLE]

where $\rho^{*}_{1}=0$ , $\rho^{*}_{2}=1$ , and $\rho^{*}_{3}=\rho^{*}$ . Hence

[TABLE]

where $\mbox{KL}(q^{*}\|q)=\sum_{j}q^{*}_{j}\log(q^{*}_{j}/q_{j})$ is the Kullback–Leibler (KL) divergence between two probability vectors $(q^{*}_{j})$ and $(q_{j})$ .
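The inequality rests on the fact that the KL divergence is nonnegative and vanishes only when the two probability vectors coincide. A small sketch of the definition above (the function name and test vectors are illustrative):

```python
import numpy as np

def kl_divergence(q_star, q):
    """KL(q* || q) = sum_j q*_j log(q*_j / q_j) for probability vectors."""
    q_star = np.asarray(q_star, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = q_star > 0          # terms with q*_j = 0 contribute 0 by convention
    return float(np.sum(q_star[mask] * np.log(q_star[mask] / q[mask])))

q_star = [0.2, 0.3, 0.5]
print(kl_divergence(q_star, q_star))            # zero when the vectors agree
print(kl_divergence(q_star, [0.3, 0.3, 0.4]))   # positive otherwise
```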

Next we show that $\min_{\alpha\in(0,1)}\kappa^{*}(\rho^{*},h^{*},\alpha)=\kappa^{*}\{\rho^{*},h^{*},\tilde{\alpha}(\rho^{*})\}$ , that is, $\kappa^{*}(\rho^{*},h^{*},\alpha)\geq\kappa^{*}\{\rho^{*},h^{*},\tilde{\alpha}(\rho^{*})\}$ for any $\alpha\in(0,1)$ . By direct calculation, we obtain

[TABLE]

where the left hand side is the KL divergence between two probability distributions $\{1-\tilde{\alpha}(\rho^{*})+\tilde{\alpha}(\rho^{*})\exp(h^{*}(x))\}\text{d}G_{0}(x)$ and $\{1-\alpha+\alpha\exp(h^{*}(x))\}\text{d}G_{0}(x)$ .

IV.3 Proof of Proposition 3

By definition, $\hat{\beta}$ is a maximizer of $\text{pl}(\beta)=\max_{\rho\in(0,1)}\text{pl}(\rho,\beta)$ . By abuse of notation, rewrite $(1,x^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$ as $x$ and hence $\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ as $\beta^{\mathrm{\scriptscriptstyle T}}x$ . Denote the log-likelihood, after rescaling, for logistic regression based on labeled data only as

[TABLE]

where $\rho_{1}=0$ and $\rho_{2}=1$ as before. The goal is to compare the asymptotic efficiency of $\hat{\beta}$ and $\hat{\beta}^{\ell}$ . We use Lemmas S1–S3 presented later in the subsection.

For notational simplicity, assume that $n_{j}/N$ is fixed as a constant $0<\eta_{j}<1$ as $N\rightarrow\infty$. The results can also be extended to the case where $n_{j}/N$ tends to a constant $0<\eta_{j}<1$, as in previous asymptotic analysis (Qin 1999). Unless otherwise stated, $(\rho,\beta)$ are evaluated at the true values $(\rho^{*},\beta^{*})$, and $\alpha$ is evaluated at $\alpha^{*}=\sum_{j=1}^{3}\eta_{j}\rho^{*}_{j}$, where $\rho^{*}_{1}=0$, $\rho^{*}_{2}=1$, and $\rho^{*}_{3}=\rho^{*}$.

By Lemma S2, it suffices to show

[TABLE]

where $V$ , $G$ , and $H$ are from Lemma S2. For $\partial\text{pl}^{*}(\beta)/\partial\beta$ in Lemma S2, the inequality

[TABLE]

implies

[TABLE]

Substituting the result of Lemma S3 into Eq (S12) yields Eq (S11).

In the following, we present the three lemmas used above. See Sections IV.4–IV.6 for proofs. Denote by $\kappa$ the function $\kappa(\rho,\beta,\alpha)$ . As above, $(\rho,\beta,\alpha)$ are evaluated at $(\rho^{*},\beta^{*},\alpha^{*})$ .

Lemma S1.

(i) As $N\rightarrow\infty$ , we have

[TABLE]

in probability, where

[TABLE]

(ii) Denote $\delta=\sum_{j=1}^{3}\eta_{j}{\rho_{j}^{\ast}}^{2}-{\alpha^{\ast}}^{2}$ . As $N\rightarrow\infty$ , $\sqrt{N}(\partial\kappa/\partial\beta^{\mathrm{\scriptscriptstyle T}},\partial\kappa/\partial\rho,\partial\kappa/\partial\alpha)^{\mathrm{\scriptscriptstyle T}}$ converges to multivariate normal with mean 0 and variance matrix

[TABLE]

Lemma S2.

(i) Under standard regularity conditions, $\sqrt{N}(\hat{\beta}-\beta^{\ast})$ converges in distribution to $N(0,V^{-1})$ , with $V=\text{var}\{\sqrt{N}\partial\text{pl}^{*}(\beta)/\partial\beta\}=S_{11}-s_{22}^{-1}S_{12}S_{21}-s_{33}^{-1}S_{13}S_{31}$ , where

[TABLE]

(ii) Under standard regularity conditions, $\sqrt{n}(\hat{\beta}^{\ell}-\beta^{\ast})$ converges in distribution to $N(0,H^{-1}GH^{-1})$ , with

[TABLE]

where

[TABLE]

Lemma S3.

The inner product of $\partial\text{pl}^{\ast}(\beta)/\partial\beta$ and $\partial\kappa_{\ell}(\beta)/\partial\beta$ equals $N^{-1}H$ , i.e.,

[TABLE]

IV.4 Proof of Lemma S1

(i) We give the calculation of $S_{11}$ as an example. The remaining elements in $U^{\dagger}$ can be calculated in a similar way. First, direct calculation yields

[TABLE]

Because $\{x_{ji}:i=1,\dots,n_{j}\}$ are independent and identically drawn from

[TABLE]

we obtain

[TABLE]

where the simplification in the second term on the right hand side uses

[TABLE]

(ii) For a vector $x\in\mathbb{R}^{p+1}$ , denote $x^{\otimes 2}=xx^{\mathrm{\scriptscriptstyle T}}$ . We show the derivations of $V^{\dagger}_{11}$ and $V^{\dagger}_{13}$ as examples and the remaining elements in $V^{\dagger}$ can be derived similarly. First, we calculate $V^{\dagger}_{11}$ as

[TABLE]

where

[TABLE]

and

[TABLE]

Hence $V^{\dagger}_{11}=S_{11}-\delta S_{13}S_{31}$ . Second, we calculate $V^{\dagger}_{13}$ as

[TABLE]

where

[TABLE]

and

[TABLE]

Hence $V^{\dagger}_{13}=-\delta S_{13}s_{33}$ .

IV.5 Proof of Lemma S2

(i) Note that $\text{pl}(\beta)=\kappa(\rho,\beta,\alpha)$ with $\rho=\hat{\rho}(\beta)$ and $\alpha=\hat{\alpha}(\beta)$ satisfying $\partial\kappa(\rho,\beta,\alpha)/\partial\rho=0$ and $\partial\kappa(\rho,\beta,\alpha)/\partial\alpha=0$ . By implicit differentiation, the gradient and Hessian of $\text{pl}(\beta)$ are

[TABLE]

where $\kappa(\rho,\beta,\alpha)$ is treated as $\kappa(\beta,\phi)$ with $\phi=(\rho,\alpha)^{\mathrm{\scriptscriptstyle T}}$ and $\hat{\phi}(\beta)=\{\hat{\rho}(\beta),\hat{\alpha}(\beta)\}^{\mathrm{\scriptscriptstyle T}}$ .
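The implicit-differentiation step can be illustrated on a toy problem. The sketch below assumes the standard identity that the Hessian of a profiled function is $\kappa_{\beta\beta}-\kappa_{\beta\phi}\kappa_{\phi\phi}^{-1}\kappa_{\phi\beta}$ evaluated at $\hat{\phi}(\beta)$, and checks it by finite differences on a quadratic function. The quadratic `f` is a stand-in, not the function $\kappa$ of the paper; the identity depends only on stationarity in $\phi$, so it applies equally when $\phi$ is maximized in one coordinate and minimized in another.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 2
A = rng.normal(size=(p + q, p + q))
M = A @ A.T + np.eye(p + q)                  # symmetric positive definite
M_bb, M_bf, M_ff = M[:p, :p], M[:p, p:], M[p:, p:]

def f(b, phi):
    """Toy objective f(b, phi) = -0.5 * z' M z with z = (b, phi)."""
    z = np.concatenate([b, phi])
    return -0.5 * z @ M @ z

def phi_hat(b):
    """Solves the stationarity condition df/dphi = 0 for fixed b."""
    return -np.linalg.solve(M_ff, M_bf.T @ b)

def profile(b):
    return f(b, phi_hat(b))

def numerical_hessian(fun, b, eps=1e-3):
    """Central finite-difference Hessian."""
    k = len(b)
    I = np.eye(k)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            H[i, j] = (fun(b + eps * I[i] + eps * I[j]) - fun(b + eps * I[i] - eps * I[j])
                       - fun(b - eps * I[i] + eps * I[j]) + fun(b - eps * I[i] - eps * I[j])) / (4 * eps**2)
    return H

# f_bb - f_bphi f_phiphi^{-1} f_phib for this quadratic (a Schur complement).
schur = -(M_bb - M_bf @ np.linalg.solve(M_ff, M_bf.T))
H_num = numerical_hessian(profile, rng.normal(size=p))
print(np.max(np.abs(H_num - schur)))
```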

We use similar arguments as in the proof of Proposition 2 in Tan (2009). Write $U^{\dagger}$ as a $2\times 2$ block matrix

[TABLE]

where $\Sigma_{00}$ is the bottom-right $2\times 2$ diagonal matrix with diagonal elements $s_{22}$ and $s_{33}$. By the asymptotic theory of M-estimators, the equation $0=\partial\kappa/\partial\phi\rvert_{\beta=\beta^{\ast}}$ admits a solution $\hat{\phi}(\beta^{\ast})=\phi^{\ast}+O_{p}(N^{-1/2})$ with $\phi^{*}=(\rho^{*},\alpha^{*})^{\mathrm{\scriptscriptstyle T}}$. More specifically,

[TABLE]

By a Taylor expansion of $(\partial\text{pl}/\partial\beta)(\beta^{\ast})$ in Eq (S13), with $\hat{\phi}(\beta^{\ast})$ around $\phi^{\ast}$, we find

[TABLE]

By the law of large numbers, $(\partial^{2}\kappa/\partial\beta\partial\phi^{\mathrm{\scriptscriptstyle T}})(\beta^{*},\phi^{*})$ and $(\partial^{2}\kappa/\partial\phi\partial\phi^{\mathrm{\scriptscriptstyle T}})(\beta^{*},\phi^{*})$ converge in probability to $\Sigma_{10}$ and $\Sigma_{00}$ respectively as $N\rightarrow\infty$. With $\Sigma_{10}\Sigma_{00}^{-1}=(s_{22}^{-1}S_{12},s_{33}^{-1}S_{13})$, we have $(\partial\text{pl}/\partial\beta)(\beta^{*})=(\partial\text{pl}^{*}/\partial\beta)(\beta^{*})+o_{p}(N^{-1/2})$. Then, by Lemma S1(ii), $\sqrt{N}(\partial\text{pl}/\partial\beta)(\beta^{*})$ converges in distribution as $N\rightarrow\infty$ to a multivariate normal with mean zero and variance matrix

[TABLE]

The simplification follows because $(I,-\Sigma_{10}\Sigma_{00}^{-1})(S_{13}^{\mathrm{\scriptscriptstyle T}},0,s_{33})^{\mathrm{\scriptscriptstyle T}}=0$ and

[TABLE]

Moreover, by Lemma S1(i) and Eq (S14), $-(\partial^{2}\text{pl}/\partial\beta\partial\beta^{\mathrm{\scriptscriptstyle T}})(\beta^{*})$ converges in probability as $N\rightarrow\infty$ to $U=\Sigma_{11}-\Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01}$ , which is identical to $V=S_{11}-s_{22}^{-1}S_{12}S_{21}-s_{33}^{-1}S_{13}S_{31}$ . Hence $\sqrt{N}(\hat{\beta}-\beta^{\ast})$ converges in distribution to $N(0,V^{-1})$ .

(ii) The result follows from the sandwich variance for M-estimation and direct calculation.
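For readers unfamiliar with the sandwich form $H^{-1}GH^{-1}$, the sketch below computes it for ordinary logistic regression on simulated i.i.d. data. This is the generic textbook version, offered as an illustration only: the matrices $G$ and $H$ of Lemma S2(ii) are defined for the labeled-data sampling scheme of the paper and need not coincide with the ones computed here. The simulated data and function names are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Unpenalized logistic regression by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def sandwich_variance(X, y, beta):
    """Generic M-estimation sandwich H^{-1} G H^{-1}."""
    n = X.shape[0]
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    H = (X * (p * (1.0 - p))[:, None]).T @ X / n      # average negative Hessian
    G = (X * ((y - p) ** 2)[:, None]).T @ X / n       # average outer product of scores
    H_inv = np.linalg.inv(H)
    return H_inv @ G @ H_inv

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 1.0, -1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = fit_logistic(X, y)
V = sandwich_variance(X, y, beta_hat)    # estimates the variance of sqrt(n)*(beta_hat - beta)
print(beta_hat, np.sqrt(np.diag(V) / n))
```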

IV.6 Proof of Lemma S3

By Lemma S2, we have

[TABLE]

where the second equality holds because $\kappa_{\ell}$ is based on labeled data $\{x_{1i}\}$ and $\{x_{2i}\}$ only and hence independent of

[TABLE]

It suffices to show that the two inner products on the right-hand side of Eq (S15) are

[TABLE]

The calculation proceeds in a similar way as in the proof of Lemma S1. Because $\partial{\kappa}/\partial{\beta}$, $\partial{\kappa_{\ell}}/\partial{\beta}$, and $\partial{\kappa}/\partial{\alpha}$ all have mean $0$ and $\{x_{ji}:i=1,\ldots,n_{j}\}$ are independent and identically drawn from $P_{j}$, we have

[TABLE]

where

[TABLE]

For the first inner product, we calculate

[TABLE]

For the second inner product, we calculate

[TABLE]

Putting the foregoing results together, we obtain Eqs (S16) and (S17).

IV.7 Proof of Proposition 4

As in Proposition 1 of Tan (2009) or Lemma 1, it can be shown by Jensen’s inequality that

[TABLE]

where $\hat{\alpha}(\beta)$ is a minimizer of $\kappa_{Q}^{(t)}(\rho,\beta,\alpha)$ over $\alpha$, satisfying Eq (10). Then $\rho^{(t+1)}$ is a maximizer of $\text{pQ}^{(t)}(\rho,\beta)$ over $\rho$, independently of $\beta$, by direct calculation of the gradient. Hence it suffices to show that $\beta^{(t+1)}$ is a local (or global) maximizer of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})$ if and only if it is a local (or, respectively, global) maximizer of $\kappa_{Q}^{(t)}\{\rho^{(t+1)},\beta,\hat{\alpha}(\beta)\}$.

By some abuse of notation, denote $\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ as $\beta^{\mathrm{\scriptscriptstyle T}}x$ . Let $\mathcal{R}$ be a fixed open set of $\beta$ . Suppose that $\tilde{\beta}$ is a maximizer of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})$ over $\mathcal{R}$ . Then $\kappa_{Q}^{(t)}\{\rho^{(t+1)},\beta,\hat{\alpha}(\beta)\}\leq\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})\leq\kappa_{Q}^{(t)}(\rho^{(t+1)},\tilde{\beta},\alpha^{(t+1)})$ for any $\beta\in\mathcal{R}$ . To prove $\tilde{\beta}$ is a maximizer of $\kappa_{Q}^{(t)}\{\rho^{(t+1)},\beta,\hat{\alpha}(\beta)\}$ over $\mathcal{R}$ , we show that $\alpha^{(t+1)}$ is a minimizer of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\tilde{\beta},\alpha)$ , which then implies that $\kappa_{Q}^{(t)}\{\rho^{(t+1)},\beta,\hat{\alpha}(\beta)\}$ achieves a maximum value $\kappa_{Q}^{(t)}(\rho^{(t+1)},\tilde{\beta},\alpha^{(t+1)})$ at $\tilde{\beta}$ . Because $\tilde{\beta}$ is a maximizer of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})$ , the stationary condition in $\beta_{0}$ yields

[TABLE]

Combined with the definition of $\alpha^{(t+1)}$ in (18), this shows that $\alpha^{(t+1)}$ satisfies

[TABLE]

which is the stationary condition for minimization of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\tilde{\beta},\alpha)$ , convex in $\alpha$ .

Next suppose that $\hat{\beta}$ is a maximizer of $\kappa_{Q}^{(t)}\{\rho^{(t+1)},\beta,\hat{\alpha}(\beta)\}$ over $\mathcal{R}$ . Then $\{\hat{\beta},\hat{\alpha}(\hat{\beta})\}$ is a solution to the saddle-point problem, $\max_{\beta}\min_{\alpha}\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha)$ . The stationary condition in $\alpha$ gives

[TABLE]

The stationary condition in $\beta_{0}$ yields

[TABLE]

These two equations together imply that $\hat{\alpha}(\hat{\beta})=N^{-1}\sum_{j=1}^{3}\sum_{i=1}^{n_{j}}\text{E}^{(t)}u_{ji}=\alpha^{(t+1)}$ . Then the stationary condition for $\{\hat{\beta},\hat{\alpha}(\hat{\beta})\}$ to be a saddle point of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha)$ gives

[TABLE]

Because $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})$ is concave in $\beta$ as mentioned in Section 3.3, this implies that $\hat{\beta}$ is a maximizer of $\kappa_{Q}^{(t)}(\rho^{(t+1)},\beta,\alpha^{(t+1)})$ over $\mathcal{R}$ .

IV.8 Proof of Proposition 5

By construction, the estimate $(\beta^{c(t+1)}_{0},\beta_{1}^{(t+1)})$ satisfies the stationary condition for maximization of (20):

[TABLE]

Let $(\beta^{c(\infty)}_{0},\beta_{1}^{(\infty)})$ be the limit of the sequence $(\beta^{c(t)}_{0},\beta_{1}^{(t)})$ as $t\to\infty$ . Then $(\beta^{c(\infty)}_{0},\beta_{1}^{(\infty)})$ satisfies

[TABLE]

because $\text{E}^{(\infty)}u_{ji}=y_{ji}$ for $j=1,2$ , or $\{1+\exp(-\beta^{c(\infty)}_{0}-\beta_{1}^{(\infty){\mathrm{\scriptscriptstyle T}}}x_{ji})\}^{-1}$ if $j=3$ . This is precisely the score equation for the MLE of $(\beta^{c}_{0},\beta_{1})$ in logistic regression based on the labeled data only.
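The key step is that, at the limit, each unlabeled observation has an imputed response equal to its own fitted probability and therefore a zero residual. The sketch below illustrates this, assuming the stationary condition has the usual logistic score form $\sum_{j,i}(\text{E}u_{ji}-p_{ji})x_{ji}=0$; the displayed equation is not reproduced in this section, so that form is an assumption, and the data are simulated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(X, u, beta):
    """Logistic-type score: sum_i (u_i - p_i) x_i with p_i = sigmoid(x_i' beta)."""
    return X.T @ (u - sigmoid(X @ beta))

rng = np.random.default_rng(1)
X_lab = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y_lab = rng.integers(0, 2, size=30).astype(float)
X_unl = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])

beta = np.array([0.2, -0.7, 1.1])            # any candidate coefficient vector
u_unl = sigmoid(X_unl @ beta)                # imputed responses equal fitted probabilities

score_all = score(np.vstack([X_lab, X_unl]), np.concatenate([y_lab, u_unl]), beta)
score_lab = score(X_lab, y_lab, beta)
print(np.max(np.abs(score_all - score_lab)))
```

Because the unlabeled terms cancel for every `beta`, any root of the combined score is a root of the labeled-only score.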

IV.9 Proof of Proposition S1

Rewrite $(1,x^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$ as $x$ and $\beta_{0}+\beta_{1}^{\mathrm{\scriptscriptstyle T}}x$ as $\beta^{\mathrm{\scriptscriptstyle T}}x$ . Denote $\tilde{\alpha}=\tilde{\alpha}(\tilde{\rho})$ and

[TABLE]

Suppose $(\tilde{\rho},\tilde{\beta})$ is the maximizer of $\tilde{Q}^{(t)}(\rho,\beta)$. The stationary conditions in $(\rho,\beta_{0})$ give

[TABLE]

where $\tau_{1}=\gamma(1-\rho^{0})n_{3}/N$ and $\tau_{2}=\gamma\rho^{0}n_{3}/N$ . Taking a difference between Eq (S18) multiplied by $\tilde{\rho}(1-\tilde{\rho})$ and Eq (S19) yields

[TABLE]

In addition, Eq (S19) is equivalent to

[TABLE]

Combining Eq (S20), Eq (S21) and the definitions of $\tau_{1}$ and $\tau_{2}$ leads to Eq (S2).

V Experiment details

The 11 UCI datasets are available from https://archive.ics.uci.edu/ml/datasets.php and the 4 SSL benchmark datasets are from http://olivier.chapelle.cc/ssl-book/benchmarks.html. Table S1 gives the statistics of the datasets.

Each dataset is randomly divided into training and test data as described in Section 5. For the training set including labeled and unlabeled data, each feature is standardized to have mean 0 and variance 1. No further standardization is performed during cross validation.
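The standardization step can be sketched as below. The text specifies that features are standardized on the training set (labeled and unlabeled together); applying the same training means and standard deviations to the test data, the use of the population standard deviation, and the guard for constant features are assumptions made for this sketch.

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-feature mean and standard deviation from the full training set."""
    mean = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)       # assumption: leave constant features unscaled
    return mean, sd

def apply_standardizer(X, mean, sd):
    return (X - mean) / sd

rng = np.random.default_rng(0)
X_labeled = rng.normal(loc=3.0, scale=2.0, size=(100, 4))
X_unlabeled = rng.normal(loc=3.0, scale=2.0, size=(400, 4))
X_test = rng.normal(loc=3.0, scale=2.0, size=(50, 4))

X_train = np.vstack([X_labeled, X_unlabeled])
mean, sd = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, mean, sd)
Z_test = apply_standardizer(X_test, mean, sd)     # reuses training statistics
```

Since no further standardization is performed during cross validation, `mean` and `sd` would be computed once and reused for every fold.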

The methods RLR, ER, SVM, and TSVM are implemented using the following computer packages respectively:

• glmnet, https://cran.r-project.org/web/packages/glmnet/index.html,

• RSSL, https://cran.r-project.org/web/packages/RSSL/index.html,

• libsvm, https://www.csie.ntu.edu.tw/~cjlin/libsvm/, and

• SVM$^{\text{light}}$, http://svmlight.joachims.org/.

Our methods, pSLR and dSLR, are implemented in R. The code is available from the authors upon request.

For each method, the tuning parameters are selected by 5-fold cross validation over 8 possible values as follows. The search range for each tuning parameter is determined from exploratory experiments.

• RLR: The objective function for RLR is $n^{-1}\ell(\beta)+\lambda\|\beta_{1}\|_{2}^{2}$, where $\ell(\beta)$ is the negative log-likelihood function for logistic regression on labeled data. Possible values for the log ridge parameter $\log_{10}(\lambda)$ are fixed uniformly from $[-5,-1]$ for UCI datasets and from $[-4,0]$ for SSL benchmark datasets.

• ER: The objective function for ER is $N^{-1}\{\ell(\beta)+\lambda_{e}H(\beta)\}+\lambda\|\beta_{1}\|_{2}^{2}$, where $\ell(\beta)$ is the same as in RLR and $H(\beta)$ is the entropy regularizer on the unlabeled data. Possible values for $\lambda$ are fixed in the same manner as in RLR, and values for the entropy parameter $\lambda_{e}$ are fixed uniformly from $[0,1]$ for all datasets.

• pSLR and dSLR: Recall that the penalty function is $\mbox{pen}(\rho,\beta)=\lambda\|\beta_{1}\|_{2}^{2}+\gamma(1-\rho^{0})(n_{3}/N)\log(1-\rho)+\gamma\rho^{0}(n_{3}/N)\log\rho$. Possible values for the ridge parameter $\lambda$ are fixed in the same manner as in RLR and ER, and values for $\log_{10}(\gamma)$ are fixed uniformly from $[-2,2]$ for all datasets.

• SVM: SVM solves the following optimization problem

[TABLE]

where $\tilde{y}_{i}=2y_{i}-1\in\{-1,1\}$ for $i=1,\ldots,n$ . Possible values for $\log_{10}(C)$ are fixed uniformly from $[-2,2]$ for all datasets.

• TSVM: TSVM with the class balance constraint solves the following optimization problem

[TABLE]

where $\tilde{y}_{i}\in\{-1,1\}$ is the predicted label for $i=n+1,\ldots,N$. Possible values for $\log_{10}(C)$ are fixed uniformly from $[-2,2]$ for all datasets. The parameter $C^{\ast}$ is automatically tuned in the implementation of SVM$^{\text{light}}$.
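The tuning grids described in the list above can be written down as follows. This is a sketch under one reading of "fixed uniformly": 8 equally spaced values on the stated interval, endpoints included, with the spacing on the $\log_{10}$ scale wherever the interval is given for $\log_{10}$ of the parameter. The dictionary keys are illustrative names.

```python
import numpy as np

N_GRID = 8   # number of candidate values per tuning parameter

def log10_grid(lo, hi, k=N_GRID):
    """k values whose log10 is equally spaced on [lo, hi] (assumed reading)."""
    return np.logspace(lo, hi, k)

grids = {
    "lambda_uci": log10_grid(-5, -1),        # ridge parameter, UCI datasets
    "lambda_ssl": log10_grid(-4, 0),         # ridge parameter, SSL benchmark datasets
    "gamma": log10_grid(-2, 2),              # pSLR and dSLR penalty on rho
    "C": log10_grid(-2, 2),                  # SVM and TSVM cost parameter
    "lambda_e": np.linspace(0.0, 1.0, N_GRID),   # ER entropy parameter, linear scale
}

for name, values in grids.items():
    print(name, values)
```

For methods with two tuning parameters (ER, pSLR, dSLR), cross validation would then run over the product of the two corresponding grids.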

Logistic-type methods, RLR, ER, pSLR, and dSLR, are cross validated over the binomial deviance based on the labeled data, and SVM-type methods, SVM and TSVM, are cross validated over the accuracy. For our methods, the binomial deviance is computed on the CV test set, using the coefficient vector $(\hat{\beta}_{0}+\log(n^{\text{cv}}_{2}/n^{\text{cv}}_{1}),\hat{\beta}_{1}^{\mathrm{\scriptscriptstyle T}})^{\mathrm{\scriptscriptstyle T}}$, where $(n^{\text{cv}}_{1},n^{\text{cv}}_{2})$ are the class sizes of labeled data in the CV training set. Because the performance measures (binomial deviance and accuracy) are based on labeled data only, the entire set of unlabeled data is used without split in training during CV for semi-supervised methods. In the case of a tie, the smaller $\lambda$ or $C$ will be selected for RLR, SVM, and TSVM. For ER, pSLR, and dSLR, the smaller $\lambda$ and the larger $\gamma$ or $\lambda_{e}$ will be selected.
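The intercept shift and the deviance criterion used for pSLR and dSLR during cross validation can be sketched as below. The shift $\log(n^{\text{cv}}_{2}/n^{\text{cv}}_{1})$ is taken from the text; the scaling of the deviance as $-2$ times the average log-likelihood is a common convention assumed here, and the function names are hypothetical.

```python
import numpy as np

def adjust_intercept(beta0, beta1, n1_cv, n2_cv):
    """Shift the intercept by log(n2/n1), the labeled class-size ratio in the CV training set."""
    return beta0 + np.log(n2_cv / n1_cv), beta1

def binomial_deviance(X, y, beta0, beta1):
    """-2 * average Bernoulli log-likelihood under a logistic model."""
    eta = beta0 + X @ beta1
    loglik = y * eta - np.logaddexp(0.0, eta)    # stable form of log(1 + exp(eta))
    return -2.0 * loglik.mean()

rng = np.random.default_rng(0)
X_cv_test = rng.normal(size=(20, 3))
y_cv_test = rng.integers(0, 2, size=20).astype(float)

b0, b1 = adjust_intercept(-0.3, np.array([0.8, 0.0, -0.5]), n1_cv=45, n2_cv=35)
print(binomial_deviance(X_cv_test, y_cv_test, b0, b1))
```

The tuning value with the smallest average deviance across the 5 folds would be selected, with ties broken as described above.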

VI Additional experiment results

For labeled training data size 100, Table S2 presents the accuracy results where intercept adjustment is applied in the Homo Prop scheme, but not applied in the Flip Prop scheme. Table S3 presents the AUC results, which are not affected by whether intercept adjustment is applied. Comparison between Tables 1 and S2 shows that, with intercept adjustment versus no adjustment, the accuracies of the methods, RLR, ER, pSLR, dSLR, and SVM, are decreased only slightly in the Homo Prop scheme, but become substantially improved in the Flip Prop scheme.

For labeled training data size 25, Table S4 presents the accuracy results similarly as in Table 1, where intercept adjustment is not applied in the Homo Prop scheme, but applied in the Flip Prop scheme. Table S5 presents the accuracy results where intercept adjustment is applied in the Homo Prop scheme, but not applied in the Flip Prop scheme. Table S6 presents the AUC results, which are not affected by whether intercept adjustment is applied.
