Multi-Label Learning with Global and Local Label Correlation

Yue Zhu; James T. Kwok; Zhi-Hua Zhou

arXiv:1704.01415·cs.LG·April 6, 2017

Open Access

TL;DR

This paper introduces GLOCAL, a multi-label learning method that simultaneously exploits global and local label correlations, effectively handling both complete and partial label data through latent label representations and manifold optimization.

Contribution

It proposes a novel approach that models both global and local label correlations in multi-label learning, addressing partial label issues and improving correlation estimation.

Findings

1. Effective on full-label data

2. Handles missing labels well

3. Outperforms existing methods

Abstract

It is well known that exploiting label correlations is important to multi-label learning. Existing approaches assume either that the label correlations are global and shared by all instances, or that they are local and shared only by a data subset. In real-world applications, both cases may occur: some label correlations are globally applicable, while others are shared only within a local group of instances. Moreover, it is common that only partial labels are observed, which makes exploiting the label correlations much more difficult, since correlations are hard to estimate when many labels are absent. In this paper, we propose a new multi-label approach, GLOCAL, that deals with both the full-label and the missing-label cases and exploits global and local label correlations simultaneously by learning a latent label…


Tables (6)

Table 1: Datasets used in the experiments ("#instance" is the number of instances, "#dim" is the feature dimensionality, "#label" is the total size of the class label set, and "#label/instance" is the average number of labels possessed by each instance).

Dataset	#instance	#dim	#label	#label/instance	Dataset	#instance	#dim	#label	#label/instance
Arts	5,000	462	26	1.64	Business	5,000	438	30	1.59
Computers	5,000	681	33	1.51	Education	5,000	550	33	1.46
Entertainment	5,000	640	21	1.42	Health	5,000	612	32	1.66
Recreation	5,000	606	22	1.42	Reference	5,000	793	33	1.17
Science	5,000	743	40	1.45	Social	5,000	1,047	39	1.28
Society	5,000	636	27	1.69	Enron	1,702	1,001	53	3.37
Corel5k	5,000	499	374	3.52	Image	2,000	294	5	1.24

Table 2: Results for learning with full labels. ↑ (↓) denotes the larger (smaller) the better. ∙ indicates that GLOCAL is significantly better (paired t-tests at 95% significance level).

	Measure	BR	MLLOC	LEML	ML-LRC	GLOCAL
Arts	Rkl ( $↓$ )	0.201 $\pm$ 0.005 $∙$	0.177 $\pm$ 0.013 $∙$	0.170 $\pm$ 0.005 $∙$	0.157 $\pm$ 0.002 $∙$	0.138 $\pm$ 0.002
	Auc ( $↑$ )	0.799 $\pm$ 0.006 $∙$	0.823 $\pm$ 0.013 $∙$	0.833 $\pm$ 0.005 $∙$	0.843 $\pm$ 0.001	0.846 $\pm$ 0.005
	Cvg ( $↓$ )	7.347 $\pm$ 0.196 $∙$	6.762 $\pm$ 0.344 $∙$	6.337 $\pm$ 0.243 $∙$	5.529 $\pm$ 0.037 $∙$	5.347 $\pm$ 0.146
	Ap ( $↑$ )	0.594 $\pm$ 0.006 $∙$	0.606 $\pm$ 0.006 $∙$	0.590 $\pm$ 0.005 $∙$	0.600 $\pm$ 0.007 $∙$	0.619 $\pm$ 0.005
Business	Rkl ( $↓$ )	0.072 $\pm$ 0.005 $∙$	0.055 $\pm$ 0.009 $∙$	0.056 $\pm$ 0.005 $∙$	0.044 $\pm$ 0.002	0.044 $\pm$ 0.002
	Auc ( $↑$ )	0.928 $\pm$ 0.005 $∙$	0.944 $\pm$ 0.008 $∙$	0.945 $\pm$ 0.005 $∙$	0.950 $\pm$ 0.005	0.955 $\pm$ 0.003
	Cvg ( $↓$ )	4.087 $\pm$ 0.268 $∙$	3.265 $\pm$ 0.464 $∙$	3.187 $\pm$ 0.270 $∙$	2.560 $\pm$ 0.059	2.559 $\pm$ 0.169
	Ap ( $↑$ )	0.861 $\pm$ 0.007 $∙$	0.878 $\pm$ 0.011 $∙$	0.867 $\pm$ 0.007 $∙$	0.870 $\pm$ 0.005 $∙$	0.883 $\pm$ 0.004
Computers	Rkl ( $↓$ )	0.146 $\pm$ 0.007 $∙$	0.134 $\pm$ 0.014 $∙$	0.138 $\pm$ 0.004 $∙$	0.107 $\pm$ 0.002	0.107 $\pm$ 0.002
	Auc ( $↑$ )	0.854 $\pm$ 0.007 $∙$	0.866 $\pm$ 0.014 $∙$	0.895 $\pm$ 0.002	0.894 $\pm$ 0.002	0.895 $\pm$ 0.002
	Cvg ( $↓$ )	6.654 $\pm$ 0.236 $∙$	6.224 $\pm$ 0.480 $∙$	6.148 $\pm$ 0.183 $∙$	4.893 $\pm$ 0.142	4.889 $\pm$ 0.058
	Ap ( $↑$ )	0.680 $\pm$ 0.007 $∙$	0.689 $\pm$ 0.009 $∙$	0.669 $\pm$ 0.007 $∙$	0.689 $\pm$ 0.005 $∙$	0.698 $\pm$ 0.004
Education	Rkl ( $↓$ )	0.203 $\pm$ 0.010 $∙$	0.158 $\pm$ 0.021 $∙$	0.145 $\pm$ 0.008 $∙$	0.099 $\pm$ 0.002 $∙$	0.095 $\pm$ 0.002
	Auc ( $↑$ )	0.797 $\pm$ 0.102 $∙$	0.842 $\pm$ 0.022 $∙$	0.859 $\pm$ 0.008 $∙$	0.868 $\pm$ 0.006	0.878 $\pm$ 0.006
	Cvg ( $↓$ )	8.979 $\pm$ 0.487 $∙$	7.381 $\pm$ 0.765 $∙$	6.711 $\pm$ 0.364 $∙$	4.531 $\pm$ 0.104	4.529 $\pm$ 0.206
	Ap ( $↑$ )	0.580 $\pm$ 0.010 $∙$	0.613 $\pm$ 0.004 $∙$	0.596 $\pm$ 0.009 $∙$	0.600 $\pm$ 0.007 $∙$	0.628 $\pm$ 0.009
Entertainment	Rkl ( $↓$ )	0.185 $\pm$ 0.006 $∙$	0.146 $\pm$ 0.013 $∙$	0.154 $\pm$ 0.005 $∙$	0.130 $\pm$ 0.005 $∙$	0.108 $\pm$ 0.004
	Auc ( $↑$ )	0.815 $\pm$ 0.006 $∙$	0.854 $\pm$ 0.013 $∙$	0.852 $\pm$ 0.005 $∙$	0.871 $\pm$ 0.003	0.874 $\pm$ 0.005
	Cvg ( $↓$ )	5.006 $\pm$ 0.160 $∙$	4.293 $\pm$ 0.344 $∙$	4.193 $\pm$ 0.139 $∙$	3.505 $\pm$ 0.125 $∙$	3.114 $\pm$ 0.110
	Ap ( $↑$ )	0.662 $\pm$ 0.009 $∙$	0.670 $\pm$ 0.005 $∙$	0.647 $\pm$ 0.007 $∙$	0.661 $\pm$ 0.012 $∙$	0.681 $\pm$ 0.008
Health	Rkl ( $↓$ )	0.113 $\pm$ 0.001 $∙$	0.093 $\pm$ 0.005 $∙$	0.091 $\pm$ 0.003 $∙$	0.071 $\pm$ 0.003 $∙$	0.065 $\pm$ 0.002
	Auc ( $↑$ )	0.886 $\pm$ 0.003 $∙$	0.907 $\pm$ 0.005 $∙$	0.913 $\pm$ 0.004 $∙$	0.929 $\pm$ 0.009	0.923 $\pm$ 0.007
	Cvg ( $↓$ )	6.193 $\pm$ 0.059 $∙$	5.403 $\pm$ 0.157 $∙$	5.063 $\pm$ 0.128 $∙$	3.751 $\pm$ 0.128	3.858 $\pm$ 0.131
	Ap ( $↑$ )	0.763 $\pm$ 0.002 $∙$	0.777 $\pm$ 0.004 $∙$	0.750 $\pm$ 0.003 $∙$	0.755 $\pm$ 0.006 $∙$	0.782 $\pm$ 0.001
Recreation	Rkl ( $↓$ )	0.197 $\pm$ 0.003 $∙$	0.184 $\pm$ 0.015 $∙$	0.185 $\pm$ 0.001 $∙$	0.170 $\pm$ 0.004 $∙$	0.155 $\pm$ 0.002
	Auc ( $↑$ )	0.802 $\pm$ 0.003 $∙$	0.816 $\pm$ 0.015 $∙$	0.822 $\pm$ 0.002 $∙$	0.833 $\pm$ 0.004 $∙$	0.840 $\pm$ 0.000
	Cvg ( $↓$ )	5.506 $\pm$ 0.089 $∙$	5.268 $\pm$ 0.333 $∙$	5.110 $\pm$ 0.040 $∙$	4.515 $\pm$ 0.045 $∙$	4.431 $\pm$ 0.048
	Ap ( $↑$ )	0.609 $\pm$ 0.005 $∙$	0.620 $\pm$ 0.004 $∙$	0.595 $\pm$ 0.004 $∙$	0.604 $\pm$ 0.003 $∙$	0.625 $\pm$ 0.004
Reference	Rkl ( $↓$ )	0.155 $\pm$ 0.005 $∙$	0.138 $\pm$ 0.008 $∙$	0.137 $\pm$ 0.004 $∙$	0.092 $\pm$ 0.003 $∙$	0.086 $\pm$ 0.003
	Auc ( $↑$ )	0.845 $\pm$ 0.005 $∙$	0.862 $\pm$ 0.008 $∙$	0.872 $\pm$ 0.004 $∙$	0.900 $\pm$ 0.006	0.894 $\pm$ 0.004
	Cvg ( $↓$ )	6.171 $\pm$ 0.219 $∙$	5.514 $\pm$ 0.309 $∙$	5.277 $\pm$ 0.171 $∙$	3.438 $\pm$ 0.133	3.387 $\pm$ 0.118
	Ap ( $↑$ )	0.685 $\pm$ 0.005 $∙$	0.688 $\pm$ 0.003	0.667 $\pm$ 0.003 $∙$	0.667 $\pm$ 0.007 $∙$	0.688 $\pm$ 0.007
Science	Rkl ( $↓$ )	0.197 $\pm$ 0.009 $∙$	0.166 $\pm$ 0.017 $∙$	0.170 $\pm$ 0.005 $∙$	0.131 $\pm$ 0.002 $∙$	0.118 $\pm$ 0.003
	Auc ( $↑$ )	0.802 $\pm$ 0.010 $∙$	0.834 $\pm$ 0.018	0.834 $\pm$ 0.005 $∙$	0.860 $\pm$ 0.003	0.853 $\pm$ 0.010
	Cvg ( $↓$ )	10.189 $\pm$ 0.435 $∙$	8.867 $\pm$ 0.751 $∙$	8.885 $\pm$ 0.197 $∙$	6.704 $\pm$ 0.122 $∙$	6.434 $\pm$ 0.137
	Ap ( $↑$ )	0.568 $\pm$ 0.012 $∙$	0.581 $\pm$ 0.009	0.551 $\pm$ 0.008 $∙$	0.561 $\pm$ 0.009 $∙$	0.580 $\pm$ 0.009
Social	Rkl ( $↓$ )	0.112 $\pm$ 0.001 $∙$	0.094 $\pm$ 0.013 $∙$	0.106 $\pm$ 0.006 $∙$	0.075 $\pm$ 0.005	0.075 $\pm$ 0.005
	Auc ( $↑$ )	0.888 $\pm$ 0.002 $∙$	0.906 $\pm$ 0.013 $∙$	0.894 $\pm$ 0.006 $∙$	0.917 $\pm$ 0.005	0.915 $\pm$ 0.005
	Cvg ( $↓$ )	6.036 $\pm$ 0.125 $∙$	5.147 $\pm$ 0.401 $∙$	5.521 $\pm$ 0.301 $∙$	4.651 $\pm$ 0.102	4.537 $\pm$ 0.258
	Ap ( $↑$ )	0.724 $\pm$ 0.005 $∙$	0.764 $\pm$ 0.008	0.731 $\pm$ 0.005 $∙$	0.719 $\pm$ 0.003 $∙$	0.758 $\pm$ 0.008
Society	Rkl ( $↓$ )	0.204 $\pm$ 0.004 $∙$	0.182 $\pm$ 0.006 $∙$	0.182 $\pm$ 0.007 $∙$	0.142 $\pm$ 0.002 $∙$	0.136 $\pm$ 0.005
	Auc ( $↑$ )	0.796 $\pm$ 0.005 $∙$	0.818 $\pm$ 0.006 $∙$	0.822 $\pm$ 0.008 $∙$	0.840 $\pm$ 0.006	0.844 $\pm$ 0.006
	Cvg ( $↓$ )	8.048 $\pm$ 0.108 $∙$	7.392 $\pm$ 0.216 $∙$	7.438 $\pm$ 0.162 $∙$	5.973 $\pm$ 0.108	5.852 $\pm$ 0.194
	Ap ( $↑$ )	0.610 $\pm$ 0.007 $∙$	0.623 $\pm$ 0.004 $∙$	0.599 $\pm$ 0.006 $∙$	0.605 $\pm$ 0.006 $∙$	0.633 $\pm$ 0.009
Enron	Rkl ( $↓$ )	0.194 $\pm$ 0.006 $∙$	0.169 $\pm$ 0.012 $∙$	0.159 $\pm$ 0.005 $∙$	0.133 $\pm$ 0.004 $∙$	0.125 $\pm$ 0.004
	Auc ( $↑$ )	0.806 $\pm$ 0.006 $∙$	0.831 $\pm$ 0.009 $∙$	0.851 $\pm$ 0.006 $∙$	0.869 $\pm$ 0.004 $∙$	0.877 $\pm$ 0.005
	Cvg ( $↓$ )	23.618 $\pm$ 0.450 $∙$	21.724 $\pm$ 0.950 $∙$	18.531 $\pm$ 0.707 $∙$	16.654 $\pm$ 0.198	16.737 $\pm$ 0.622
	Ap ( $↑$ )	0.575 $\pm$ 0.006 $∙$	0.586 $\pm$ 0.009 $∙$	0.600 $\pm$ 0.004 $∙$	0.591 $\pm$ 0.004 $∙$	0.647 $\pm$ 0.006
Corel5k	Rkl ( $↓$ )	0.271 $\pm$ 0.006 $∙$	0.230 $\pm$ 0.012 $∙$	0.246 $\pm$ 0.004 $∙$	0.170 $\pm$ 0.002	0.173 $\pm$ 0.005
	Auc ( $↑$ )	0.699 $\pm$ 0.006 $∙$	0.757 $\pm$ 0.012 $∙$	0.754 $\pm$ 0.005 $∙$	0.825 $\pm$ 0.005	0.827 $\pm$ 0.005
	Cvg ( $↓$ )	261.99 $\pm$ 3.15 $∙$	201.80 $\pm$ 6.71 $∙$	184.58 $\pm$ 1.72 $∙$	137.31 $\pm$ 2.49	136.91 $\pm$ 3.21
	Ap ( $↑$ )	0.153 $\pm$ 0.001 $∙$	0.182 $\pm$ 0.005 $∙$	0.188 $\pm$ 0.004 $∙$	0.198 $\pm$ 0.003	0.200 $\pm$ 0.004
Image	Rkl ( $↓$ )	0.181 $\pm$ 0.011	0.180 $\pm$ 0.008	0.181 $\pm$ 0.012	0.180 $\pm$ 0.009	0.179 $\pm$ 0.004
	Auc ( $↑$ )	0.812 $\pm$ 0.011	0.810 $\pm$ 0.012	0.786 $\pm$ 0.005 $∙$	0.748 $\pm$ 0.010 $∙$	0.819 $\pm$ 0.009
	Cvg ( $↓$ )	1.004 $\pm$ 0.050	0.975 $\pm$ 0.060	1.000 $\pm$ 0.027	1.000 $\pm$ 0.019	0.975 $\pm$ 0.054
	Ap ( $↑$ )	0.788 $\pm$ 0.008	0.794 $\pm$ 0.010	0.790 $\pm$ 0.008	0.790 $\pm$ 0.010	0.795 $\pm$ 0.007

Table 3: Results for learning with full labels on the small clusters (each containing fewer than 5% of the samples). ↑ (↓) denotes the larger (smaller) the better. ∙ indicates that GLOCAL is significantly better (paired t-tests at 95% significance level).

		GLObal	loCAL	GLOCAL			GLObal	loCAL	GLOCAL
Art	Rkl ( $↓$ )	0.137 $\pm$ 0.003 $∙$	0.137 $\pm$ 0.002 $∙$	0.130 $\pm$ 0.005	Bus	Rkl ( $↓$ )	0.040 $\pm$ 0.002	0.040 $\pm$ 0.002	0.040 $\pm$ 0.003
	Auc ( $↑$ )	0.863 $\pm$ 0.003 $∙$	0.863 $\pm$ 0.002 $∙$	0.870 $\pm$ 0.005		Auc ( $↑$ )	0.958 $\pm$ 0.003	0.958 $\pm$ 0.003	0.958 $\pm$ 0.003
	Cvg ( $↓$ )	5.286 $\pm$ 0.046 $∙$	5.286 $\pm$ 0.046 $∙$	5.197 $\pm$ 0.065		Cvg ( $↓$ )	2.529 $\pm$ 0.035	2.528 $\pm$ 0.040	2.528 $\pm$ 0.040
	Ap ( $↑$ )	0.602 $\pm$ 0.013 $∙$	0.602 $\pm$ 0.010 $∙$	0.631 $\pm$ 0.011		Ap ( $↑$ )	0.882 $\pm$ 0.002 $∙$	0.882 $\pm$ 0.002 $∙$	0.886 $\pm$ 0.003
Com	Rkl ( $↓$ )	0.095 $\pm$ 0.002 $∙$	0.095 $\pm$ 0.002 $∙$	0.092 $\pm$ 0.002	Edu	Rkl ( $↓$ )	0.101 $\pm$ 0.002 $∙$	0.101 $\pm$ 0.002 $∙$	0.097 $\pm$ 0.002
	Auc ( $↑$ )	0.905 $\pm$ 0.002 $∙$	0.905 $\pm$ 0.002 $∙$	0.908 $\pm$ 0.001		Auc ( $↑$ )	0.899 $\pm$ 0.002 $∙$	0.899 $\pm$ 0.002 $∙$	0.903 $\pm$ 0.002
	Cvg ( $↓$ )	4.482 $\pm$ 0.032 $∙$	4.486 $\pm$ 0.040 $∙$	4.364 $\pm$ 0.055		Cvg ( $↓$ )	4.803 $\pm$ 0.033 $∙$	4.805 $\pm$ 0.036 $∙$	4.672 $\pm$ 0.051
	Ap ( $↑$ )	0.677 $\pm$ 0.003	0.676 $\pm$ 0.003	0.678 $\pm$ 0.005		Ap ( $↑$ )	0.605 $\pm$ 0.003 $∙$	0.605 $\pm$ 0.003 $∙$	0.624 $\pm$ 0.005
Ent	Rkl ( $↓$ )	0.091 $\pm$ 0.002 $∙$	0.091 $\pm$ 0.002 $∙$	0.086 $\pm$ 0.003	Hea	Rkl ( $↓$ )	0.054 $\pm$ 0.002	0.054 $\pm$ 0.003	0.053 $\pm$ 0.004
	Auc ( $↑$ )	0.909 $\pm$ 0.002 $∙$	0.909 $\pm$ 0.002 $∙$	0.914 $\pm$ 0.002		Auc ( $↑$ )	0.945 $\pm$ 0.003	0.946 $\pm$ 0.003	0.947 $\pm$ 0.003
	Cvg ( $↓$ )	2.817 $\pm$ 0.027 $∙$	2.797 $\pm$ 0.035 $∙$	2.709 $\pm$ 0.059		Cvg ( $↓$ )	3.508 $\pm$ 0.036	3.506 $\pm$ 0.049	3.504 $\pm$ 0.041
	Ap ( $↑$ )	0.748 $\pm$ 0.003 $∙$	0.749 $\pm$ 0.004 $∙$	0.759 $\pm$ 0.006		Ap ( $↑$ )	0.810 $\pm$ 0.004	0.810 $\pm$ 0.004	0.812 $\pm$ 0.006
Rec	Rkl ( $↓$ )	0.124 $\pm$ 0.002 $∙$	0.124 $\pm$ 0.002 $∙$	0.118 $\pm$ 0.002	Ref	Rkl ( $↓$ )	0.060 $\pm$ 0.002 $∙$	0.061 $\pm$ 0.003 $∙$	0.054 $\pm$ 0.004
	Auc ( $↑$ )	0.871 $\pm$ 0.003	0.870 $\pm$ 0.003	0.872 $\pm$ 0.004		Auc ( $↑$ )	0.940 $\pm$ 0.003 $∙$	0.939 $\pm$ 0.004 $∙$	0.946 $\pm$ 0.004
	Cvg ( $↓$ )	3.704 $\pm$ 0.033	3.700 $\pm$ 0.037	3.700 $\pm$ 0.042		Cvg ( $↓$ )	2.552 $\pm$ 0.043 $∙$	2.559 $\pm$ 0.057 $∙$	2.325 $\pm$ 0.060
	Ap ( $↑$ )	0.670 $\pm$ 0.004	0.670 $\pm$ 0.004	0.672 $\pm$ 0.005		Ap ( $↑$ )	0.739 $\pm$ 0.004 $∙$	0.739 $\pm$ 0.004 $∙$	0.783 $\pm$ 0.005
Sci	Rkl ( $↓$ )	0.107 $\pm$ 0.004	0.108 $\pm$ 0.004	0.107 $\pm$ 0.004	Soc	Rkl ( $↓$ )	0.063 $\pm$ 0.002 $∙$	0.063 $\pm$ 0.002 $∙$	0.060 $\pm$ 0.002
	Auc ( $↑$ )	0.893 $\pm$ 0.004	0.892 $\pm$ 0.004	0.893 $\pm$ 0.005		Auc ( $↑$ )	0.930 $\pm$ 0.002 $∙$	0.930 $\pm$ 0.002 $∙$	0.934 $\pm$ 0.002
	Cvg ( $↓$ )	5.937 $\pm$ 0.041 $∙$	5.941 $\pm$ 0.049 $∙$	5.845 $\pm$ 0.054		Cvg ( $↓$ )	3.558 $\pm$ 0.033	3.559 $\pm$ 0.038	3.552 $\pm$ 0.049
	Ap ( $↑$ )	0.608 $\pm$ 0.003	0.608 $\pm$ 0.003	0.610 $\pm$ 0.003		Ap ( $↑$ )	0.797 $\pm$ 0.002	0.797 $\pm$ 0.003	0.798 $\pm$ 0.003
Soci	Rkl ( $↓$ )	0.126 $\pm$ 0.003 $∙$	0.126 $\pm$ 0.005 $∙$	0.113 $\pm$ 0.005	Enr	Rkl ( $↓$ )	0.117 $\pm$ 0.002 $∙$	0.119 $\pm$ 0.003 $∙$	0.105 $\pm$ 0.005
	Auc ( $↑$ )	0.874 $\pm$ 0.003 $∙$	0.874 $\pm$ 0.004 $∙$	0.887 $\pm$ 0.005		Auc ( $↑$ )	0.883 $\pm$ 0.004 $∙$	0.881 $\pm$ 0.004 $∙$	0.895 $\pm$ 0.004
	Cvg ( $↓$ )	5.554 $\pm$ 0.047 $∙$	5.553 $\pm$ 0.053 $∙$	5.208 $\pm$ 0.059		Cvg ( $↓$ )	19.440 $\pm$ 0.833 $∙$	19.372 $\pm$ 0.915 $∙$	17.511 $\pm$ 1.231
	Ap ( $↑$ )	0.670 $\pm$ 0.004 $∙$	0.670 $\pm$ 0.005 $∙$	0.711 $\pm$ 0.005		Ap ( $↑$ )	0.685 $\pm$ 0.005 $∙$	0.673 $\pm$ 0.005 $∙$	0.706 $\pm$ 0.007
Cor	Rkl ( $↓$ )	0.163 $\pm$ 0.002 $∙$	0.163 $\pm$ 0.002 $∙$	0.160 $\pm$ 0.002	Ima	Rkl ( $↓$ )	0.197 $\pm$ 0.003 $∙$	0.199 $\pm$ 0.004 $∙$	0.190 $\pm$ 0.004
	Auc ( $↑$ )	0.837 $\pm$ 0.002 $∙$	0.837 $\pm$ 0.002 $∙$	0.840 $\pm$ 0.002		Auc ( $↑$ )	0.803 $\pm$ 0.003 $∙$	0.801 $\pm$ 0.003 $∙$	0.810 $\pm$ 0.003
	Cvg ( $↓$ )	130.84 $\pm$ 1.01 $∙$	131.13 $\pm$ 1.21 $∙$	128.40 $\pm$ 1.30		Cvg ( $↓$ )	1.064 $\pm$ 0.015 $∙$	1.066 $\pm$ 0.021 $∙$	1.027 $\pm$ 0.027
	Ap ( $↑$ )	0.212 $\pm$ 0.003	0.212 $\pm$ 0.003	0.214 $\pm$ 0.005		Ap ( $↑$ )	0.764 $\pm$ 0.003 $∙$	0.763 $\pm$ 0.004 $∙$	0.771 $\pm$ 0.005

Table 4: Recovery results for missing label data on ranking loss (Rkl), average AUC (Auc), coverage (Cvg) and average precision (Ap). ↑ (↓) denotes the larger (smaller) the better. ∙ indicates that GLOCAL is significantly better (paired t-tests at 95% significance level).

	Measure	$ρ$	MAXIDE	LEML	ML-LRC	GLOCAL		Measure	$ρ$	MAXIDE	LEML	ML-LRC	GLOCAL
Art	Rkl ( $↓$ )	30	0.131 $∙$	0.133 $∙$	0.137 $∙$	0.103	Bus	Rkl ( $↓$ )	30	0.044 $∙$	0.046 $∙$	0.046 $∙$	0.029
	Rkl ( $↓$ )	70	0.083 $∙$	0.090 $∙$	0.083 $∙$	0.074		Rkl ( $↓$ )	70	0.026 $∙$	0.027 $∙$	0.024 $∙$	0.021
	Auc ( $↑$ )	30	0.871 $∙$	0.848 $∙$	0.879 $∙$	0.897		Auc ( $↑$ )	30	0.956 $∙$	0.954 $∙$	0.954 $∙$	0.971
	Auc ( $↑$ )	70	0.918 $∙$	0.912 $∙$	0.910 $∙$	0.928		Auc ( $↑$ )	70	0.974 $∙$	0.973 $∙$	0.974 $∙$	0.979
	Cvg ( $↓$ )	30	5.195 $∙$	5.231 $∙$	5.161 $∙$	4.189		Cvg ( $↓$ )	30	2.550 $∙$	2.622 $∙$	2.622 $∙$	1.830
	Cvg ( $↓$ )	70	3.616 $∙$	3.733 $∙$	3.778 $∙$	3.234		Cvg ( $↓$ )	70	1.742 $∙$	1.783 $∙$	1.746 $∙$	1.477
	Ap ( $↑$ )	30	0.645 $∙$	0.634 $∙$	0.640 $∙$	0.652		Ap ( $↑$ )	30	0.876 $∙$	0.878 $∙$	0.876 $∙$	0.893
	Ap ( $↑$ )	70	0.720	0.720	0.709 $∙$	0.720		Ap ( $↑$ )	70	0.905 $∙$	0.901 $∙$	0.903 $∙$	0.908
Com	Rkl ( $↓$ )	30	0.101 $∙$	0.098 $∙$	0.097 $∙$	0.073	Edu	Rkl ( $↓$ )	30	0.097 $∙$	0.093 $∙$	0.089 $∙$	0.069
	Rkl ( $↓$ )	70	0.059 $∙$	0.063 $∙$	0.061 $∙$	0.052		Rkl ( $↓$ )	70	0.061 $∙$	0.061 $∙$	0.061 $∙$	0.058
	Auc ( $↑$ )	30	0.905 $∙$	0.908 $∙$	0.909 $∙$	0.933		Auc ( $↑$ )	30	0.902 $∙$	0.907 $∙$	0.911 $∙$	0.932
	Auc ( $↑$ )	70	0.947 $∙$	0.943 $∙$	0.945 $∙$	0.955		Auc ( $↑$ )	70	0.938 $∙$	0.938 $∙$	0.940	0.942
	Cvg ( $↓$ )	30	4.627 $∙$	4.586 $∙$	4.565 $∙$	3.511		Cvg ( $↓$ )	30	4.672 $∙$	4.372 $∙$	3.914 $∙$	3.171
	Cvg ( $↓$ )	70	2.912 $∙$	3.100 $∙$	3.095 $∙$	2.586		Cvg ( $↓$ )	70	3.113 $∙$	3.106 $∙$	3.000	2.815
	Ap ( $↑$ )	30	0.709 $∙$	0.700 $∙$	0.705 $∙$	0.726		Ap ( $↑$ )	30	0.653	0.648 $∙$	0.653	0.655
	Ap ( $↑$ )	70	0.787	0.787	0.787	0.787		Ap ( $↑$ )	70	0.711	0.702 $∙$	0.710	0.711
Ent	Rkl ( $↓$ )	30	0.104 $∙$	0.103 $∙$	0.106 $∙$	0.085	Hea	Rkl ( $↓$ )	30	0.060 $∙$	0.057 $∙$	0.054 $∙$	0.041
	Rkl ( $↓$ )	70	0.063	0.063	0.063	0.062		Rkl ( $↓$ )	70	0.037 $∙$	0.036 $∙$	0.032	0.030
	Auc ( $↑$ )	30	0.898 $∙$	0.899 $∙$	0.899 $∙$	0.916		Auc ( $↑$ )	30	0.941 $∙$	0.943 $∙$	0.947 $∙$	0.960
	Auc ( $↑$ )	70	0.940	0.938	0.940	0.940		Auc ( $↑$ )	70	0.964 $∙$	0.964 $∙$	0.968	0.971
	Cvg ( $↓$ )	30	3.058 $∙$	2.994 $∙$	3.022 $∙$	2.512		Cvg ( $↓$ )	30	3.577 $∙$	3.462 $∙$	3.465 $∙$	2.567
	Cvg ( $↓$ )	70	1.987	2.051	2.080	1.957		Cvg ( $↓$ )	70	2.524 $∙$	2.465 $∙$	2.450 $∙$	2.152
	Ap ( $↑$ )	30	0.704	0.698 $∙$	0.698 $∙$	0.704		Ap ( $↑$ )	30	0.796 $∙$	0.794 $∙$	0.798	0.801
	Ap ( $↑$ )	70	0.763 $∙$	0.765	0.765	0.768		Ap ( $↑$ )	70	0.848	0.842 $∙$	0.848	0.848
Rec	Rkl ( $↓$ )	30	0.130 $∙$	0.133 $∙$	0.135 $∙$	0.110	Ref	Rkl ( $↓$ )	30	0.083 $∙$	0.083 $∙$	0.083 $∙$	0.063
	Rkl ( $↓$ )	70	0.078 $∙$	0.080 $∙$	0.080 $∙$	0.068		Rkl ( $↓$ )	70	0.048	0.049	0.049	0.048
	Auc ( $↑$ )	30	0.873 $∙$	0.870 $∙$	0.869 $∙$	0.895		Auc ( $↑$ )	30	0.919 $∙$	0.919 $∙$	0.918 $∙$	0.939
	Auc ( $↑$ )	70	0.925 $∙$	0.923 $∙$	0.920 $∙$	0.934		Auc ( $↑$ )	70	0.955	0.953	0.953	0.955
	Cvg ( $↓$ )	30	3.899 $∙$	3.919 $∙$	4.048 $∙$	3.291		Cvg ( $↓$ )	30	3.436 $∙$	3.392 $∙$	3.372 $∙$	2.520
	Cvg ( $↓$ )	70	2.560 $∙$	2.607 $∙$	2.620 $∙$	2.262		Cvg ( $↓$ )	70	2.039 $∙$	2.103 $∙$	2.195 $∙$	1.972
	Ap ( $↑$ )	30	0.680 $∙$	0.663 $∙$	0.660 $∙$	0.681		Ap ( $↑$ )	30	0.681	0.664 $∙$	0.674	0.679
	Ap ( $↑$ )	70	0.767 $∙$	0.763 $∙$	0.760 $∙$	0.770		Ap ( $↑$ )	70	0.745	0.746	0.746	0.746
Sci	Rkl ( $↓$ )	30	0.110 $∙$	0.111 $∙$	0.110 $∙$	0.086	Soc	Rkl ( $↓$ )	30	0.069 $∙$	0.069 $∙$	0.063 $∙$	0.042
	Rkl ( $↓$ )	70	0.063	0.071 $∙$	0.070 $∙$	0.063		Rkl ( $↓$ )	70	0.041 $∙$	0.040 $∙$	0.040 $∙$	0.026
	Auc ( $↑$ )	30	0.889 $∙$	0.889 $∙$	0.889 $∙$	0.913		Auc ( $↑$ )	30	0.930 $∙$	0.930 $∙$	0.936 $∙$	0.957
	Auc ( $↑$ )	70	0.935	0.928 $∙$	0.923 $∙$	0.935		Auc ( $↑$ )	70	0.964 $∙$	0.959 $∙$	0.966 $∙$	0.973
	Cvg ( $↓$ )	30	6.193 $∙$	6.141 $∙$	6.271 $∙$	4.845		Cvg ( $↓$ )	30	3.865 $∙$	3.920 $∙$	3.304 $∙$	2.443
	Cvg ( $↓$ )	70	3.771	3.914 $∙$	3.878 $∙$	3.751		Cvg ( $↓$ )	70	2.103 $∙$	2.386 $∙$	2.373 $∙$	1.663
	Ap ( $↑$ )	30	0.615	0.613	0.614	0.615		Ap ( $↑$ )	30	0.780 $∙$	0.780 $∙$	0.784 $∙$	0.802
	Ap ( $↑$ )	70	0.689 $∙$	0.647 $∙$	0.650 $∙$	0.691		Ap ( $↑$ )	70	0.854 $∙$	0.865	0.865	0.865
Soci	Rkl ( $↓$ )	30	0.129 $∙$	0.128 $∙$	0.123 $∙$	0.102	Enr	Rkl ( $↓$ )	30	0.091 $∙$	0.115 $∙$	0.085 $∙$	0.075
	Rkl ( $↓$ )	70	0.074	0.081 $∙$	0.073	0.073		Rkl ( $↓$ )	70	0.042	0.060 $∙$	0.040	0.040
	Auc ( $↑$ )	30	0.871 $∙$	0.872 $∙$	0.877 $∙$	0.898		Auc ( $↑$ )	30	0.910 $∙$	0.887 $∙$	0.918 $∙$	0.926
	Auc ( $↑$ )	70	0.926	0.919 $∙$	0.928	0.929		Auc ( $↑$ )	70	0.960	0.942 $∙$	0.962	0.962
	Cvg ( $↓$ )	30	5.557 $∙$	5.459 $∙$	5.167 $∙$	4.496		Cvg ( $↓$ )	30	14.24 $∙$	16.65 $∙$	13.45 $∙$	12.05
	Cvg ( $↓$ )	70	3.641 $∙$	3.824 $∙$	3.608 $∙$	3.442		Cvg ( $↓$ )	70	7.961 $∙$	10.33 $∙$	7.480	7.510
	Ap ( $↑$ )	30	0.646	0.629 $∙$	0.650	0.652		Ap ( $↑$ )	30	0.739	0.711 $∙$	0.739	0.739
	Ap ( $↑$ )	70	0.719	0.717	0.719	0.719		Ap ( $↑$ )	70	0.854	0.842 $∙$	0.855	0.855
Cor	Rkl ( $↓$ )	30	0.226 $∙$	0.214 $∙$	0.206 $∙$	0.185	Ima	Rkl ( $↓$ )	30	0.302 $∙$	0.184 $∙$	0.175	0.173
	Rkl ( $↓$ )	70	0.138 $∙$	0.131 $∙$	0.123	0.125		Rkl ( $↓$ )	70	0.251 $∙$	0.148	0.148	0.148
	Auc ( $↑$ )	30	0.773 $∙$	0.786 $∙$	0.794 $∙$	0.814		Auc ( $↑$ )	30	0.820 $∙$	0.828	0.826	0.828
	Auc ( $↑$ )	70	0.874	0.874	0.874	0.874		Auc ( $↑$ )	70	0.834 $∙$	0.857	0.855	0.855
	Cvg ( $↓$ )	30	204.90 $∙$	182.76 $∙$	178.60 $∙$	153.82		Cvg ( $↓$ )	30	1.493 $∙$	1.104 $∙$	0.967	0.950
	Cvg ( $↓$ )	70	103.63	102.42	102.30	102.30		Cvg ( $↓$ )	70	0.790 $∙$	0.760	0.770	0.760
	Ap ( $↑$ )	30	0.275	0.259 $∙$	0.275	0.275		Ap ( $↑$ )	30	0.739 $∙$	0.776 $∙$	0.775 $∙$	0.785
	Ap ( $↑$ )	70	0.279	0.279	0.279	0.279		Ap ( $↑$ )	70	0.768 $∙$	0.841	0.834	0.841

Table 5: Prediction results for missing label data on ranking loss (Rkl), average AUC (Auc), coverage (Cvg) and average precision (Ap). ↑ (↓) denotes the larger (smaller) the better. ∙ indicates that GLOCAL is significantly better (paired t-tests at 95% significance level).

	Measure	$ρ$	MMLLOC	LEML	ML-LRC	GLOCAL		Measure	$ρ$	MMLLOC	LEML	ML-LRC	GLOCAL
Art	Rkl ( $↓$ )	30	0.225 $∙$	0.204 $∙$	0.184 $∙$	0.144	Bus	Rkl ( $↓$ )	30	0.083 $∙$	0.063 $∙$	0.061 $∙$	0.054
	Rkl ( $↓$ )	70	0.193 $∙$	0.181 $∙$	0.159 $∙$	0.139		Rkl ( $↓$ )	70	0.064 $∙$	0.058 $∙$	0.046	0.046
	Auc ( $↑$ )	30	0.781 $∙$	0.801 $∙$	0.828	0.831		Auc ( $↑$ )	30	0.917 $∙$	0.928 $∙$	0.937	0.937
	Auc ( $↑$ )	70	0.819 $∙$	0.825 $∙$	0.838	0.840		Auc ( $↑$ )	70	0.935 $∙$	0.942 $∙$	0.950	0.952
	Cvg ( $↓$ )	30	9.033 $∙$	7.369 $∙$	6.281 $∙$	5.867		Cvg ( $↓$ )	30	4.643 $∙$	3.954 $∙$	3.279 $∙$	2.863
	Cvg ( $↓$ )	70	7.262 $∙$	6.431 $∙$	5.432	5.352		Cvg ( $↓$ )	70	3.670 $∙$	3.303 $∙$	2.580	2.579
	Ap ( $↑$ )	30	0.529 $∙$	0.503 $∙$	0.517 $∙$	0.572		Ap ( $↑$ )	30	0.843 $∙$	0.866 $∙$	0.858 $∙$	0.879
	Ap ( $↑$ )	70	0.583 $∙$	0.589 $∙$	0.588 $∙$	0.607		Ap ( $↑$ )	70	0.861 $∙$	0.870 $∙$	0.870 $∙$	0.881
Com	Rkl ( $↓$ )	30	0.201 $∙$	0.179 $∙$	0.152	0.154	Edu	Rkl ( $↓$ )	30	0.187 $∙$	0.176 $∙$	0.144 $∙$	0.137
	Rkl ( $↓$ )	70	0.150 $∙$	0.141 $∙$	0.115	0.113		Rkl ( $↓$ )	70	0.165 $∙$	0.151 $∙$	0.113	0.111
	Auc ( $↑$ )	30	0.849 $∙$	0.880	0.873 $∙$	0.883		Auc ( $↑$ )	30	0.815 $∙$	0.817 $∙$	0.845	0.846
	Auc ( $↑$ )	70	0.868 $∙$	0.894	0.895	0.896		Auc ( $↑$ )	70	0.844 $∙$	0.842 $∙$	0.860	0.860
	Cvg ( $↓$ )	30	8.808 $∙$	7.392 $∙$	6.052 $∙$	5.798		Cvg ( $↓$ )	30	11.089 $∙$	9.672 $∙$	6.350	6.338
	Cvg ( $↓$ )	70	6.871 $∙$	6.306 $∙$	5.000	4.976		Cvg ( $↓$ )	70	8.096 $∙$	7.595 $∙$	5.075	5.070
	Ap ( $↑$ )	30	0.631 $∙$	0.646 $∙$	0.636 $∙$	0.669		Ap ( $↑$ )	30	0.538 $∙$	0.537 $∙$	0.543 $∙$	0.592
	Ap ( $↑$ )	70	0.674 $∙$	0.665 $∙$	0.667 $∙$	0.691		Ap ( $↑$ )	70	0.586 $∙$	0.591 $∙$	0.600 $∙$	0.622
Ent	Rkl ( $↓$ )	30	0.229 $∙$	0.175 $∙$	0.152 $∙$	0.122	Hea	Rkl ( $↓$ )	30	0.137 $∙$	0.095 $∙$	0.085	0.085
	Rkl ( $↓$ )	70	0.164 $∙$	0.159 $∙$	0.129 $∙$	0.109		Rkl ( $↓$ )	70	0.109 $∙$	0.074 $∙$	0.071 $∙$	0.065
	Auc ( $↑$ )	30	0.832 $∙$	0.826 $∙$	0.849 $∙$	0.859		Auc ( $↑$ )	30	0.894 $∙$	0.896 $∙$	0.907	0.906
	Auc ( $↑$ )	70	0.842	0.850 $∙$	0.870	0.871		Auc ( $↑$ )	70	0.901 $∙$	0.920	0.920	0.920
	Cvg ( $↓$ )	30	6.029 $∙$	5.755 $∙$	4.170	4.153		Cvg ( $↓$ )	30	7.104 $∙$	6.248 $∙$	4.924	4.814
	Cvg ( $↓$ )	70	4.857 $∙$	4.643 $∙$	3.483 $∙$	3.117		Cvg ( $↓$ )	70	5.866 $∙$	5.167 $∙$	3.960	3.963
	Ap ( $↑$ )	30	0.601 $∙$	0.601 $∙$	0.601 $∙$	0.645		Ap ( $↑$ )	30	0.727 $∙$	0.715 $∙$	0.720 $∙$	0.752
	Ap ( $↑$ )	70	0.635 $∙$	0.645 $∙$	0.643 $∙$	0.670		Ap ( $↑$ )	70	0.762 $∙$	0.770 $∙$	0.766 $∙$	0.775
Rec	Rkl ( $↓$ )	30	0.266 $∙$	0.245 $∙$	0.202 $∙$	0.165	Ref	Rkl ( $↓$ )	30	0.199 $∙$	0.187 $∙$	0.137 $∙$	0.098
	Rkl ( $↓$ )	70	0.204 $∙$	0.196 $∙$	0.167 $∙$	0.156		Rkl ( $↓$ )	70	0.155 $∙$	0.145 $∙$	0.098 $∙$	0.086
	Auc ( $↑$ )	30	0.785 $∙$	0.828 $∙$	0.802 $∙$	0.839		Auc ( $↑$ )	30	0.851 $∙$	0.847 $∙$	0.868 $∙$	0.886
	Auc ( $↑$ )	70	0.800 $∙$	0.837 $∙$	0.836 $∙$	0.845		Auc ( $↑$ )	70	0.861 $∙$	0.869 $∙$	0.895	0.898
	Cvg ( $↓$ )	30	7.084 $∙$	6.842 $∙$	5.397 $∙$	4.545		Cvg ( $↓$ )	30	7.549 $∙$	6.463 $∙$	5.052 $∙$	3.367
	Cvg ( $↓$ )	70	5.952 $∙$	5.685 $∙$	4.490	4.430		Cvg ( $↓$ )	70	6.419 $∙$	6.130 $∙$	3.694 $∙$	3.348
	Ap ( $↑$ )	30	0.547 $∙$	0.540 $∙$	0.540 $∙$	0.573		Ap ( $↑$ )	30	0.631	0.609 $∙$	0.611 $∙$	0.638
	Ap ( $↑$ )	70	0.597 $∙$	0.567 $∙$	0.600 $∙$	0.614		Ap ( $↑$ )	70	0.675	0.653 $∙$	0.653 $∙$	0.672
Sci	Rkl ( $↓$ )	30	0.257 $∙$	0.203 $∙$	0.169 $∙$	0.144	Soc	Rkl ( $↓$ )	30	0.149 $∙$	0.089 $∙$	0.095 $∙$	0.075
	Rkl ( $↓$ )	70	0.189 $∙$	0.174 $∙$	0.134	0.129		Rkl ( $↓$ )	70	0.108 $∙$	0.079 $∙$	0.076 $∙$	0.073
	Auc ( $↑$ )	30	0.827 $∙$	0.827 $∙$	0.830 $∙$	0.837		Auc ( $↑$ )	30	0.906 $∙$	0.906 $∙$	0.905 $∙$	0.913
	Auc ( $↑$ )	70	0.840 $∙$	0.849	0.850	0.850		Auc ( $↑$ )	70	0.910 $∙$	0.900 $∙$	0.914	0.914
	Cvg ( $↓$ )	30	12.805 $∙$	10.587 $∙$	8.794 $∙$	6.809		Cvg ( $↓$ )	30	7.652 $∙$	7.567 $∙$	6.308	6.088
	Cvg ( $↓$ )	70	9.960 $∙$	9.501 $∙$	6.900 $∙$	6.416		Cvg ( $↓$ )	70	5.886 $∙$	5.386 $∙$	5.103	4.929
	Ap ( $↑$ )	30	0.503 $∙$	0.479 $∙$	0.485 $∙$	0.531		Ap ( $↑$ )	30	0.712 $∙$	0.682 $∙$	0.700 $∙$	0.738
	Ap ( $↑$ )	70	0.569	0.551 $∙$	0.570 $∙$	0.574		Ap ( $↑$ )	70	0.748 $∙$	0.719 $∙$	0.728 $∙$	0.761
Soci	Rkl ( $↓$ )	30	0.252 $∙$	0.202 $∙$	0.175 $∙$	0.139	Enr	Rkl ( $↓$ )	30	0.179 $∙$	0.172 $∙$	0.173 $∙$	0.149
	Rkl ( $↓$ )	70	0.208 $∙$	0.194 $∙$	0.141 $∙$	0.136		Rkl ( $↓$ )	70	0.170 $∙$	0.162 $∙$	0.152 $∙$	0.129
	Auc ( $↑$ )	30	0.804 $∙$	0.808 $∙$	0.826	0.826		Auc ( $↑$ )	30	0.820 $∙$	0.830 $∙$	0.843 $∙$	0.853
	Auc ( $↑$ )	70	0.816 $∙$	0.816 $∙$	0.840	0.840		Auc ( $↑$ )	70	0.829 $∙$	0.839 $∙$	0.849 $∙$	0.872
	Cvg ( $↓$ )	30	9.550 $∙$	8.637 $∙$	6.944 $∙$	5.816		Cvg ( $↓$ )	30	22.72 $∙$	21.41 $∙$	20.42 $∙$	19.01
	Cvg ( $↓$ )	70	8.227 $∙$	7.638 $∙$	5.750	5.750		Cvg ( $↓$ )	70	21.90 $∙$	19.53 $∙$	18.17 $∙$	17.16
	Ap ( $↑$ )	30	0.569 $∙$	0.563 $∙$	0.565 $∙$	0.601		Ap ( $↑$ )	30	0.580 $∙$	0.582 $∙$	0.580 $∙$	0.589
	Ap ( $↑$ )	70	0.606 $∙$	0.589 $∙$	0.590 $∙$	0.625		Ap ( $↑$ )	70	0.585 $∙$	0.601 $∙$	0.607 $∙$	0.635
Cor	Rkl ( $↓$ )	30	0.332 $∙$	0.308 $∙$	0.331 $∙$	0.285	Ima	Rkl ( $↓$ )	30	0.224 $∙$	0.204 $∙$	0.220 $∙$	0.200
	Rkl ( $↓$ )	70	0.248 $∙$	0.250 $∙$	0.199	0.194		Rkl ( $↓$ )	70	0.195 $∙$	0.188	0.197 $∙$	0.187
	Auc ( $↑$ )	30	0.673 $∙$	0.693 $∙$	0.670 $∙$	0.714		Auc ( $↑$ )	30	0.796 $∙$	0.795 $∙$	0.800	0.801
	Auc ( $↑$ )	70	0.747 $∙$	0.749 $∙$	0.801	0.805		Auc ( $↑$ )	70	0.812	0.811	0.810	0.813
	Cvg ( $↓$ )	30	275.41 $∙$	233.83 $∙$	240.17 $∙$	211.84		Cvg ( $↓$ )	30	1.160 $∙$	1.103 $∙$	1.131 $∙$	1.070
	Cvg ( $↓$ )	70	212.84 $∙$	190.83 $∙$	160.59 $∙$	151.23		Cvg ( $↓$ )	70	1.066 $∙$	1.030	1.040 $∙$	1.025
	Ap ( $↑$ )	30	0.158 $∙$	0.166 $∙$	0.165 $∙$	0.174		Ap ( $↑$ )	30	0.745 $∙$	0.752 $∙$	0.744 $∙$	0.760
	Ap ( $↑$ )	70	0.176 $∙$	0.185 $∙$	0.188	0.192		Ap ( $↑$ )	70	0.768 $∙$	0.772	0.770 $∙$	0.777

Table 6: CPU timing results for learning with missing labels (ρ = 70). F is the time to fill in the missing labels, C is the time for clustering, I is the time for initialization, and R is the time of the main learning procedure. A is the total time (sum of F, I, C and R). Note that some algorithms may not need F, C or I.

	MBR			MMLLOC					LEML			ML-LRC			GLOCAL
	A	F	R	A	F	C	I	R	A	I	R	A	I	R	A	C	I	R
Arts	109	8	101	107	8	1	0	98	34	0	34	87	0	87	47	1	20	26
Business	38	6	32	104	6	1	0	97	35	0	35	82	0	82	49	1	24	24
Computers	78	11	67	121	11	1	0	109	46	0	46	94	0	94	53	1	31	21
Education	60	8	52	115	8	1	0	106	45	0	45	64	0	64	45	1	29	15
Entertainment	66	6	60	91	6	1	0	84	42	0	42	73	0	73	53	2	22	29
Health	64	11	53	116	11	1	0	104	41	0	41	75	0	75	67	1	32	34
Recreation	63	4	59	97	5	1	0	91	46	0	46	55	0	55	51	2	22	27
Reference	75	14	61	131	15	9	0	107	38	0	38	91	0	91	78	8	32	38
Science	101	15	86	133	15	1	0	117	53	0	53	103	0	103	77	2	32	43
Social	163	36	127	149	33	8	0	108	37	0	37	147	0	147	90	7	35	48
Society	83	8	75	106	8	1	0	97	32	0	32	117	0	117	44	2	18	24
Enron	47	10	37	59	10	1	0	48	38	0	38	78	0	78	69	1	25	43
Corel5k	458	272	186	1529	268	1	0	1260	307	0	307	709	0	709	413	1	78	344
Image	5	1	4	25	2	1	0	22	28	0	28	14	0	14	15	1	5	9

Equations

\min_{\bm{U},\bm{V},\bm{W}} \|\Pi_{\Omega}(\bm{Y}-\bm{UV})\|_{F}^{2} + \lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2} + \lambda_{2}R(\bm{U},\bm{V},\bm{W}),

\min_{\bm{U},\bm{V},\bm{W}} \|\Pi_{\Omega}(\bm{Y}-\bm{UV})\|_{F}^{2} + \lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2} + \lambda_{2}R(\bm{U},\bm{V},\bm{W}) + \lambda_{3}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{L}_{0}\bm{F}_{0}) + \sum_{m=1}^{g}\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{L}_{m}\bm{F}_{m}),
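The objective above can be evaluated numerically as a sanity check. The following is a minimal sketch, not the authors' code, and it rests on several assumptions read off the sub-problems listed further down: the projection Π_Ω is realized as an elementwise 0/1 mask J, the regularizer R is the sum of squared Frobenius norms of U, V and W, F_0 = U Wᵀ X and F_m = U Wᵀ X_m, L_m = Z_m Z_mᵀ, and L_0 is the n_m/n-weighted sum of the L_m. The function name, argument order and array shapes are hypothetical.

```python
import numpy as np

def glocal_objective(Y, J, X, U, V, W, Z_list, groups, lam, lam2, lam3, lam4):
    """Evaluate the objective under the assumptions stated in the lead-in.

    Assumed shapes: Y, J are (l, n); X is (d, n); U is (l, k); V is (k, n);
    W is (d, k); Z_list[m] is (l, q); groups[m] holds the column indices
    of the instances in group m.
    """
    n = X.shape[1]
    # Reconstruction error on the observed entries only (mask J).
    recon = np.sum((J * (Y - U @ V)) ** 2)
    # Fit of the latent labels V to the linear map W^T X.
    fit = lam * np.sum((V - W.T @ X) ** 2)
    # Assumed regularizer: squared Frobenius norms of all three factors.
    reg = lam2 * (np.sum(U ** 2) + np.sum(V ** 2) + np.sum(W ** 2))
    F0 = U @ W.T @ X  # classifier outputs on all instances, shape (l, n)
    manifold = 0.0
    for Z, idx in zip(Z_list, groups):
        Fm = F0[:, idx]  # outputs restricted to group m
        # tr(F^T Z Z^T F) equals the squared Frobenius norm of Z^T F.
        manifold += lam3 * len(idx) / n * np.sum((Z.T @ F0) ** 2)
        manifold += lam4 * np.sum((Z.T @ Fm) ** 2)
    return recon + fit + reg + manifold
```

Writing each trace as a squared norm avoids forming the l-by-l Laplacians explicitly, which keeps the evaluation cheap when the number of labels is large.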

\min_{\bm{U},\bm{V},\bm{W}}

\min_{\bm{U},\bm{V},\bm{W},\bm{Z}}

\min_{\bm{Z}_{m}}

\nabla_{\bm{Z}_{m}} = \frac{\lambda_{3}n_{m}}{n}\bm{UW}^{\top}\bm{XX}^{\top}\bm{WU}^{\top}\bm{Z}_{m} + \lambda_{4}\bm{UW}^{\top}\bm{X}_{m}\bm{X}_{m}^{\top}\bm{WU}^{\top}\bm{Z}_{m}.

\bm{z}_{m,j} \leftarrow \bm{z}_{m,j}/\|\bm{z}_{m,j}\|,
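The gradient with respect to Z_m and the normalization step can be combined into a simple update, sketched below. This is a plain gradient step followed by renormalization, swapped in for illustration; it is not the manifold solver the paper refers to. It assumes that z_{m,j} denotes the j-th row of Z_m and uses the shorthand F_0 = U Wᵀ X and F_m = U Wᵀ X_m. The function names and the step size are hypothetical.

```python
import numpy as np

def grad_Z(Z, F0, Fm, lam3, lam4, n_m, n):
    """Gradient as listed above, with F0 = U W^T X and Fm = U W^T X_m.

    The listed expression omits a constant factor of 2 relative to the exact
    derivative of the trace terms; that only rescales the step size.
    """
    return (lam3 * n_m / n) * (F0 @ (F0.T @ Z)) + lam4 * (Fm @ (Fm.T @ Z))

def normalize_rows(Z, eps=1e-12):
    """Rescale every row of Z to unit Euclidean norm (assumed meaning of z_{m,j})."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)

def z_step(Z, F0, Fm, lam3, lam4, n_m, n, step=1e-2):
    """One gradient step on Z_m followed by the row normalization."""
    return normalize_rows(Z - step * grad_Z(Z, F0, Fm, lam3, lam4, n_m, n))
```

Multiplying as F0 @ (F0.T @ Z) rather than (F0 @ F0.T) @ Z avoids building an l-by-l matrix when Z has few columns.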

\min_{\bm{V}} \|\bm{J}\circ(\bm{Y}-\bm{UV})\|_{F}^{2} + \lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2} + \lambda_{2}\|\bm{V}\|_{F}^{2}.

\min_{\bm{v}_{i}} \|\mathrm{Diag}(\bm{j}_{i})\bm{y}_{i} - \mathrm{Diag}(\bm{j}_{i})\bm{U}\bm{v}_{i}\|^{2} + \lambda\|\bm{v}_{i}-\bm{W}^{\top}\bm{x}_{i}\|^{2} + \lambda_{2}\|\bm{v}_{i}\|^{2}.

\bm{v}_{i} = \big(\bm{U}^{\top}\mathrm{Diag}(\bm{j}_{i})\bm{U} + (\lambda+\lambda_{2})\bm{I}\big)^{-1}\big(\lambda\bm{W}^{\top}\bm{x}_{i} + \bm{U}^{\top}\mathrm{Diag}(\bm{j}_{i})\bm{y}_{i}\big).

\nabla_{\bm{V}} = \bm{U}^{\top}(\bm{J}\circ(\bm{UV}-\bm{Y})) + \lambda(\bm{V}-\bm{W}^{\top}\bm{X}) + \lambda_{2}\bm{V}.
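The closed-form update for each column v_i can be written as one small linear solve per instance. The sketch below is illustrative only and assumes that J is a 0/1 mask of observed labels with the same shape as Y, and that instances are stored as columns; the function name and shapes are hypothetical.

```python
import numpy as np

def update_V(Y, J, X, U, W, lam, lam2):
    """Column-wise closed-form update of V.

    Assumed shapes: Y, J are (l, n); X is (d, n); U is (l, k); W is (d, k).
    Returns V with shape (k, n).
    """
    k = U.shape[1]
    n = Y.shape[1]
    WX = W.T @ X              # targets W^T x_i for all instances, shape (k, n)
    eye = np.eye(k)
    V = np.empty((k, n))
    for i in range(n):
        UtD = U.T * J[:, i]   # U^T Diag(j_i): scales column c of U^T by j_i[c]
        A = UtD @ U + (lam + lam2) * eye
        b = lam * WX[:, i] + UtD @ Y[:, i]
        # Solve the k-by-k system instead of forming the inverse explicitly.
        V[:, i] = np.linalg.solve(A, b)
    return V
```

At the returned V the listed gradient with respect to V vanishes, which gives a quick way to check an implementation. The system matrix is positive definite whenever λ + λ_2 > 0, so the solve is well posed.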

\min_{\bm{U}} \|\bm{J}\circ(\bm{Y}-\bm{UV})\|_{F}^{2} + \lambda_{2}\|\bm{U}\|_{F}^{2} + \sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{0}) + \lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{m})\Big).

\nabla_{\bm{U}} = (\bm{J}\circ(\bm{UV}-\bm{Y}))\bm{V}^{\top} + \lambda_{2}\bm{U} + \sum_{m=1}^{g}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{U}\Big(\frac{\lambda_{3}n_{m}}{n}\bm{W}^{\top}\bm{XX}^{\top}\bm{W} + \lambda_{4}\bm{W}^{\top}\bm{X}_{m}\bm{X}_{m}^{\top}\bm{W}\Big).

\min_{\bm{W}} \lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2} + \lambda_{2}\|\bm{W}\|_{F}^{2} + \sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{0}) + \lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{m})\Big).

\nabla_{\bm{W}} = \lambda\bm{X}\left(\bm{X}^{\top}\bm{W}-\bm{V}^{\top}\right) + \lambda_{2}\bm{W} + \sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\bm{XX}^{\top} + \lambda_{4}\bm{X}_{m}\bm{X}_{m}^{\top}\Big)\bm{WU}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{U}.
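Differentiating the U and W sub-problems directly gives gradients in which the λ_3 term is paired with the full data matrix X and the λ_4 term with the group data X_m, as in the two trace terms of the sub-problem objectives. The sketch below follows that derivation and is an assumption-laden illustration rather than the authors' implementation: it takes F_0 = U Wᵀ X and F_m = U Wᵀ X_m, treats J as a 0/1 mask, and drops the constant factor of 2 as the listed gradients do. Function names and shapes are hypothetical.

```python
import numpy as np

def grad_U(Y, J, X, U, V, W, Z_list, groups, lam2, lam3, lam4):
    """Gradient of the U sub-problem (up to a factor of 2).

    Assumed shapes: Y, J are (l, n); X is (d, n); U is (l, k); V is (k, n);
    W is (d, k); Z_list[m] is (l, q); groups[m] indexes the columns of group m.
    """
    n = X.shape[1]
    G0 = W.T @ X @ X.T @ W                   # global second moment, (k, k)
    g = (J * (U @ V - Y)) @ V.T + lam2 * U
    for Z, idx in zip(Z_list, groups):
        Xm = X[:, idx]
        Gm = W.T @ Xm @ Xm.T @ W             # group-level second moment, (k, k)
        g += Z @ (Z.T @ U) @ (lam3 * len(idx) / n * G0 + lam4 * Gm)
    return g

def grad_W(X, U, V, W, Z_list, groups, lam, lam2, lam3, lam4):
    """Gradient of the W sub-problem (up to a factor of 2)."""
    n = X.shape[1]
    XXt = X @ X.T
    g = lam * X @ (X.T @ W - V.T) + lam2 * W
    for Z, idx in zip(Z_list, groups):
        Xm = X[:, idx]
        S = lam3 * len(idx) / n * XXt + lam4 * (Xm @ Xm.T)
        g += S @ W @ U.T @ Z @ (Z.T @ U)
    return g
```

A central finite difference of the sub-problem objective should match twice these values entry by entry, which is a convenient check before plugging the gradients into any first-order solver.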

\mathrm{Rkl} = \frac{1}{p}\sum_{i=1}^{p}\frac{|Q_{i}|}{|C_{i}^{+}||C_{i}^{-}|}.

\mathrm{Auc} = \frac{1}{l}\sum_{j=1}^{l}\frac{|\tilde{Q}_{j}|}{|Z_{j}^{+}||Z_{j}^{-}|}.

\mathrm{Cvg} = \frac{1}{p}\sum_{i=1}^{p}\max\{\mathrm{rank}_{f}(\bm{x}_{i},j) \mid j\in C_{i}^{+}\} - 1.

\mathrm{Ap} = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{|C_{i}^{+}|}\sum_{c\in C_{i}^{+}}\frac{|\hat{Q}_{i,c}|}{\mathrm{rank}_{f}(\bm{x}_{i},c)}.
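The four measures above can be computed from a 0/1 label matrix and a score matrix. The sketch below assumes the usual definitions from the multi-label literature, since the sets are not spelled out in this listing: Q_i is the set of (relevant, irrelevant) label pairs of instance i that are ranked in the wrong order, the Auc counterpart counts correctly ordered (positive, negative) instance pairs per label, rank_f is the position of a label when scores are sorted in descending order, and the Ap numerator counts relevant labels ranked at or above label c. Instances or labels without both positives and negatives are skipped, which is a convention chosen here. All function names are hypothetical.

```python
import numpy as np

def _ranks(S):
    """Rank of every label per instance; 1 is the highest score, ties broken by index."""
    order = np.argsort(-S, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(S.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, S.shape[1] + 1)
    return ranks

def ranking_loss(Y, S):
    """Fraction of (relevant, irrelevant) label pairs ordered wrongly, averaged over instances."""
    vals = []
    for y, s in zip(Y, S):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) and len(neg):
            vals.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(vals))

def average_auc(Y, S):
    """Fraction of (positive, negative) instance pairs ordered correctly, averaged over labels."""
    vals = []
    for j in range(Y.shape[1]):
        pos, neg = S[Y[:, j] == 1, j], S[Y[:, j] == 0, j]
        if len(pos) and len(neg):
            vals.append(np.mean(pos[:, None] >= neg[None, :]))
    return float(np.mean(vals))

def coverage(Y, S):
    """Average depth in the ranking needed to cover all relevant labels, minus one."""
    R = _ranks(S)
    vals = [R[i][Y[i] == 1].max() - 1 for i in range(len(Y)) if Y[i].any()]
    return float(np.mean(vals))

def average_precision(Y, S):
    """Average, over relevant labels, of the precision at each relevant label's rank."""
    R = _ranks(S)
    vals = []
    for y, r in zip(Y, R):
        pos = r[y == 1]
        if len(pos):
            vals.append(np.mean([(pos <= rc).sum() / rc for rc in pos]))
    return float(np.mean(vals))

# Tiny worked example: two instances, three labels.
Y = np.array([[1, 0, 0], [0, 1, 1]])
S = np.array([[0.9, 0.2, 0.1], [0.8, 0.6, 0.3]])
print(ranking_loss(Y, S), average_auc(Y, S), coverage(Y, S), average_precision(Y, S))
```

In the example the first instance is ranked perfectly, while the second places its only irrelevant label first, so its two relevant labels sit at ranks 2 and 3.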


Taxonomy

Topics: Text and Document Classification Technologies · Natural Language Processing Techniques · Music and Audio Processing

Full text

Multi-Label Learning with Global and Local Label Correlation

Yue Zhu1

James T. Kwok2

Zhi-Hua Zhou1

1 National Key Laboratory for Novel Software Technology

Nanjing University, Nanjing 210093, China

Email: {zhuy, zhouzh}@lamda.nju.edu.cn

2 Department of Computer Science and Engineering

Hong Kong University of Science and Technology, Hong Kong

Email: [email protected]

Abstract

It is well known that exploiting label correlations is important to multi-label learning. Existing approaches assume either that the label correlations are global and shared by all instances, or that they are local and shared only by a data subset. In real-world applications, both cases may occur: some label correlations are globally applicable, while others are shared only within a local group of instances. Moreover, it is common that only partial labels are observed, which makes exploiting label correlations much more difficult, since the correlations are hard to estimate when many labels are absent. In this paper, we propose a new multi-label approach, GLOCAL, which deals with both the full-label and the missing-label cases and exploits global and local label correlations simultaneously by learning a latent label representation and optimizing label manifolds. Extensive experimental studies validate the effectiveness of our approach on both full-label and missing-label data.

keywords:

Global and local label correlation, label manifold, missing labels, multi-label learning.

1 Introduction

In real-world classification applications, an instance is often associated with more than one class label. For example, a scene image can be annotated with several tags boutell2004learning, a document may belong to multiple topics ueda2002parametric, and a piece of music may be associated with different genres turnbull2008semantic. Thus, multi-label learning has attracted a lot of attention in recent years zhang2014review.

Current studies on multi-label learning try to incorporate label correlations of different orders zhang2014review . However, existing approaches mostly focus on global label correlations shared by all instances furnkranz2008multilabel ; ji2008extracting ; read2011classifier . For example, labels “fish” and “ocean” are highly correlated, and so are “stock” and “finance”. On the other hand, certain label correlations are only shared by a local data subset huang2012 . For example, “apple” is related to “fruit” in gourmet magazines, but is related to “digital devices” in technology magazines. Previous studies focus on exploiting either global or local label correlations. However, considering both of them is obviously more beneficial and desirable.

Another problem with label correlations is that they are usually difficult to specify manually. As label correlations may vary in different contexts and there is no unified measure for specifying appropriate correlations, they are usually estimated from the observed data. Some approaches learn the label hierarchies by hierarchical clustering Punera2005Automatically or Bayesian network structure learning zhang2010multi . However, the hierarchical structure may not exist in some applications. For example, labels such as “desert”, “mountains”, “sea”, “sunset” and “trees” do not have any natural hierarchical correlations, and label hierarchies may not be useful. Others estimate label correlations by the co-occurrence of labels in training data NIPS2011_4239 . However, it may cause overfitting. Moreover, co-occurrence is less meaningful for labels with very few positive instances.

In multi-label learning, some labels may be missing from the training set. For example, human labelers may ignore object classes that they do not know or that are of little interest to them. Recently, multi-label learning with missing labels has become a hot topic. Xu et al. xu2013speedup and Yu et al. Yu2014 considered using the low-rank structure on the instance-label mapping. A more direct approach to modeling the label dependency approximates the label matrix as a product of two low-rank matrices goldberg2010transduction. This leads to simpler recovery of the missing labels, and produces a latent representation of the label matrix.

In the missing label cases, estimation of label correlation becomes even more difficult, as the observed label distribution is different from the true one. As a result, the aforementioned methods (based on hierarchical clustering and co-occurrence, for example) will produce biased estimates of label correlations.

In this paper, we propose a new approach called “Multi-Label Learning with GLObal and loCAL Correlation” (GLOCAL), which simultaneously recovers the missing labels, trains the linear classifiers and exploits both global and local label correlations. It learns a latent label representation. Classifier outputs are encouraged to be similar on highly positively correlated labels, and dissimilar on highly negatively correlated labels. We do not assume the presence of external knowledge sources specifying the label correlations. Instead, these correlations are learned simultaneously with the latent label representations and instance-label mapping.

The rest of the paper is organized as follows. Section 2 reviews related work on multi-label learning with label correlations. Section 3 presents the problem formulation and the GLOCAL approach. Experimental results are presented in Section 4. Finally, Section 5 concludes the work.

Notations. For a matrix $\bm{A}$, $\bm{A}^{\top}$ denotes its transpose, $\mathrm{tr}(\bm{A})$ is its trace, $\|\bm{A}\|_{F}$ is its Frobenius norm, and $\text{diag}(\bm{A})$ returns a vector containing the diagonal elements of $\bm{A}$. For two matrices $\bm{A}$ and $\bm{B}$, $\bm{A}\circ\bm{B}$ denotes the Hadamard (element-wise) product. For a vector $\bm{c}$, $\|\bm{c}\|_{2}$ is its $\ell_{2}$-norm, and $\text{Diag}(\bm{c})$ returns a diagonal matrix with $\bm{c}$ on the diagonal.

2 Related Work

Multi-label learning has been widely studied in recent years. Based on the degree of label correlations used, it can be divided into three categories zhang2014review : (i) first-order; (ii) second-order; and (iii) high-order. For the first-order strategy, label correlations are not considered, and the multi-label problem is transformed into multiple independent binary classification problems. For example, BR boutell2004learning trains a classifier for each label independently. For the second-order strategy, pairwise label relations are considered. For example, CLR furnkranz2008multilabel transforms the multi-label learning problem into the pairwise label ranking problem. For the high-order strategy, all other labels’ influences imposed on each label are taken into account. For example, CC read2011classifier transforms the multi-label learning problem into a chain of binary classification problems, with the ground-truth labels encoded into the features.

Most previous studies focus on global label correlations. However, MLLOC huang2012 demonstrates that sometimes label correlations may only be shared by a local data subset. Specifically, it enhances the feature representation of each instance by embedding a code into the feature space, which encodes the influence of labels of an instance to the local label correlations. This has some limitations. First, when the dimensionality of the feature space is large, the code is less discriminative and will be dominated by the original features. Second, MLLOC considers only the local label correlations, but not the global ones. Third, MLLOC cannot learn with missing labels.

In some real-world applications, labels are partially observed, and multi-label learning with missing labels has attracted much attention. MAXIDE xu2013speedup is based on fast low-rank matrix completion, and has strong theoretical guarantees. However, it only works in the transductive setting. Moreover, a label correlation matrix has to be specified manually. LEML Yu2014 also relies on a low-rank structure, and works in an inductive setting. However, it only implicitly uses global label correlations. ML-LRC xu2014learning adopts a low-rank structure to capture global label correlations, and addresses the missing labels by introducing a supplementary label matrix. However, only global label correlations are taken into account. Obviously, it would be more desirable to learn both global and local label correlations simultaneously.

Manifold regularization belkin2006manifold exploits instance similarity by forcing the predicted values on similar instances to be similar. A similar idea can be adapted to the label manifold, and so predicted values for correlated labels should be similar. However, the Laplacian matrix is based on some label similarity or correlation matrix, which can be hard to specify as discussed in Section 1.

3 The Proposed Approach

In multi-label learning, an instance can be associated with multiple class labels. Let $\bm{C}=\{c_{1},\dots,c_{l}\}$ be the class label set of $l$ labels. We denote the feature vector of an instance by $\bm{x}\in\mathcal{X}\subseteq\mathbb{R}^{d}$, and denote the ground-truth label vector by $\tilde{\bm{y}}\in\mathcal{Y}\subseteq\{-1,1\}^{l}$, where $[\tilde{\bm{y}}]_{j}=1$ if $\bm{x}$ has class label $c_{j}$, and $-1$ otherwise. As mentioned in Section 1, instances in the training data may be partially labeled, i.e., some labels may be missing. We adopt the general setting that both positive and negative labels can be missing goldberg2010transduction; xu2013speedup; Yu2014. The observed label vector is denoted $\bm{y}$, where $[\bm{y}]_{j}=0$ if class label $c_{j}$ is not labeled (i.e., it is missing), and $[\bm{y}]_{j}=[\tilde{\bm{y}}]_{j}$ otherwise. Given the training data $\mathcal{D}=\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{n}$, our goal is to learn a mapping function $\Psi:\mathcal{X}\rightarrow\mathcal{Y}$.

In this paper, we propose the GLOCAL algorithm, which learns and exploits both global and local label correlations via label manifolds. To recover the missing labels, learning of the latent label representation and classifier training are performed simultaneously.

3.1 Basic Model

Let $\tilde{\bm{Y}}\!=\![\tilde{\bm{y}}_{1},\dots,\tilde{\bm{y}}_{n}]\!\in\!\{-1,1\}^{l\times n}$ be the ground-truth label matrix, where each $\tilde{\bm{y}}_{i}$ is the label vector for instance $i$ . As discussed in Section 1, $\tilde{\bm{Y}}$ is low-rank. Let its rank be $k<l$ . Thus, $\tilde{\bm{Y}}$ can be written as the low-rank decomposition $\bm{UV}$ , where $\bm{U}\in\mathbb{R}^{l\times k}$ and $\bm{V}\in\mathbb{R}^{k\times n}$ . Intuitively, $\bm{V}$ represents the latent labels that are more compact and more semantically abstract than the original labels, while matrix $\bm{U}$ projects the original labels to the latent label space.

In general, the labels are only partially observed. Let the observed label matrix be $\bm{Y}\!=\![\bm{y}_{1},\dots,\bm{y}_{n}]\!\in\!\{-1,0,1\}^{l\times n}$ , and $\Omega$ be the set containing indices of the observed labels in $\bm{Y}$ (i.e., indices of the nonzero elements in $\bm{Y}$ ). We focus on minimizing the reconstruction error on the observed labels, i.e., $\|\Pi_{\Omega}(\bm{Y-UV})\|_{F}^{2}$ , where $[\Pi_{\Omega}(\bm{A})]_{ij}=A_{ij}$ if $\left(i,j\right)\in\Omega$ , and 0 otherwise. Moreover, we use a linear mapping $\bm{W}\in\mathbb{R}^{d\times k}$ to map instances to the latent labels. This $\bm{W}$ is learned by minimizing $\|\bm{V-W}^{\top}\bm{X}\|_{F}^{2}$ , where $\bm{X}=[\bm{x}_{1},\dots,\bm{x}_{n}]\in\mathbb{R}^{d\times n}$ is the instance matrix. Combining these two, we obtain the following optimization problem:

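$$\min_{\bm{U},\bm{V},\bm{W}}\ \|\Pi_{\Omega}(\bm{Y}-\bm{UV})\|_{F}^{2}+\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\mathcal{R}(\bm{U},\bm{V},\bm{W}), \tag{1}$$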

where $\mathcal{R}(\bm{U},\bm{V},\bm{W})$ is a regularizer and $\lambda$, $\lambda_{2}$ are tradeoff parameters. While the square loss has been used in Eqn. (1), it can be replaced by any differentiable loss function. The prediction on $\bm{x}$ is $\mathrm{sign}(\bm{f}(\bm{x}))$, where $\bm{f}(\bm{x})=\bm{UW}^{\top}\bm{x}$. Let $\bm{f}=[f_{1},\cdots,f_{l}]^{\top}$, so that $f_{j}(\bm{x})$ denotes the predicted value on the $j$th label for $\bm{x}$. Concatenating $\bm{f}(\bm{x})$ for all $\bm{x}\in\bm{X}$ gives $\bm{F}_{0}=[\bm{f}(\bm{x}_{1}),\cdots,\bm{f}(\bm{x}_{n})]=\bm{UW}^{\top}\bm{X}$.
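The basic model can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the objective is taken to be the masked squared reconstruction error plus $\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}$ plus $\lambda_{2}$ times the squared Frobenius norms of $\bm{U},\bm{V},\bm{W}$; the function names and default parameter values are hypothetical, and a score of exactly zero is mapped to $+1$ by convention.

```python
import numpy as np

def basic_objective(Y, X, U, V, W, lam=1.0, lam2=0.1):
    """Masked reconstruction loss + latent regression loss + Frobenius regularizer.

    Y: (l, n) observed labels in {-1, 0, +1}, where 0 marks a missing entry.
    X: (d, n) instances; U: (l, k); V: (k, n); W: (d, k).
    """
    J = (Y != 0).astype(float)                 # indicator of observed entries
    recon = np.sum((J * (Y - U @ V)) ** 2)     # ||Pi_Omega(Y - UV)||_F^2
    latent = np.sum((V - W.T @ X) ** 2)        # ||V - W^T X||_F^2
    reg = np.sum(U ** 2) + np.sum(V ** 2) + np.sum(W ** 2)
    return recon + lam * latent + lam2 * reg

def predict(U, W, X):
    """Return sign(U W^T X), one column per instance (zero scores map to +1)."""
    F = U @ W.T @ X
    return np.where(F >= 0, 1, -1)
```

Only the observed entries of `Y` contribute to the reconstruction term, which is what allows the same objective to cover both the full-label and the missing-label cases.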

3.2 Global and Local Manifold Regularizers

Exploiting label correlations is an essential ingredient in multi-label learning. Here, we use label correlations to regularize the model. Intuitively, the more positively correlated two labels are, the closer are the corresponding classifier outputs, and vice versa. Let $\bm{S}_{0}=[S_{ij}]\in\mathbb{R}^{l\times l}$ be the global label correlation matrix. The manifold regularizer $\sum_{i,j}S_{ij}\|\bm{f}_{i,:}-\bm{f}_{j,:}\|_{2}^{2}$ should have a small value melacci2011primallapsvm . Here, $\bm{f}_{i,:}$ , the $i$ th row of $\bm{F}_{0}$ , is the vector of classifier outputs for the $i$ th label on the $n$ samples. Let $\bm{D}_{0}$ be the diagonal matrix with diagonal $\bm{S}_{0}\mathbf{1}$ , where $\mathbf{1}$ is the vector of ones. The manifold regularizer can be equivalently written as $\mathrm{tr}(\bm{F}_{0}^{\top}\bm{L}_{0}\bm{F}_{0})$ luo2009non , where $\bm{L}_{0}=\bm{D}_{0}-\bm{S}_{0}$ is the Laplacian matrix of $\bm{S}_{0}$ .
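The relation between the pairwise form and the trace form of the regularizer can be checked numerically. The sketch below assumes a symmetric correlation matrix and the unnormalized Laplacian $\bm{L}=\bm{D}-\bm{S}$ defined above; under these assumptions the pairwise sum equals $2\,\mathrm{tr}(\bm{F}^{\top}\bm{L}\bm{F})$, so the two forms agree up to a constant factor that can be absorbed into the tradeoff parameter. The function names are illustrative.

```python
import numpy as np

def laplacian(S):
    """Unnormalized graph Laplacian L = D - S, with D = Diag(S 1)."""
    return np.diag(S.sum(axis=1)) - S

def pairwise_regularizer(S, F):
    """Sum over i, j of S_ij * ||F[i, :] - F[j, :]||^2 (rows of F are labels)."""
    diff = F[:, None, :] - F[None, :, :]       # shape (l, l, n)
    return np.sum(S * np.sum(diff ** 2, axis=2))

def trace_regularizer(S, F):
    """tr(F^T L F) for the Laplacian of S."""
    return np.trace(F.T @ laplacian(S) @ F)
```

For a symmetric `S`, `pairwise_regularizer(S, F)` and `2 * trace_regularizer(S, F)` should coincide up to floating-point error.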

As discussed in Section 1, label correlations may vary from one local region to another. Assume that the data $\bm{X}$ is partitioned into $g$ groups $\{\bm{X}_{1},\dots,\bm{X}_{g}\}$ , where $\bm{X}_{m}\in\mathbb{R}^{d\times n_{m}}$ has size $n_{m}$ . This partitioning can be obtained by domain knowledge (e.g., gene pathways subramanian2005gene and networks chuang2007network in bioinformatics applications) or clustering. Let $\bm{Y}_{m}$ be the label submatrix in $\bm{Y}$ corresponding to $\bm{X}_{m}$ , and $\bm{S}_{m}\in\mathbb{R}^{l\times l}$ be the local label correlation matrix of group $m$ . Similar to global label correlation, to encourage the classifier outputs to be similar on the positively correlated labels and dissimilar on the negatively correlated ones, we minimize $\mathrm{tr}(\bm{F}_{m}^{\top}\bm{L}_{m}\bm{F}_{m})$ , where $\bm{L}_{m}$ is the Laplacian matrix of $\bm{S}_{m}$ and $\bm{F}_{m}=\bm{UW}^{\top}\bm{X}_{m}$ is the classifier output matrix for group $m$ .

Combining global and local label correlations with Eqn. (1), we have the following optimization problem:

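$$\min_{\bm{U},\bm{V},\bm{W}}\ \|\Pi_{\Omega}(\bm{Y}-\bm{UV})\|_{F}^{2}+\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\mathcal{R}(\bm{U},\bm{V},\bm{W})+\lambda_{3}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{L}_{0}\bm{F}_{0})+\sum_{m=1}^{g}\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{L}_{m}\bm{F}_{m}), \tag{2}$$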

where $\lambda,\lambda_{2},\lambda_{3},\lambda_{4}$ are tradeoff parameters.

Intuitively, a large local group contributes more to the global label correlations. In particular, the following Lemma shows that when the cosine similarity is used to compute $\bm{S}_{ij}$ , we have $\bm{S}_{0}=\sum_{m=1}^{g}\frac{n_{m}}{n}\bm{S}_{m}$ .

Lemma 1

Let $[\bm{S}_{0}]_{ij}=\frac{\bm{y}_{i,:}\bm{y}_{j,:}^{\top}}{\|\bm{y}_{i,:}\|\|\bm{y}_{j,:}\|}$ and $[\bm{S}_{m}]_{ij}=\frac{\bm{y}_{m,i,:}\bm{y}_{m,j,:}^{\top}}{\|\bm{y}_{m,i,:}\|\|\bm{y}_{m,j,:}\|}$ , where $\bm{y}_{i,:}$ is the $i$ th row of $\bm{Y}$ , and $\bm{y}_{m,i,:}$ is the $i$ th row of $\bm{Y}_{m}$ . Then, $\bm{S}_{0}=\sum_{m=1}^{g}\frac{n_{m}}{n}\bm{S}_{m}$ .

In general, when the global label correlation matrix is a linear combination of the local label correlation matrices, the following Proposition shows that the global label Laplacian matrix is also a linear combination of the local label Laplacian matrices with the same combination coefficients.

Proposition 1

If $\bm{S}_{0}\!=\!\sum_{m=1}^{g}\beta_{m}\bm{S}_{m}$ , then $\bm{L}_{0}\!=\!\sum_{m=1}^{g}\beta_{m}\bm{L}_{m}$ .
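Proposition 1 follows from the linearity of the Laplacian in the correlation matrix. A short derivation, assuming the unnormalized Laplacian $\bm{L}=\mathrm{Diag}(\bm{S}\mathbf{1})-\bm{S}$ used in this section:

```latex
\begin{aligned}
\bm{L}_{0}
&= \mathrm{Diag}(\bm{S}_{0}\mathbf{1})-\bm{S}_{0}
 = \mathrm{Diag}\Big(\sum_{m=1}^{g}\beta_{m}\bm{S}_{m}\mathbf{1}\Big)-\sum_{m=1}^{g}\beta_{m}\bm{S}_{m} \\
&= \sum_{m=1}^{g}\beta_{m}\big(\mathrm{Diag}(\bm{S}_{m}\mathbf{1})-\bm{S}_{m}\big)
 = \sum_{m=1}^{g}\beta_{m}\bm{L}_{m}.
\end{aligned}
```

The second line uses only the fact that $\mathrm{Diag}(\cdot)$ is linear in its argument.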

Using Lemma 1 and Proposition 1, Eqn. (2) can then be rewritten as follows:

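$$\min_{\bm{U},\bm{V},\bm{W}}\ \|\Pi_{\Omega}(\bm{Y}-\bm{UV})\|_{F}^{2}+\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\mathcal{R}(\bm{U},\bm{V},\bm{W})+\sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{L}_{m}\bm{F}_{0})+\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{L}_{m}\bm{F}_{m})\Big). \tag{3}$$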

The success of label manifold regularization hinges on a good correlation matrix (or equivalently, a good Laplacian matrix). In multi-label learning, one rudimentary approach is to compute the correlation coefficient between two labels by cosine similarity wang2009image. However, this can be noisy since some labels may have only very few positive instances in the training data. When labels can be missing, this computation may even become misleading, since the distribution of the observed labels may differ substantially from the ground-truth label distribution.

In this paper, instead of specifying any correlation metric or label correlation matrix, we learn the Laplacian matrices directly. Note that the Laplacian matrices are symmetric positive semidefinite. Thus, for $m\in\{1,\ldots,g\}$, we decompose $\bm{L}_{m}$ as $\bm{Z}_{m}\bm{Z}_{m}^{\top}$, where $\bm{Z}_{m}\!\in\!\mathbb{R}^{l\times k}$. For simplicity, $k$ is set to the dimensionality of the latent representation $\bm{V}$. As a result, learning the Laplacian matrices is transformed to learning $\bm{\mathcal{Z}}\equiv\{\bm{Z}_{1},\dots,\bm{Z}_{g}\}$. Note that optimization w.r.t. $\bm{Z}_{m}$ may lead to the trivial solution $\bm{Z}_{m}=\mathbf{0}$. To avoid this problem, we add the constraint that the diagonal entries in $\bm{Z}_{m}\bm{Z}_{m}^{\top}$ are 1, for $m\in\{1,\cdots,g\}$. This constraint also enables us to obtain a normalized Laplacian matrix chung1997spectral of $\bm{L}_{m}$.

Let $\bm{J}=[\bm{J}_{ij}]$ be the indicator matrix with $\bm{J}_{ij}=1$ if $(i,j)\in\Omega$ , and 0 otherwise. $\Pi_{\Omega}(\bm{Y-UV})$ can be rewritten as the Hadamard product $\bm{J}\circ\bm{(Y-UV})$ . Combining the decomposition of Laplacian matrices and the diagonal constraints of $\bm{Z}_{m}$ , we obtain the optimization problem as:

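$$\begin{aligned}\min_{\bm{U},\bm{V},\bm{W},\bm{\mathcal{Z}}}\ &\|\bm{J}\circ(\bm{Y}-\bm{UV})\|_{F}^{2}+\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\mathcal{R}(\bm{U},\bm{V},\bm{W})+\sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{0})+\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{m})\Big)\\ \text{s.t.}\ &\mathrm{diag}(\bm{Z}_{m}\bm{Z}_{m}^{\top})=\mathbf{1},\quad m=1,\dots,g.\end{aligned} \tag{4}$$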

Moreover, we will use $\mathcal{R}(\bm{U,V,W})=\|\bm{U}\|_{F}^{2}+\|\bm{V}\|_{F}^{2}+\|\bm{W}\|_{F}^{2}$ .

3.3 Learning by Alternating Minimization

Problem (4) can be solved by alternating minimization (Algorithm 1). In each iteration, we update one of the variables in $\{\bm{Z},\bm{U},\bm{V},\bm{W}\}$ with gradient descent, and leave the others fixed. Specifically, the MANOPT toolbox manopt is utilized to implement gradient descent with line search on the Euclidean space for the update of $\bm{U},\bm{V},\bm{W}$ , and on the manifolds for the update of $\bm{Z}$ .

3.3.1 Updating $\bm{Z}_{m}$

With $\bm{U},\bm{V},\bm{W}$ fixed, problem (4) reduces to

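$$\min_{\bm{Z}_{m}}\ \frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{0})+\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{m})\quad\text{s.t.}\quad\mathrm{diag}(\bm{Z}_{m}\bm{Z}_{m}^{\top})=\mathbf{1},$$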

for each $m\in\{1,\dots,g\}$ . Due to the constraint $\mathrm{diag}(\bm{Z}_{m}\bm{Z}_{m}^{\top})=\mathbf{1}$ , it has no closed-form solution, and we will solve it with projected gradient descent. The gradient of the objective w.r.t. $\bm{Z}_{m}$ is

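$$\nabla_{\bm{Z}_{m}}=\frac{\lambda_{3}n_{m}}{n}\bm{UW}^{\top}\bm{XX}^{\top}\bm{WU}^{\top}\bm{Z}_{m}+\lambda_{4}\bm{UW}^{\top}\bm{X}_{m}\bm{X}_{m}^{\top}\bm{WU}^{\top}\bm{Z}_{m}.$$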

To satisfy the constraint $\mathrm{diag}(\bm{Z}_{m}\bm{Z}_{m}^{\top})=\mathbf{1}$ , we project each row of $\bm{Z}_{m}$ onto the unit norm ball after each update:

$$\bm{z}_{m,j}\leftarrow\frac{\bm{z}_{m,j}}{\|\bm{z}_{m,j}\|_{2}},$$

where $\bm{z}_{m,j}$ is the $j$ th row of $\bm{Z}_{m}$ .
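One projected gradient step for $\bm{Z}_{m}$ can be sketched as follows. This is a plain NumPy illustration, not the Manopt-based routine: it assumes the gradient $(\frac{\lambda_{3}n_{m}}{n}\bm{F}_{0}\bm{F}_{0}^{\top}+\lambda_{4}\bm{F}_{m}\bm{F}_{m}^{\top})\bm{Z}_{m}$ (constant factor 2 dropped), a fixed step size instead of line search, and a small `eps` guard against zero rows. All names and default values are hypothetical.

```python
import numpy as np

def project_rows_to_unit_norm(Z, eps=1e-12):
    """Rescale every row of Z to unit l2-norm, so that diag(Z Z^T) = 1."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)          # eps avoids division by zero

def z_step(Z_m, F0, Fm, n_m, n, lam3, lam4, step=1e-2):
    """One gradient step on the Z_m subproblem followed by row normalization.

    F0: (l, n) outputs on all instances; Fm: (l, n_m) outputs on group m.
    """
    grad = (lam3 * n_m / n) * (F0 @ (F0.T @ Z_m)) + lam4 * (Fm @ (Fm.T @ Z_m))
    return project_rows_to_unit_norm(Z_m - step * grad)
```

After each call, every row of the returned matrix has unit norm, so the diagonal constraint holds by construction.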

3.3.2 Updating $\bm{V}$

With $\bm{Z}_{m}$ ’s and $\bm{U},\bm{W}$ fixed, problem (4) reduces to

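$$\min_{\bm{V}}\ \|\bm{J}\circ(\bm{Y}-\bm{UV})\|_{F}^{2}+\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\|\bm{V}\|_{F}^{2}. \tag{6}$$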

Notice that the columns of $\bm{V}$ are independent of each other, and thus $\bm{V}$ can be solved column by column. Let $\bm{j}_{i}$ and $\bm{v}_{i}$ be the $i$th columns of $\bm{J}$ and $\bm{V}$, respectively. The optimization problem for $\bm{v}_{i}$ can be written as:

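$$\min_{\bm{v}_{i}}\ \|\mathrm{Diag}(\bm{j}_{i})\bm{y}_{i}-\mathrm{Diag}(\bm{j}_{i})\bm{U}\bm{v}_{i}\|_{2}^{2}+\lambda\|\bm{v}_{i}-\bm{W}^{\top}\bm{x}_{i}\|_{2}^{2}+\lambda_{2}\|\bm{v}_{i}\|_{2}^{2}.$$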

Setting the gradient w.r.t. $\bm{v}_{i}$ to 0, we obtain the following closed-form solution of $\bm{v}_{i}$ :

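$$\bm{v}_{i}=\big(\bm{U}^{\top}\mathrm{Diag}(\bm{j}_{i})\bm{U}+(\lambda+\lambda_{2})\bm{I}\big)^{-1}\big(\lambda\bm{W}^{\top}\bm{x}_{i}+\bm{U}^{\top}\mathrm{Diag}(\bm{j}_{i})\bm{y}_{i}\big).$$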

This involves computing a matrix inverse for each $i$ . If this is expensive, we can use gradient descent instead. The gradient of the objective in (6) w.r.t. $\bm{V}$ is

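$$\nabla_{\bm{V}}=\bm{U}^{\top}(\bm{J}\circ(\bm{UV}-\bm{Y}))+\lambda(\bm{V}-\bm{W}^{\top}\bm{X})+\lambda_{2}\bm{V}.$$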
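The column-wise update of $\bm{V}$ can be sketched as below. The code assumes the subproblem is the masked squared error plus $\lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\|\bm{V}\|_{F}^{2}$, whose stationarity condition for column $i$ is $(\bm{U}^{\top}\mathrm{Diag}(\bm{j}_{i})\bm{U}+(\lambda+\lambda_{2})\bm{I})\bm{v}_{i}=\lambda\bm{W}^{\top}\bm{x}_{i}+\bm{U}^{\top}\mathrm{Diag}(\bm{j}_{i})\bm{y}_{i}$. It solves each $k\times k$ linear system rather than forming an explicit inverse; the function names and defaults are illustrative.

```python
import numpy as np

def update_V_closed_form(Y, X, U, W, lam=1.0, lam2=0.1):
    """Solve the V subproblem exactly, one column (instance) at a time."""
    J = (Y != 0).astype(float)                 # observed-entry indicator
    k, n = U.shape[1], Y.shape[1]
    WtX = W.T @ X                              # (k, n) latent regression targets
    V = np.empty((k, n))
    for i in range(n):
        UtD = U.T * J[:, i]                    # U^T Diag(j_i), shape (k, l)
        A = UtD @ U + (lam + lam2) * np.eye(k)
        b = lam * WtX[:, i] + UtD @ Y[:, i]
        V[:, i] = np.linalg.solve(A, b)
    return V

def grad_V(Y, X, U, V, W, lam=1.0, lam2=0.1):
    """Gradient of the V subproblem (constant factor 2 dropped)."""
    J = (Y != 0).astype(float)
    return U.T @ (J * (U @ V - Y)) + lam * (V - W.T @ X) + lam2 * V
```

At the closed-form solution, `grad_V` should vanish up to numerical error, which gives a quick sanity check on either routine.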

3.3.3 Updating $\bm{U}$

With $\bm{Z}_{m}$ ’s and $\bm{V},\bm{W}$ fixed, problem (4) reduces to

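$$\min_{\bm{U}}\ \|\bm{J}\circ(\bm{Y}-\bm{UV})\|_{F}^{2}+\lambda_{2}\|\bm{U}\|_{F}^{2}+\sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{0})+\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{m})\Big).$$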

Again, we use gradient descent, and the gradient w.r.t. $\bm{U}$ is:

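$$\nabla_{\bm{U}}=(\bm{J}\circ(\bm{UV}-\bm{Y}))\bm{V}^{\top}+\lambda_{2}\bm{U}+\sum_{m=1}^{g}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{U}\Big(\frac{\lambda_{3}n_{m}}{n}\bm{W}^{\top}\bm{XX}^{\top}\bm{W}+\lambda_{4}\bm{W}^{\top}\bm{X}_{m}\bm{X}_{m}^{\top}\bm{W}\Big).$$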

3.3.4 Updating $\bm{W}$

With $\bm{Z}_{m}$ ’s and $\bm{U},\bm{V}$ fixed, problem (4) reduces to

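$$\min_{\bm{W}}\ \lambda\|\bm{V}-\bm{W}^{\top}\bm{X}\|_{F}^{2}+\lambda_{2}\|\bm{W}\|_{F}^{2}+\sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\mathrm{tr}(\bm{F}_{0}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{0})+\lambda_{4}\mathrm{tr}(\bm{F}_{m}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{F}_{m})\Big).$$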

The gradient w.r.t. $\bm{W}$ is:

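$$\nabla_{\bm{W}}=\lambda\bm{X}(\bm{X}^{\top}\bm{W}-\bm{V}^{\top})+\lambda_{2}\bm{W}+\sum_{m=1}^{g}\Big(\frac{\lambda_{3}n_{m}}{n}\bm{XX}^{\top}+\lambda_{4}\bm{X}_{m}\bm{X}_{m}^{\top}\Big)\bm{WU}^{\top}\bm{Z}_{m}\bm{Z}_{m}^{\top}\bm{U}.$$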
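The gradients w.r.t. $\bm{U}$ and $\bm{W}$ can be checked against the objective by finite differences. The sketch below is an illustration under stated assumptions: the objective is that of problem (4) with $\mathcal{R}=\|\bm{U}\|_{F}^{2}+\|\bm{V}\|_{F}^{2}+\|\bm{W}\|_{F}^{2}$; the weight $\frac{\lambda_{3}n_{m}}{n}$ is paired with $\bm{XX}^{\top}$ (global term) and $\lambda_{4}$ with $\bm{X}_{m}\bm{X}_{m}^{\top}$ (local term), which is what differentiating the trace terms gives; and the constant factor 2 is dropped, so each function returns half of the true gradient. The names, and the representation of groups as lists of column indices, are hypothetical.

```python
import numpy as np

def objective(Y, X, U, V, W, groups, Zs, lam, lam2, lam3, lam4):
    """Objective of problem (4); groups[m] holds the column indices of X_m."""
    J = (Y != 0)                               # observed-entry indicator
    n = X.shape[1]
    F0 = U @ W.T @ X                           # classifier outputs, shape (l, n)
    val = np.sum((J * (Y - U @ V)) ** 2) + lam * np.sum((V - W.T @ X) ** 2)
    val += lam2 * (np.sum(U ** 2) + np.sum(V ** 2) + np.sum(W ** 2))
    for idx, Z in zip(groups, Zs):
        # tr(F^T Z Z^T F) equals the squared Frobenius norm of Z^T F
        val += lam3 * len(idx) / n * np.sum((Z.T @ F0) ** 2)
        val += lam4 * np.sum((Z.T @ F0[:, idx]) ** 2)
    return val

def grad_U(Y, X, U, V, W, groups, Zs, lam, lam2, lam3, lam4):
    """Half of the gradient of the objective w.r.t. U."""
    J = (Y != 0)
    n = X.shape[1]
    WtX = W.T @ X
    G = (J * (U @ V - Y)) @ V.T + lam2 * U
    for idx, Z in zip(groups, Zs):
        M = lam3 * len(idx) / n * (WtX @ WtX.T) + lam4 * (WtX[:, idx] @ WtX[:, idx].T)
        G += Z @ (Z.T @ U) @ M
    return G

def grad_W(Y, X, U, V, W, groups, Zs, lam, lam2, lam3, lam4):
    """Half of the gradient of the objective w.r.t. W."""
    n = X.shape[1]
    G = lam * X @ (X.T @ W - V.T) + lam2 * W
    for idx, Z in zip(groups, Zs):
        A = lam3 * len(idx) / n * (X @ X.T) + lam4 * (X[:, idx] @ X[:, idx].T)
        G += A @ W @ (U.T @ Z) @ (Z.T @ U)
    return G
```

A central difference of `objective` along a random direction `E` should match `2 * np.sum(grad * E)`, since the objective is quadratic in each of $\bm{U}$ and $\bm{W}$ when the other variables are fixed.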

4 Experiments

In this section, extensive experiments are performed on text and image datasets. Performance in both the full-label and missing-label cases is discussed.

4.1 Setup

4.1.1 Data sets

On text, eleven Yahoo datasets¹ (Arts, Business, Computers, Education, Entertainment, Health, Recreation, Reference, Science, Social and Society) and the Enron dataset² are used. On images, the Corel5k³ and Image³ datasets are used. In the sequel, each dataset is denoted by its first three letters.⁴ Detailed information on the datasets is shown in Table 1. For each dataset, we randomly select $60\%$ of the instances for training, and the rest for testing.

¹ http://www.kecl.ntt.co.jp/as/members/ueda/yahoo.tar

² http://mulan.sourceforge.net/datasets-mlc.html

³ http://cse.seu.edu.cn/people/zhangml/files/Image.rar

⁴ “Society” is denoted “Soci”, so as to distinguish it from “Social”.

4.1.2 Baselines

In the GLOCAL algorithm, we use the k-means clustering algorithm to partition the data into local groups. The solution of Eqn. (1) is used to warm-start $\bm{U},\bm{V}$ and $\bm{W}$. The $\bm{Z}_{m}$’s are randomly initialized. GLOCAL is compared with the following state-of-the-art multi-label learning algorithms:

1. BR boutell2004learning, which trains a binary linear SVM (using the LIBLINEAR package REF08a) for each label independently;

2. MLLOC huang2012, which exploits local label correlations by encoding them into the instance’s feature representation;

3. LEML Yu2014, which learns a linear instance-to-label mapping with low-rank structure, and implicitly takes advantage of global label correlation;

4. ML-LRC xu2014learning, which learns and exploits low-rank global label correlations for multi-label classification with missing labels.

Note that BR does not take label correlation into account. MLLOC considers only local label correlations; LEML implicitly uses global label correlations, whereas ML-LRC models global label correlation directly. On the ability to handle missing labels, BR and MLLOC can only learn with full labels.

For simplicity, we set $\lambda=1$ in GLOCAL. The other parameters, as well as those of the baseline methods, are selected via 5-fold cross-validation on the training set. All the algorithms are implemented in Matlab (with some C++ code for LEML).

4.1.3 Performance Evaluation

Let $p$ be the number of test instances, $\bm{C}_{i}^{+},\bm{C}_{i}^{-}$ be the sets of positive and negative labels associated with the $i$ th instance; and $\bm{Z}_{j}^{+},\bm{Z}_{j}^{-}$ be the sets of positive and negative instances belonging to the $j$ th label. Given input $\bm{x}$ , let $\mathrm{rank}_{\bm{f}}(\bm{x},y)$ be the rank of label $y$ in the predicted label ranking (sorted in descending order). For performance evaluation, we use the following popular metrics in multi-label learning zhang2014review :

1. Ranking loss (Rkl): This is the fraction of positive-negative label pairs in which the negative label is ranked at least as high as the positive one. For instance $i$, define $\bm{Q}_{i}=\{(j^{\prime},j^{\prime\prime})\;|\;f_{j^{\prime}}(\bm{x}_{i})\leq f_{j^{\prime\prime}}(\bm{x}_{i}),(j^{\prime},j^{\prime\prime})\in\bm{C}_{i}^{+}\times\bm{C}_{i}^{-}\}$. Then,

$$\mathrm{Rkl}=\frac{1}{p}\sum_{i=1}^{p}\frac{|\bm{Q}_{i}|}{|\bm{C}_{i}^{+}||\bm{C}_{i}^{-}|}.$$

2. Average AUC (Auc): This is the fraction of positive-negative instance pairs in which the positive instance is ranked at least as high as the negative one, averaged over all labels. Specifically, for label $j$, define $\tilde{\bm{Q}}_{j}=\{(i^{\prime},i^{\prime\prime})\;|\;f_{j}(\bm{x}_{i^{\prime}})\geq f_{j}(\bm{x}_{i^{\prime\prime}}),(\bm{x}_{i^{\prime}},\bm{x}_{i^{\prime\prime}})\in\bm{Z}_{j}^{+}\times\bm{Z}_{j}^{-}\}$. Then,

$$\mathrm{Auc}=\frac{1}{l}\sum_{j=1}^{l}\frac{|\tilde{\bm{Q}}_{j}|}{|\bm{Z}_{j}^{+}||\bm{Z}_{j}^{-}|}.$$

3. Coverage (Cvg): This counts how many steps are needed to move down the predicted label ranking so as to cover all the positive labels of an instance, averaged over all instances.

$$\mathrm{Cvg}=\frac{1}{p}\sum_{i=1}^{p}\max\{\mathrm{rank}_{\bm{f}}(\bm{x}_{i},j)\mid j\in\bm{C}_{i}^{+}\}-1.$$

4. Average precision (Ap): This is the average fraction of positive labels ranked at least as high as a particular positive label. For instance $i$, define $\hat{\bm{Q}}_{i,c}=\{j\;|\;\mathrm{rank}_{\bm{f}}(\bm{x}_{i},j)\leq\mathrm{rank}_{\bm{f}}(\bm{x}_{i},c),j\in\bm{C}_{i}^{+}\}$. Then,

$$\mathrm{Ap}=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{|\bm{C}_{i}^{+}|}\sum_{c\in\bm{C}_{i}^{+}}\frac{|\hat{\bm{Q}}_{i,c}|}{\mathrm{rank}_{\bm{f}}(\bm{x}_{i},c)}.$$

For Auc and Ap, the higher the better; whereas for Rkl and Cvg, the lower the better. To reduce statistical variability, results are averaged over 10 independent repetitions.
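The four metrics can be sketched in NumPy as follows. This is an illustration rather than the evaluation code used in the experiments: it assumes score and label matrices of shape (labels, instances) with labels in $\{-1,+1\}$, breaks score ties by label index when ranking, and skips instances (or labels) for which a metric is undefined because one of the two classes is empty. All function names are hypothetical.

```python
import numpy as np

def label_ranks(scores):
    """Rank of each label for one instance (1 = highest score)."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def ranking_loss(F, Y):
    """Fraction of (positive, negative) label pairs with f_pos <= f_neg."""
    vals = []
    for f, y in zip(F.T, Y.T):                 # iterate over instances
        pos, neg = f[y > 0], f[y < 0]
        if len(pos) and len(neg):              # assumption: skip undefined cases
            vals.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(vals))

def average_auc(F, Y):
    """Fraction of (positive, negative) instance pairs with f_pos >= f_neg."""
    vals = []
    for f, y in zip(F, Y):                     # iterate over labels
        pos, neg = f[y > 0], f[y < 0]
        if len(pos) and len(neg):
            vals.append(np.mean(pos[:, None] >= neg[None, :]))
    return float(np.mean(vals))

def coverage(F, Y):
    """Average rank of the lowest-ranked positive label, minus one."""
    vals = [label_ranks(f)[y > 0].max() - 1
            for f, y in zip(F.T, Y.T) if (y > 0).any()]
    return float(np.mean(vals))

def average_precision(F, Y):
    """Mean over positives c of (#positives ranked at or above c) / rank(c)."""
    vals = []
    for f, y in zip(F.T, Y.T):
        pos_ranks = np.sort(label_ranks(f)[y > 0])
        if len(pos_ranks):
            vals.append(np.mean(np.arange(1, len(pos_ranks) + 1) / pos_ranks))
    return float(np.mean(vals))
```

Note that skipping undefined instances changes the denominator relative to averaging over all $p$ test instances, so the two conventions coincide only when every test instance has at least one positive and one negative label.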

4.2 Learning with Full Labels

In this experiment, all elements in the training label matrix are observed. Performance on the test data is shown in Table 2. As expected, BR is the worst, since it treats each label independently without considering label correlations. MLLOC only considers local label correlations and LEML only makes use of the low-rank structure. Though ML-LRC takes advantage of both the low-rank structure and label correlations, only global label correlations are considered. As a result, GLOCAL is the best overall, as it models both global and local label correlations.

To show the example correlations learned by GLOCAL, we use two local groups extracted from the Image dataset. Figure 1 shows that local label correlation does vary from group to group, and is different from global correlation. For group 1, “sunset” is highly correlated with “desert” and “sea” (Figure 1(c)). This can also be seen from the images in Figure 1(a). Moreover, “trees” sometimes co-occurs with “deserts” (first and last images in Figure 1(a)). However, in group 2 (Figure 1(d)), “mountain” and “sea” often occur together and “trees” occurs less often with “desert” (Figure 1(b)). Figure 1(e) shows the learned global label correlation: “sea” and “sunset”, “mountain” and “trees” are positively correlated, whereas “desert” and “sea”, “desert” and “trees” are negatively correlated. All these correlations are consistent with intuition.

To further validate the effectiveness of global and local label correlations, we study two degenerate versions of GLOCAL: (i) GLObal, which uses only global label correlations; and (ii) loCAL, which uses only local label correlations. Note that the local groups obtained by clustering are not of equal sizes. For some datasets, the largest cluster contains more than $40\%$ of instances, while some small ones contain fewer than $5\%$ each. Global correlation is then dominated by the local correlation matrix of the largest cluster (Proposition 1), making the performance difference on the whole test set obscure. Hence, we focus on the performance of the small clusters. As can be seen from Table 3, using only global or local correlation may be good enough on some data sets (such as Health). On the other hand, considering both types of correlation as in GLOCAL achieves comparable or even better performance.

4.3 Learning with Missing Labels

In this experiment, we randomly sample $\rho\%$ of the elements in the label matrix as observed, and the rest as missing. Note that BR and MLLOC can only handle datasets with full labels. Hence, we first use MAXIDE xu2013speedup , a matrix completion algorithm for transductive multi-label learning, to fill in the missing labels before they can be applied. We use MBR for MAXIDE+BR, and MMLLOC for MAXIDE+MLLOC.
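The missing-label setting can be simulated as sketched below, assuming that observed entries are drawn uniformly at random without replacement and that missing entries are encoded as 0, as in the observed label matrix $\bm{Y}$. The function name, the seed argument and the rounding of the number of observed entries are illustrative choices.

```python
import numpy as np

def mask_labels(Y_full, rho, seed=0):
    """Keep about rho percent of the entries of Y_full; set the rest to 0.

    Returns the observed label matrix and the boolean indicator matrix J.
    """
    rng = np.random.default_rng(seed)
    n_total = Y_full.size
    n_obs = int(round(n_total * rho / 100.0))
    observed = rng.permutation(n_total)[:n_obs]    # flat indices of kept entries
    J = np.zeros(n_total, dtype=bool)
    J[observed] = True
    J = J.reshape(Y_full.shape)
    return np.where(J, Y_full, 0), J
```

For example, `mask_labels(Y_full, 30)` leaves 30% of the entries observed and marks the remaining 70% as missing.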

Tables 4 and 5 show the results on the training and test data, respectively. (To fit the tables on one page, we do not report the standard deviation. MBR, which performs worst, is also not shown.)

As can be seen, performance increases with more observed entries in general. Overall, GLOCAL performs best at different $\rho$ ’s, as it simultaneously considers both global and local label correlations with label manifold regularization. In contrast, MBR and MMLLOC handle label recovery and learning separately. Moreover, MMLLOC takes only local label correlation, and MBR does not consider label correlations. As a result, they perform much worse than GLOCAL. Though LEML and ML-LRC perform learning with missing label recovery together, they consider only global correlation, and are thus often worse than GLOCAL.

4.4 Convergence

In this section, we empirically study the convergence of GLOCAL. Figure 2 shows the objective value w.r.t. the number of iterations for the full-label case. Because of the lack of space, results are only shown on the Arts, Business, Enron and Image datasets. As can be seen, the objective converges quickly in a few iterations. A similar phenomenon can be observed on the other datasets.

Table 6 shows the timing results on learning with missing labels (with $\rho=70$). GLOCAL and LEML train a classifier for all the labels jointly, and can also take advantage of the low-rank structure of either the model or the label matrix during training. Thus, they are the fastest. However, GLOCAL has to be warm-started by Eqn. (1), and requires an additional clustering step to obtain local groups of the instances. Hence, it is slower than LEML. However, as has been observed in previous sections, GLOCAL outperforms LEML in terms of label recovery. ML-LRC uses a low-rank label correlation matrix. However, it does not reduce the size of the label matrix or model involved in each iteration, and so is slower than GLOCAL. MBR and MMLLOC require training a classifier for each label, and also an additional step to recover the missing labels. Thus, they are often the slowest, especially when the number of class labels is large. Similar results can be observed with $\rho=30$, which are not reported here.

4.5 Sensitivity to Parameters

In this experiment, we study the influence of parameters, including the number of clusters $g$ , regularization parameters $\lambda_{3}$ and $\lambda_{4}$ (corresponding to the manifold regularizer for global and local label correlations, respectively), regularization parameter $\lambda_{2}$ for the Frobenius norm regularizer, and dimensionality $k$ of the latent representation. We vary one parameter, while keeping the others fixed at their best setting.

4.5.1 Varying the Number of Clusters $g$

Figure 3 shows the influence on the Enron dataset. When there is only one cluster, no local label correlation is considered. With more clusters, performance improves as more local label correlations are taken into account. When too many clusters are used, very few instances are placed in each cluster, and the local label correlations cannot be reliably estimated. Thus, the performance starts to deteriorate.

4.5.2 Influence of Label Manifold Regularizers ( $\lambda_{3}$ and $\lambda_{4}$ )

A larger $\lambda_{3}$ means higher importance of global label correlation, whereas a larger $\lambda_{4}$ means higher importance of local label correlation. Figures 4 and 5 show their effects on the Enron dataset. When $\lambda_{3}=0$, only local label correlations are considered, and the performance is poor. With increasing $\lambda_{3}$, performance improves. However, when $\lambda_{3}$ is very large, performance deteriorates as the global label correlations dominate. A similar phenomenon can be observed for $\lambda_{4}$.

4.5.3 Varying the Latent Representation Dimensionality $k$

Figure 6 shows the effect of varying $k$ on the Enron dataset. As can be seen, when $k$ is too small, the latent representation cannot capture enough information. With increasing $k$ , performance improves. When $k$ is too large, the low-rank structure is not fully utilized, and performance starts to get worse.

4.5.4 Influence of $\lambda_{2}$

Figure 7 shows the effect of varying $\lambda_{2}$ on the Enron dataset. As can be seen, GLOCAL is not sensitive to this parameter.

5 Conclusion

In this paper, we proposed a new multi-label correlation learning approach, GLOCAL, which simultaneously recovers the missing labels, trains the classifier and exploits both global and local label correlations, through learning a latent label representation and optimizing the label manifolds. Compared with previous work, it is the first to exploit both global and local label correlations, and it learns the Laplacian matrices directly without requiring any other prior knowledge on label correlations. As a result, the classifier outputs and label correlations best match each other, both globally and locally. Moreover, GLOCAL provides a unified solution for both full-label and missing-label multi-label learning. Experimental results show that our approach outperforms the state-of-the-art multi-label learning approaches on learning with both full labels and missing labels. In our work, we handle the case where label correlations are symmetric. In many situations, correlations can be asymmetric. For example, “mountain” is highly correlated with “tree”, since it is very common for a mountain to have trees on it. However, “tree” may be less correlated with “mountain”, because trees are found not only on mountains, but also in streets, parks, etc. It is thus desirable to study asymmetric label correlations in future work.

Acknowledgment

This research was supported by NSFC (61333014), 111 Project (B14020), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Bibliography

(1) M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
(2) N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014.
(3) M. Boutell, J. Luo, X. Shen, and C. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
(4) H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker. Network-based classification of breast cancer metastasis. Molecular Systems Biology, 3(1):140–149, 2007.
(5) F. Chung. Spectral Graph Theory, volume 92. American Mathematical Soc., 1997.
(6) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
(7) J. Fürnkranz, E. Hüllermeier, E. Mencía, and K. Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
(8) A. Goldberg, B. Recht, J. Xu, R. Nowak, and X. Zhu. Transduction with matrix completion: Three birds with one stone. In Advances in Neural Information Processing Systems 23, pages 757–765, 2010.
