Complementary to Multiple Labels: A Correlation-Aware Correction Approach

Yi Gao; Miao Xu; Min-Ling Zhang

arXiv:2302.12987·cs.LG·June 25, 2024

TL;DR

This paper introduces a correlation-aware correction method for multi-labeled complementary label learning, addressing the challenge of estimating transition matrices without multi-labeled data and improving multi-label classification accuracy.

Contribution

It proposes a novel two-step transition matrix estimation approach that incorporates label correlations, enhancing multi-label complementary label learning performance.

Findings

01

The proposed method outperforms existing approaches in experiments.

02

The correction of transition matrices improves multi-label classification accuracy.

03

The approach is classifier-consistent and mitigates noise overfitting.

Abstract

Complementary label learning (CLL) requires annotators to give irrelevant labels instead of relevant labels for instances. Currently, CLL has shown its promising performance on multi-class data by estimating a transition matrix. However, current multi-class CLL techniques cannot work well on multi-labeled data since they assume each instance is associated with one label while each multi-labeled instance is relevant to multiple labels. Here, we show theoretically how the estimated transition matrix in multi-class CLL could be distorted in multi-labeled cases as they ignore co-existing relevant labels. Moreover, theoretical findings reveal that calculating a transition matrix from label correlations in multi-labeled CLL (ML-CLL) needs multi-labeled data, while this is unavailable for ML-CLL. To solve this issue, we propose a two-step method to estimate the…

Tables

Table 1. TABLE I: Statistics of datasets.

Datasets	$|\mathcal{S}|$	$dim(\mathcal{S})$	$L(\mathcal{S})$	$LCard(\mathcal{S})$
scene	2407	294	6	1.07
yeast	2417	103	14	4.23
eurlex_dc	8636	5000	15	1.02
eurlex_sm	13270	5000	15	1.74
corel5k	4194	499	15	1.70
corel16k	11103	120	15	1.77
bookmark	38912	2150	15	1.25
delicious	14784	500	15	4.32
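
For reference, the label cardinality column is conventionally defined as the average number of relevant labels per instance. The formula below is the standard definition from the multi-label literature, given here as an assumption about how the column was computed; $Y_{i}$ (the relevant label set of the $i$-th instance) is a symbol introduced only for this illustration.

```latex
LCard(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum_{i=1}^{|\mathcal{S}|} |Y_{i}|
```

Under this reading, scene (1.07) is close to single-label data, while yeast (4.23) and delicious (4.32) average more than four relevant labels per instance.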

Table 2. TABLE II: Experimental results (mean ± std) on training data with uniform complementary labels. The best performance on each dataset is presented in boldface, where $∙$/$\circ$ indicates whether MLCL is superior/inferior to the baseline (t-test at the 5% significance level).

Methods	ML-KNN	LIFT	fpml	PML-lc	PML-LRS	L-UW	MLCL
Ranking loss $↓$
scene	.340±.032 $∙$	.289±.020 $∙$	.504±.025 $∙$	.490±.025 $∙$	.258±.007 $\circ$	.372±.028 $∙$	.259±.030
yeast	.247±.012 $∙$	.298±.012 $∙$	.233±.013 $∙$	.251±.015 $∙$	.464±.019 $∙$	.214±.011	.211±.013
eurlex_dc	.303±.016 $∙$	.286±.016 $∙$	.488±.033 $∙$	.347±.025 $∙$	.316±.011 $∙$	.598±.024 $∙$	.229±.026
eurlex_sm	.336±.010 $∙$	.346±.012 $∙$	.488±.006 $∙$	.436±.011 $∙$	.332±.009 $∙$	.646±.015 $∙$	.312±.014
corel5k	.379±.034	.433±.037 $∙$	.444±.026 $∙$	.406±.075 $∙$	.334±.009 $\circ$	.367±.031	.349±.035
corel16k	.328±.047	.392±.027 $∙$	.420±.033 $∙$	.457±.046 $∙$	.303±.005	.303±.035	.289±.042
bookmark	.384±.006 $∙$	.310±.007 $∙$	.469±.019 $∙$	.454±.036 $∙$	.260±.004	.303±.010 $∙$	.252±.013
delicious	.398±.004 $∙$	.383±.003 $∙$	.438±.008 $∙$	.445±.015 $∙$	.305±.002	.302±.006	.310±.003
One Error $↓$
scene	.692±.030 $∙$	.605±.023 $∙$	.815±.027 $∙$	.717±.021 $∙$	.540±.023 $∙$	.609±.041 $∙$	.427±.018
yeast	.297±.029 $∙$	.284±.028 $∙$	.251±.025	.583±.026 $∙$	.738±.102 $∙$	.251±.025	.251±.023
eurlex_dc	.776±.031 $∙$	.670±.013 $∙$	.925±.016 $∙$	.774±.015 $∙$	.847±.010 $∙$	.837±.034 $∙$	.594±.035
eurlex_sm	.689±.012 $∙$	.679±.009 $∙$	.872±.011 $∙$	.662±.012	.731±.005 $∙$	.696±.008 $∙$	.656±.029
corel5k	.815±.048 $∙$	.842±.056 $∙$	.854±.035 $∙$	.811±.062 $∙$	.756±.010	.769±.034	.736±.065
corel16k	.736±.056	.789±.046 $∙$	.816±.025 $∙$	.946±.028 $∙$	.730±.000 $∙$	.693±.057	.690±.056
bookmark	.801±.006 $∙$	.649±.016 $∙$	.885±.020 $∙$	.798±.005 $∙$	.584±.005 $∙$	.590±.022 $∙$	.509±.012
delicious	.592±.018 $∙$	.533±.015 $∙$	.618±.017 $∙$	.679±.011 $∙$	.452±.007	.467±.023 $∙$	.448±.016
Hamming loss $↓$
scene	.820±.002 $∙$	.820±.003 $∙$	.819±.002 $∙$	.251±.007	.814±.000 $∙$	.518±.042 $∙$	.264±.027
yeast	.697±.012 $∙$	.697±.013 $∙$	.697±.013 $∙$	.268±.010 $∙$	.316±.000 $∙$	.243±.010	.235±.008
eurlex_dc	.932±.000 $∙$	.932±.000 $∙$	.118±.006 $∙$	.104±.002 $∙$	.890±.039 $∙$	.806±.015 $∙$	.092±.005
eurlex_sm	.883±.001 $∙$	.883±.001 $∙$	.148±.005 $∙$	.138±.002	.825±.027 $∙$	.773±.008 $∙$	.139±.005
corel5k	.886±.007 $∙$	.887±.007 $∙$	.887±.007 $∙$	.155±.004	.869±.002 $∙$	.463±.018 $∙$	.229±.068
corel16k	.882±.009 $∙$	.882±.009 $∙$	.882±.009 $∙$	.177±.011 $\circ$	.862±.001 $∙$	.423±.033 $∙$	.202±.067
bookmark	.917±.001 $∙$	.916±.001 $∙$	.420±.009 $∙$	.123±.001 $\circ$	.813±.001 $∙$	.409±.014 $∙$	.140±.004
delicious	.711±.003 $∙$	.711±.003 $∙$	.711±.003 $∙$	.394±.011 $∙$	.459±.002 $∙$	.369±.027 $∙$	.289±.004
Coverage $↓$
scene	.299±.026 $∙$	.256±.017 $∙$	.434±.021 $∙$	.420±.021 $∙$	.230±.006	.328±.022 $∙$	.234±.025
yeast	.579±.018 $∙$	.649±.020 $∙$	.553±.033 $∙$	.506±.023	.742±.027 $∙$	.525±.017	.525±.021
eurlex_dc	.285±.014 $∙$	.269±.015 $∙$	.458±.031 $∙$	.326±.023 $∙$	.298±.010 $∙$	.334±.017 $∙$	.204±.023
eurlex_sm	.416±.010 $∙$	.427±.013 $∙$	.569±.010 $∙$	.509±.013 $∙$	.419±.010 $∙$	.519±.008 $∙$	.365±.014
corel5k	.473±.034	.516±.035 $∙$	.529±.028 $∙$	.492±.072	.429±.008	.457±.038	.445±.048
corel16k	.430±.044	.488±.027 $∙$	.513±.035 $∙$	.537±.051 $∙$	.405±.008	.407±.033	.393±.042
bookmark	.359±.007 $∙$	.328±.008 $∙$	.475±.019 $∙$	.458±.035 $∙$	.280±.004	.292±.011 $∙$	.279±.011
delicious	.712±.006 $∙$	.703±.004 $∙$	.726±.009 $∙$	.695±.009 $∙$	.609±.003 $\circ$	.613±.006 $\circ$	.632±.007
Average Precision $↑$
scene	.543±.024 $∙$	.600±.017 $∙$	.417±.021 $∙$	.465±.018 $∙$	.637±.011 $∙$	.568±.026 $∙$	.699±.017
yeast	.677±.019 $∙$	.636±.017 $∙$	.688±.017 $∙$	.610±.016 $∙$	.459±.032 $∙$	.712±.020	.718±.019
eurlex_dc	.412±.018 $∙$	.471±.012 $∙$	.232±.022 $∙$	.373±.015 $∙$	.346±.009 $∙$	.250±.031 $∙$	.549±.025
eurlex_sm	.419±.010 $∙$	.421±.010 $∙$	.273±.006 $∙$	.367±.009 $∙$	.402±.005 $∙$	.285±.009 $∙$	.474±.017
corel5k	.355±.035 $∙$	.307±.038 $∙$	.297±.023 $∙$	.330±.044 $∙$	.397±.010	.371±.028	.391±.037
corel16k	.405±.050	.350±.035 $∙$	.325±.022 $∙$	.248±.026 $∙$	.424±.006	.437±.044	.449±.049
bookmark	.383±.007 $∙$	.480±.010 $∙$	.267±.019 $∙$	.329±.016 $∙$	.534±.004 $∙$	.506±.014 $∙$	.584±.013
delicious	.487±.006 $∙$	.511±.004 $∙$	.457±.006 $∙$	.446±.010 $∙$	.580±.002	.570±.009	.572±.005

Table 3. TABLE III: Experimental results (mean ± std) on training data with biased complementary labels. The best performance on each dataset is presented in boldface, where $∙$/$\circ$ indicates whether MLCL is superior/inferior to the baseline (t-test at the 5% significance level).

Methods	ML-KNN	LIFT	fpml	PML-lc	PML-LRS	L-UW	MLCL
Ranking loss $↓$
scene	.086±.015 $\circ$	.319±.025	.486±.027 $∙$	.492±.019 $∙$	.258±.013 $\circ$	.368±.025 $∙$	.326±.050
yeast	.240±.014 $∙$	.297±.016 $∙$	.227±.013 $∙$	.248±.012 $∙$	.454±.024 $∙$	.202±.012	.199±.012
eurlex_dc	.668±.009 $∙$	.636±.021 $∙$	.537±.015 $∙$	.349±.028 $∙$	.326±.009	.586±.036 $∙$	.308±.034
eurlex_sm	.364±.020 $∙$	.392±.014 $∙$	.499±.019 $∙$	.447±.012 $∙$	.333±.009 $∙$	.641±.015 $∙$	.316±.016
corel5k	.324±.038 $\circ$	.431±.030 $∙$	.474±.028 $∙$	.386±.047	.357±.012	.382±.033	.358±.039
corel16k	.413±.063 $∙$	.431±.041 $∙$	.454±.033 $∙$	.471±.068 $∙$	.375±.015	.373±.029	.357±.040
bookmark	.567±.007 $∙$	.449±.042 $∙$	.552±.018 $∙$	.491±.016 $∙$	.244±.003 $∙$	.326±.008 $∙$	.211±.011
delicious	.430±.005 $∙$	.413±.005 $∙$	.452±.008 $∙$	.433±.011 $∙$	.314±.003 $\circ$	.349±.012 $\circ$	.360±.008
One Error $↓$
scene	.228±.032 $\circ$	.669±.043 $∙$	.803±.038 $∙$	.720±.018 $∙$	.613±.017 $∙$	.696±.025 $∙$	.553±.054
yeast	.330±.032 $∙$	.280±.025 $∙$	.254±.028	.583±.027 $∙$	.546±.097 $∙$	.256±.025	.254±.024
eurlex_dc	.977±.005 $∙$	.959±.014 $∙$	.947±.008 $∙$	.774±.015 $∙$	.822±.004 $∙$	.822±.038 $∙$	.695±.074
eurlex_sm	.699±.016 $∙$	.753±.036 $∙$	.886±.024 $∙$	.664±.014	.737±.011 $∙$	.704±.012 $∙$	.650±.045
corel5k	.738±.067	.851±.038 $∙$	.861±.034 $∙$	.828±.059 $∙$	.747±.016	.792±.039 $∙$	.752±.037
corel16k	.780±.061 $∙$	.827±.049 $∙$	.837±.025 $∙$	.952±.021 $∙$	.730±.000	.731±.053	.707±.063
bookmark	.906±.007 $∙$	.804±.037 $∙$	.925±.008 $∙$	.792±.004 $∙$	.576±.003 $∙$	.635±.022 $∙$	.502±.008
delicious	.585±.012 $∙$	.557±.013 $∙$	.617±.025 $∙$	.681±.012 $∙$	.434±.006 $\circ$	.485±.016 $∙$	.463±.017
Hamming loss $↓$
scene	.088±.009 $\circ$	.819±.002 $∙$	.820±.002 $∙$	.252±.006	.814±.000 $∙$	.523±.048 $∙$	.290±.029
yeast	.697±.012 $∙$	.697±.013 $∙$	.697±.013 $∙$	.268±.010 $∙$	.316±.000 $∙$	.253±.017 $∙$	.239±.008
eurlex_dc	.932±.000 $∙$	.932±.000 $∙$	.118±.007 $∙$	.104±.002	.889±.039 $∙$	.799±.035 $∙$	.109±.011
eurlex_sm	.883±.001 $∙$	.883±.001 $∙$	.148±.005 $∙$	.139±.002	.825±.027 $∙$	.772±.009 $∙$	.138±.007
corel5k	.114±.008 $\circ$	.887±.007 $∙$	.887±.007 $∙$	.157±.003 $∙$	.869±.002 $∙$	.498±.012 $∙$	.208±.033
corel16k	.882±.009 $∙$	.882±.009 $∙$	.882±.009 $∙$	.178±.010	.862±.001 $∙$	.481±.028 $∙$	.207±.086
bookmark	.917±.001 $∙$	.916±.001 $∙$	.419±.009 $∙$	.122±.001 $\circ$	.813±.003 $∙$	.549±.046 $∙$	.146±.003
delicious	.711±.003 $∙$	.711±.003 $∙$	.711±.003 $∙$	.388±.013 $∙$	.459±.002 $∙$	.453±.015 $∙$	.304±.005
Coverage $↓$
scene	.086±.013 $\circ$	.280±.020	.420±.023 $∙$	.420±.016 $∙$	.229±.011 $\circ$	.321±.021 $∙$	.286±.041
yeast	.551±.017 $∙$	.638±.028 $∙$	.533±.012 $∙$	.493±.025	.723±.040 $∙$	.500±.018	.498±.021
eurlex_dc	.626±.008 $∙$	.596±.019 $∙$	.504±.014 $∙$	.328±.026 $∙$	.306±.009 $∙$	.333±.018 $∙$	.274±.030
eurlex_sm	.432±.018 $∙$	.456±.014 $∙$	.579±.015 $∙$	.520±.015 $∙$	.418±.009 $∙$	.512±.009 $∙$	.362±.016
corel5k	.419±.055	.515±.024 $∙$	.555±.031 $∙$	.480±.041	.451±.013	.470±.036	.449±.038
corel16k	.498±.052 $∙$	.521±.038 $∙$	.542±.035 $∙$	.533±.066 $∙$	.454±.018	.468±.030	.453±.039
bookmark	.565±.006 $∙$	.455±.039 $∙$	.553±.017 $∙$	.492±.014 $∙$	.265±.003 $∙$	.308±.013 $∙$	.231±.011
delicious	.736±.004 $∙$	.723±.005 $∙$	.737±.008 $∙$	.691±.009	.625±.003 $\circ$	.671±.012 $\circ$	.688±.006
Average Precision $↑$
scene	.860±.020 $\circ$	.559±.028 $∙$	.428±.026 $∙$	.462±.014 $∙$	.608±.013	.529±.020 $∙$	.618±.046
yeast	.670±.023 $∙$	.634±.016 $∙$	.691±.022 $∙$	.614±.015 $∙$	.500±.026 $∙$	.719±.020	.726±.018
eurlex_dc	.145±.005 $∙$	.166±.016 $∙$	.201±.009 $∙$	.371±.020 $∙$	.357±.005 $∙$	.266±.031 $∙$	.456±.061
eurlex_sm	.405±.013 $∙$	.373±.016 $∙$	.262±.016 $∙$	.366±.010 $∙$	.400±.007 $∙$	.282±.011 $∙$	.482±.025
corel5k	.409±.040	.300±.030 $∙$	.282±.017 $∙$	.325±.048 $∙$	.392±.017	.352±.032	.380±.037
corel16k	.355±.054 $∙$	.318±.033 $∙$	.301±.024 $∙$	.240±.030 $∙$	.393±.054	.384±.036	.407±.047
bookmark	.219±.004 $∙$	.320±.037 $∙$	.212±.007 $∙$	.320±.004 $∙$	.544±.003 $∙$	.469±.014 $∙$	.599±.008
delicious	.473±.006 $∙$	.490±.006 $∙$	.450±.008 $∙$	.449±.010 $∙$	.581±.002 $\circ$	.544±.010	.544±.009

Table 4. TABLE IV: Ablation experimental results (mean ± std) on training data with uniform complementary labels. The best performance is in boldface.

Methods	Uniform complementary labels				Biased complementary labels
	scene	yeast	eurlex_dc	corel5k	scene	yeast	eurlex_dc	corel5k
	Hamming loss $↓$
MLCL	.264±.027	.235±.008	.092±.005	.229±.068	.290±.029	.239±.008	.109±.011	.208±.033
Without $𝐂$	.290±.039	.421±.011	.109±.018	.466±.025	.294±.029	.409±.012	.088±.004	.444±.031
Without ${\bar{L}}_{m s e}$	.510±.044	.229±.007	.509±.043	.461±.053	.481±.047	.230±.009	.512±.046	.489±.036
	Ranking loss $↓$
MLCL	.259±.030	.211±.013	.229±.026	.349±.035	.326±.050	.199±.012	.308±.034	.358±.039
Without $𝐂$	.282±.063	.419±.018	.277±.041	.487±.021	.348±.046	.406±.016	.268±.024	.467±.026
Without ${\bar{L}}_{m s e}$	.379±.024	.216±.010	.303±.028	.362±.030	.353±.018	.204±.011	.320±.025	.387±.027
	One error $↓$
MLCL	.427±.018	.251±.023	.594±.035	.736±.065	.553±.054	.254±.024	.695±.074	.752±.037
Without $𝐂$	.474±.047	.633±.043	.708±.106	.866±.019	.560±.042	.612±.051	.564±.029	.855±.027
Without ${\bar{L}}_{m s e}$	.607±.037	.250±.025	.740±.048	.734±.058	.686±.013	.256±.025	.753±.044	.773±.068
	Coverage $↓$
MLCL	.234±.025	.525±.021	.204±.023	.445±.048	.286±.041	.498±.021	.274±.030	.449±.038
Without $𝐂$	.255±.055	.683±.029	.247±.035	.565±.032	.306±.039	.660±.023	.240±.023	.547±.031
Without ${\bar{L}}_{m s e}$	.334±.020	.527±.011	.249±.024	.451±.035	.310±.015	.501±.015	.265±.022	.473±.023
	Average precision $↑$
MLCL	.699±.017	.718±.019	.549±.025	.391±.037	.618±.046	.726±.018	.456±.061	.380±.037
Without $𝐂$	.671±.045	.472±.018	.469±.085	.274±.014	.611±.038	.489±.015	.447±.021	.289±.022
Without ${\bar{L}}_{m s e}$	.566±.023	.711±.019	.426±.040	.389±.041	.541±.013	.717±.020	.411±.034	.359±.050

Table 5. TABLE V: Parameter sensitivity analysis on uniform complementary-label data, where the metric is average precision. The best performance is in boldface.

$β$	scene	yeast	eurlex_dc	eurlex_sm	corel5k	corel16k	bookmark	delicious
0.1	.678±.017	.714±.019	.545±.019	.451±.025	.374±.033	.444±.046	.565±.007	.554±.005
0.3	.683±.015	.716±.018	.549±.021	.460±.021	.378±.032	.447±.047	.579±.011	.565±.005
0.5	.687±.016	.718±.018	.547±.022	.463±.016	.385±.031	.447±.048	.583±.008	.575±.005
0.8	.693±.016	.718±.018	.541±.022	.469±.018	.387±.037	.448±.048	.582±.007	.572±.006
1	.699±.017	.718±.019	.549±.025	.474±.017	.391±.037	.449±.049	.584±.013	.572±.005

Table 6. TABLE VI: Experimental results (mean ± std) on five criteria. “Fully supervised” is the linear model trained on fully supervised data (fully supervised MLL). “CL” denotes that each instance is associated with a complementary label sampled uniformly. “CL & RL” trains the linear model with the loss function of Eq. (12), where each instance is equipped with a complementary label and a relevant label.

Datasets	scene	yeast	eurlex_dc	eurlex_sm	corel5k	corel16k	bookmark	delicious
Hamming loss $↓$
Fully supervised	.120±.013	.208±.009	.004±.000	.033±.001	.198±.012	.196±.012	.098±.004	.276±.006
CL	.264±.027	.235±.008	.092±.005	.139±.005	.229±.068	.202±.067	.140±.004	.289±.004
CL & RL	.124±.008	.225±.010	.005±.001	.053±.002	.178±.012	.172±.010	.085±.002	.285±.004
Ranking loss $↓$
Fully supervised	.075±.009	.169±.009	.003±.001	.019±.001	.258±.029	.222±.029	.090±.005	.226±.004
CL	.259±.030	.211±.013	.229±.026	.312±.014	.349±.035	.289±.042	.252±.013	.310±.003
CL & RL	.082±.011	.191±.011	.005±.001	.044±.002	.268±.031	.227±.021	.102±.004	.267±.004
One Error $↓$
Fully supervised	.222±.032	.223±.023	.019±.004	.069±.005	.627±.038	.588±.056	.313±.009	.340±.012
CL	.427±.018	.251±.023	.594±.035	.656±.029	.736±.065	.690±.056	.509±.012	.448±.016
CL & RL	.229±.033	.255±.032	.022±.005	.098±.007	.639±.040	.600±.044	.324±.007	.398±.017
Coverage $↓$
Fully supervised	.077±.009	.451±.019	.004±.000	.074±.002	.347±.044	.315±.024	.112±.005	.527±.007
CL	.234±.025	.525±.021	.204±.023	.365±.014	.445±.048	.393±.042	.279±.011	.632±.007
CL & RL	.084±.010	.474±.021	.006±.001	.113±.004	.363±.048	.326±.020	.125±.004	.564±.006
Average Precision $↑$
Fully supervised	.868±.018	.760±.015	.988±.003	.943±.004	.494±.024	.530±.038	.766±.007	.662±.005
CL	.699±.017	.718±.019	.549±.025	.474±.017	.391±.037	.449±.049	.584±.013	.572±.005
CL & RL	.860±.019	.734±.018	.985±.004	.899±.004	.485±.028	.523±.030	.753±.006	.618±.005

Equations

R_{L}(f) = \mathbb{E}_{p(x, Y)} [L(f(x), y)],

R_{\bar{L}}(f) = \mathbb{E}_{p(x, \bar{y})} [\bar{L}(f(x), \bar{y})],

p(\bar{y}^{j} = 1 \mid x) = \sum_{C \in Y', l_{j} \notin C} p(\bar{y}^{j} = 1 \mid Y = C) p(Y = C \mid x)
\geq \sum_{k = 1, k \neq j}^{K} p(\bar{y}^{j} = 1 \mid y^{k} = 1) p(y^{k} = 1 \mid x).

\ell_{j} = \sum_{k = 1}^{K} |T_{kj} - Q_{kj}|.

T_{z_{1} j} = \frac{p(\bar{y}^{j} = 1 \mid x)}{p(y^{z_{2}} = 1 \mid \bar{y}^{j} = 1, y^{z_{1}} = 1, x) p(y^{z_{1}} = 1 \mid x)},

T_{z_{2} j} = \frac{p(\bar{y}^{j} = 1 \mid x)}{p(y^{z_{1}} = 1 \mid \bar{y}^{j} = 1, y^{z_{2}} = 1, x) p(y^{z_{2}} = 1 \mid x)},

\ell_{j} \geq 2 (\frac{1}{\xi^{2}} - 1) p(\bar{y}^{j} = 1 \mid x),

\ell_{j} \geq m (\frac{1}{\xi^{m}} - 1) p(\bar{y}^{j} = 1 \mid x),

S_{kj}
= p(\bar{y}^{j} = 1 \mid y^{k} = 1) \int p(x \mid \bar{y}^{j} = 1, y^{k} = 1) dx
= \int p(\bar{y}^{j} = 1 \mid y^{k} = 1, x) p(x \mid y^{k} = 1) dx
= \mathbb{E}_{p(x \mid y^{k} = 1)} [p(\bar{y}^{j} = 1 \mid y^{k} = 1, x)],

S_{kj} = \frac{1}{|A_{k}|} \sum_{x \in A_{k}} p(\bar{y}^{j} = 1 \mid x).

\bar{f}(x) = T^{\top} f(x),

\bar{L}(f(x), \bar{y}) = L(\bar{f}(x), \bar{y}) = L(T^{\top} f(x), \bar{y}).

\bar{L}(f(x), \bar{y}) =

\bar{L}_{mse}(f(x), \bar{y}) = \| \bar{y} - T^{\top} f(x) \|_{F}^{2}.

\bar{L}(f(x), \bar{y}) = \bar{L}(f(x), \bar{y}) + \beta \bar{L}_{mse}(f(x), \bar{y}),

\tilde{L}(f(x), \bar{y}, \tilde{y}) = \bar{L}(f(x), \bar{y}) + \| \tilde{y} - f(x) \|_{F}^{2},

p(\bar{y}^{j} = 1 \mid x)
= \sum_{C \in Y', l_{j} \notin C} p(\bar{y}^{j} = 1 \mid Y = C, x) p(Y = C \mid x)
= \sum_{C \in Y', l_{j} \notin C} p(\bar{y}^{j} = 1, Y = C \mid x)
= \sum_{C \in Y', l_{j} \notin C} p(Y = C \mid \bar{y}^{j} = 1, x) p(\bar{y}^{j} = 1 \mid x).

p(\bar{y}^{j} = 1 \mid x)
\geq \sum_{C \in Y', l_{j} \notin C} \sum_{k = 1, k \neq j, l_{k} \in C}^{K} p(y^{k} = 1 \mid \bar{y}^{j} = 1, x) p(\bar{y}^{j} = 1 \mid x)    (\because \sum_{k = 1, l_{k} \notin C}^{K} p(y^{k} = 0 \mid \bar{y}^{j} = 1, x) \geq 0)
= \sum_{C \in Y', l_{j} \notin C} \sum_{k = 1, k \neq j, l_{k} \in C}^{K} p(\bar{y}^{j} = 1 \mid y^{k} = 1, x) p(y^{k} = 1 \mid x)
= \sum_{C \in Y', l_{j} \notin C} \sum_{k = 1, k \neq j}^{K} p(\bar{y}^{j} = 1 \mid y^{k} = 1, x) p(y^{k} = 1 \mid x)    (\because p(y^{k} = 1 \mid x) = 0 if l_{k} \notin Y)
= \sum_{k = 1, k \neq j}^{K} \sum_{C \in Y', l_{j} \notin C} p(\bar{y}^{j} = 1 \mid y^{k} = 1, x) p(y^{k} = 1 \mid x)
= \sum_{k = 1, k \neq j}^{K} (2^{K - 1} - 1) p(\bar{y}^{j} = 1 \mid y^{k} = 1, x) p(y^{k} = 1 \mid x)
\geq \sum_{k = 1, k \neq j}^{K} p(\bar{y}^{j} = 1 \mid y^{k} = 1, x) p(y^{k} = 1 \mid x)
= \sum_{k = 1, k \neq j}^{K} p(\bar{y}^{j} = 1 \mid y^{k} = 1) p(y^{k} = 1 \mid x).

Taxonomy

Topics: Text and Document Classification Technologies · Music and Audio Processing · Machine Learning and Data Classification

Full text

Yi Gao, Miao Xu, and Min-Ling Zhang

Yi Gao is with the School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China and the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. E-mail: [email protected]

Miao Xu is with The University of Queensland, Australia. E-mail: [email protected]

Min-Ling Zhang (corresponding author) is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China and the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. E-mail: [email protected]

Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

Complementary label learning (CLL) requires annotators to give irrelevant labels instead of relevant labels for instances. CLL has shown promising performance on multi-class data by estimating a transition matrix. However, current multi-class CLL techniques do not work well on multi-labeled data, since they assume that each instance is associated with one label, whereas each multi-labeled instance is relevant to multiple labels. Here, we show theoretically how the transition matrix estimated in multi-class CLL can be distorted in multi-labeled cases, because these techniques ignore co-existing relevant labels. Moreover, our theoretical findings reveal that calculating a transition matrix from label correlations in multi-labeled CLL (ML-CLL) requires multi-labeled data, which is unavailable in ML-CLL. To solve this issue, we propose a two-step method to estimate the transition matrix from candidate labels. Specifically, we first estimate an initial transition matrix by decomposing the multi-label problem into a series of binary classification problems; the initial transition matrix is then corrected by label correlations to enforce the addition of relationships among labels. We further show that the proposal is classifier-consistent, and additionally introduce an MSE-based regularizer to alleviate the tendency of the BCE loss to overfit to noise. Experimental results demonstrate the effectiveness of the proposed method.

Index Terms:

Complementary label learning, multi-label learning, transition matrix, label correlations.

1 Introduction

In multi-label learning (MLL), each instance is associated with a set of relevant labels, and the learned classifier aims to predict all relevant labels of unseen instances [1, 2]. MLL is widely used in real-world applications such as text categorization [3, 4] and image retrieval [5]. However, collecting precisely labeled multi-label data is laborious, because the number of relevant labels per instance is unknown and some labels are semantically complex. For the example image in Fig. 1, besides the label Architecture, there are other relevant labels whose accurate annotation requires checking the whole label space one by one; in addition, annotators need specialized geographical and cultural knowledge to accurately label the image as Paris.

To reduce the labor of annotating multi-labeled data, we explore the problem setting of multi-labeled CLL (ML-CLL), where each instance is associated with a single complementary label (a label irrelevant to the instance) instead of multiple relevant labels. Providing such weak supervision eases the labeling process in a large label space, because selecting one complementary label is cheap and requires less domain knowledge than selecting all relevant labels. An example of ML-CLL is given in Fig. 1, where desert is selected as the complementary label. Given the complementary labels, the goal of ML-CLL remains the same as that of fully supervised MLL, i.e., learning a model that accurately predicts multiple relevant labels for unseen instances.
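
The data-generation side of this setting can be sketched as follows. This is a minimal illustration under the uniform-sampling assumption (one irrelevant label drawn uniformly per instance); the function name `sample_complementary_labels` and the toy label matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_complementary_labels(Y, rng):
    """Draw one complementary (irrelevant) label per instance, uniformly at random.

    Y is an (n, K) binary matrix; Y[i, k] == 1 means label k is relevant to instance i.
    Returns an (n, K) one-hot matrix marking the chosen complementary label.
    """
    n, K = Y.shape
    Y_bar = np.zeros((n, K), dtype=int)
    for i in range(n):
        irrelevant = np.flatnonzero(Y[i] == 0)  # indices of labels that do not apply
        if irrelevant.size == 0:
            raise ValueError(f"instance {i} has no irrelevant label to pick")
        Y_bar[i, rng.choice(irrelevant)] = 1
    return Y_bar

rng = np.random.default_rng(0)
# Toy ground truth (never shown to the learner): 3 instances, 4 labels.
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]])
Y_bar = sample_complementary_labels(Y, rng)
# Candidate labels are the complement of the complementary label.
candidates = 1 - Y_bar
```

Every relevant label stays inside the candidate set, but so do the remaining irrelevant labels, which is why the supervision is weak.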

The CLL setting was initially studied in multi-class learning [6, 7, 8, 9, 10, 11, 12]. Previous multi-class CLL approaches are based on an estimated transition matrix that summarizes the probability of a label being selected as a complementary label [6, 7, 8]. Although they achieve promising performance on multi-class data, they are restricted to the case where an instance is associated with only one relevant label. In this case, multi-class CLL approaches consider only the exclusive relationship among labels and ignore that labels can bear other relationships in the multi-labeled case, especially co-occurrence. In fact, relationships among labels are crucial to solving ML-CLL problems, since the selection of a complementary label for an instance in MLL is the combined result of multiple relevant labels rather than of a single relevant label. Applying a technique designed for a single relevant label to the case of multiple relevant labels results in a wrongly estimated transition matrix.

In this paper, we first theoretically analyze how the transition matrix estimated by current multi-class CLL techniques can be distorted in multi-labeled cases. These findings show that estimating the transition matrix in ML-CLL from label correlations requires knowing the relevant labels of instances, which are unavailable. To address this, we propose a two-step method that estimates the transition matrix in ML-CLL from candidate labels, i.e., the complement of the complementary labels. Our strategy (1) estimates an initial transition matrix by decomposing the multi-label problem into binary classification problems, and (2) uses label correlations to correct the initial transition matrix by enforcing the addition of relationships among labels. The fast convergence of the cross-entropy (CE) loss comes from its focus on instances that are difficult to classify, which may cause it to overfit to noisily labeled data. As a type of CE loss, the binary CE (BCE) loss has the same problem. The study in [13] indicates that the mean squared error (MSE) loss is less sensitive to noisy labels than the CE loss. Since the BCE loss is the base loss of our approach, we further introduce an MSE-based regularizer to alleviate its tendency to overfit to noise.
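
A minimal sketch of how such a regularized loss could look, assuming the transition matrix $T$ is already given (the two-step estimation itself is not reproduced here). It follows the forms listed in the equations of this page: the prediction is corrected as $T^{\top} f(x)$, a BCE term is computed against the complementary label, and $\beta$ times a squared-error term is added. The function name `corrected_loss`, the toy inputs, and the clipping constant `eps` are assumptions made for this illustration.

```python
import numpy as np

def corrected_loss(f_x, y_bar, T, beta=1.0, eps=1e-12):
    """BCE on transition-corrected outputs plus beta times an MSE regularizer.

    f_x   : (n, K) predicted relevance probabilities f(x)
    y_bar : (n, K) one-hot complementary labels
    T     : (K, K) transition matrix, T[k, j] ~ p(complementary = j | label k relevant)
    """
    f_bar = f_x @ T                               # row-wise T^T f(x)
    p = np.clip(f_bar, eps, 1.0 - eps)            # keep the logarithms finite
    bce = -(y_bar * np.log(p) + (1 - y_bar) * np.log(1 - p)).sum(axis=1)
    mse = ((y_bar - f_bar) ** 2).sum(axis=1)      # squared norm of y_bar - T^T f(x)
    return float((bce + beta * mse).mean())

# Toy setup: K = 3 labels and a uniform transition matrix (an assumption for the demo).
T = np.full((3, 3), 0.5)
np.fill_diagonal(T, 0.0)
y_bar = np.array([[0, 0, 1]])                     # label 2 is the complementary label
f_good = np.array([[0.9, 0.8, 0.1]])              # low score on the complementary label
f_bad = np.array([[0.1, 0.1, 0.9]])               # high score on the complementary label
loss_good = corrected_loss(f_good, y_bar, T)
loss_bad = corrected_loss(f_bad, y_bar, T)
```

A prediction that scores the complementary label highly is penalized by both terms, so `loss_bad` exceeds `loss_good` in this toy case.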

In addition, we show that the proposed ML-CLL approach can be easily combined with learning from relevant labels, which significantly extends its application scenarios. This combination is particularly useful when, for example, labels are collected via crowdsourcing [14] and crowdworkers are asked to randomly select a complementary label and one or more relevant labels for an instance. Experimental results on various datasets demonstrate the effectiveness of the proposed approach. In particular, when each instance is equipped with only one complementary label and one relevant label, our proposal performs well, even comparably to training on fully supervised data. Our main contributions are summarized as follows:

• We theoretically analyze the distortion of the transition matrix estimated by multi-class CLL in multi-labeled cases, which arises because multi-class CLL techniques ignore the co-existence of relevant labels. Our theoretical findings reveal that multi-labeled data is indispensable for calculating the transition matrix from label correlations.

• To solve the problem of unavailable multi-labeled data, we propose a two-step method to estimate the transition matrix from candidate labels. Moreover, we show theoretically that the proposed approach is classifier-consistent under a mild assumption.

• We introduce a practical strategy, MSE-based regularization, to alleviate the overfitting tendency of the BCE loss. Our empirical study shows that the proposal obtains performance comparable with state-of-the-art baselines, which demonstrates the effectiveness of our approach.

The rest of this paper is organized as follows. Section 2 briefly reviews work related to ML-CLL. We then formalize the ML-CLL problem in Section 3, and analyze it theoretically and describe our approach in Section 4. In Section 5, we introduce an MSE-based regularization and show how to adapt our method to accommodate a small number of additional relevant labels. The experimental results are given in Section 6, and we conclude in Section 7.

2 Related Work

In this section, we briefly review work related to ML-CLL, including MLL, partial multi-label learning (PML) and multi-class CLL.

2.1 Multi-Label Learning

MLL aims to train a classifier that predicts a set of relevant labels for an unseen instance, where each training instance is associated with multiple relevant labels simultaneously. According to the order of label correlations they exploit, previous studies can be grouped into three categories [15, 16, 17, 18]: first-order approaches [19, 20, 21], second-order approaches [22, 23] and high-order approaches [24, 25]. First-order approaches decompose an MLL problem into a set of binary classification problems [19, 20]. However, they ignore label correlations, which play a crucial role in MLL [15]. As the importance of label correlations has become clear, more and more studies exploit them to improve MLL performance. Among them, second-order approaches consider pairwise label correlations, i.e., the relationship between two labels. These approaches generally transform MLL problems into bipartite ranking problems by enforcing that relevant labels are ranked higher than irrelevant labels [26, 27, 23]. Beyond second-order relationships, more complex relationships among labels exist in many real-world scenarios. Therefore, many recent approaches exploit high-order label correlations to handle MLL problems [28, 24, 29, 30]. For example, Zhao et al. [30] leverage a variational autoencoder to facilitate learning by exploiting high-order correlations among labels, while Wang et al. and Xun et al. [31, 32] both design special neural network blocks that automatically extract label correlations to improve label prediction. Although high-order approaches model label correlations more strongly, they may incur a higher computational cost than first-order and second-order approaches [33].
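
The first-order strategy described above can be sketched as one independent logistic regressor per label. This is a generic illustration of the decomposition, not the implementation of any cited method; the function names, learning rate, epoch count, and toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary_relevance(X, Y, lr=0.5, epochs=500):
    """Train one independent logistic regressor per label (first-order strategy)."""
    n, d = X.shape
    K = Y.shape[1]
    Xb = np.hstack([X, np.ones((n, 1))])      # append a bias column
    W = np.zeros((d + 1, K))                  # column k holds the weights for label k
    for _ in range(epochs):
        P = sigmoid(Xb @ W)
        # Gradient of the mean BCE loss; columns of W never interact,
        # so this is K separate binary problems solved side by side.
        W -= lr * Xb.T @ (P - Y) / n
    return W

def predict(X, W, threshold=0.5):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (sigmoid(Xb @ W) >= threshold).astype(int)

# Toy data: label 0 follows feature 0, label 1 follows feature 1.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = fit_binary_relevance(X, Y)
Y_hat = predict(X, W)
```

Because each column of `W` is fitted in isolation, nothing in this sketch can exploit co-occurrence between labels, which is the limitation the second-order and high-order approaches target.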

2.2 Partial Multi-Label Learning

Because fully supervised data is difficult to collect, many researchers explore weakly supervised forms of data to alleviate the heavy burden of collecting labeled data [34]. PML is a recently emerging weakly supervised approach first proposed by Xie et al. [35]. In PML, each training instance is associated with a set of candidate labels consisting of relevant labels and irrelevant (noisy) labels, and the goal is to learn a classifier that accurately assigns a set of labels to unseen instances.

At first glance, ML-CLL seems to be an extreme case of PML, so that all PML methods would also be applicable to ML-CLL. However, existing PML methods assume that noisy labels make up only a small portion of the candidate labels [36, 37, 38, 33]; accordingly, many approaches [37, 38, 33] adopt matrix factorization to tackle PML problems, decomposing the candidate label matrix into a low-rank multi-label matrix and a sparse noisy label matrix. In contrast, the ML-CLL problem studied in this paper provides only one complementary label per instance, resulting in a high-noise PML problem to which existing approaches are not applicable. We demonstrate the performance difference in the experimental part.

2.3 Multi-Class Complementary Label Learning

Currently, the CLL problem is considered only in multi-class learning, where the goal is to predict a single relevant label per instance precisely from complementary labeled data. Previous approaches can be roughly grouped into two categories: (1) modeling the generative relationship between the complementary label and the relevant label [6, 12, 7, 8, 39]; (2) modeling the probability of complementary labels from the learned discriminative classifier directly [10, 9, 11].

The first multi-class CLL method belongs to category one. It models the generative relationship between complementary labels and relevant labels, and uses this generative process to rewrite one-versus-all and pairwise comparison loss functions to derive an unbiased risk estimator [6]. Ishida et al. [7] observe that the method of [6] is restricted to particular loss functions and propose a new method that can use arbitrary losses and models. A typical way to make use of the modeled generative process is through a transition matrix, which summarizes the probabilities of a label being the complementary label when the relevant labels are given. Such approaches then apply the transition matrix to recover relevant labels from complementary labels [8, 7, 39]. Compared with [6, 7], transition matrix-based methods can capture more complex generative relationships rather than only the uniform one. Therefore, we design a transition matrix-based method to solve the ML-CLL problem, with a different way of estimating the matrix.

Unlike category one, approaches in category two directly model the probabilities of complementary labels from the learned classifier without the generative relationship [9, 10, 11]. Chou et al. propose a surrogate complementary loss framework based on complementary labels providing negative feedback during the training process [9]. Although its losses fail to derive an unbiased risk estimator, it achieves good performance on multi-class CLL. In light of the property that the predictive probability of the complementary label is expected to approach zero, [10] and [11] propose a discriminative solution that directly models the probabilities of complementary labels from the learned classifier to avoid the generative assumption. Because multi-class CLL approaches are designed for the case of a single relevant label, they are not suitable for the ML-CLL case, in which an instance is associated with multiple labels simultaneously. We will demonstrate this in the experimental part.

3 Problem Setup

In MLL, let $\mathcal{X}$ be the feature space and $\mathcal{Y}=\{l_{1},l_{2},\dots,l_{K}\}$ be the finite label space with $K$ possible class labels ( $K>2$ ). A multi-label instance $\bm{x}\in\mathcal{X}$ is equipped with a set of relevant labels $Y\subseteq\mathcal{Y}$ . $(\bm{x},Y)$ is independently sampled from an unknown joint probability distribution $p(\bm{x},Y)$ . Here we exclude the special cases $Y=\emptyset$ and $Y=\mathcal{Y}$ to ensure that relevant labels and complementary labels both exist. For convenience, we use a binary vector $\bm{y}=[y^{1},y^{2},\dots,y^{K}]\in\{0,1\}^{K}$ to denote $Y$ , where $y^{k}=1$ indicates that $l_{k}\in Y$ is relevant to $\bm{x}$ and $y^{k}=0$ otherwise. Suppose $D=\{(\bm{x}_{i},\bm{y}_{i})\}^{n}_{i=1}\stackrel{{\scriptstyle\text{ i.i.d. }}}{{\sim}}p(\bm{x},Y)$ is the training set with $n$ instances. The goal of MLL is to learn a multi-label classifier $h:\mathcal{X}\rightarrow 2^{\mathcal{Y}}$ , which can predict a set of relevant labels for any unseen instance. Instead of learning $h$ directly, most MLL methods learn a real-valued decision function $\bm{f}:\mathcal{X}\rightarrow\mathbb{R}^{K}$ by minimizing the expected risk

$R_{L}(\bm{f})=\mathbb{E}_{(\bm{x},Y)\sim p(\bm{x},Y)}\big[L(\bm{f}(\bm{x}),\bm{y})\big]$ , (1)

where $L$ is a proper MLL loss function [30], such as the BCE loss. $\bm{f}(\bm{x})$ is usually interpreted as a probability vector: $f^{k}(\bm{x})$ is the $k$ -th entry of $\bm{f}(\bm{x})$ and gives the confidence score that label $l_{k}$ is relevant to $\bm{x}$ , i.e., an estimate of $p(y^{k}=1|\bm{x})$ if properly normalized. Because $p(\bm{x},Y)$ is unknown, the expected risk is usually approximated by the empirical risk $\widehat{R}_{L}(\bm{f})=\frac{1}{n}\sum_{i=1}^{n}L(\bm{f}(\bm{x}_{i}),\bm{y}_{i})$ . Denoting the optimal classifier learned from the expected risk by $\bm{f}^{*}$ , i.e., $\bm{f}^{*}=\mathrm{argmin}_{\bm{f}}\;R_{L}(\bm{f})$ , we let $\widehat{\bm{f}}^{*}$ denote the optimal classifier learned by minimizing the empirical risk, i.e., $\widehat{\bm{f}}^{*}=\mathrm{argmin}_{\bm{f}}\;\widehat{R}_{L}(\bm{f})$ .

In the ML-CLL setting studied in this paper, each training instance is equipped with a single complementary label. The complementary labeled instance $(\bm{x},\bar{y})\in(\mathcal{X},\mathcal{Y})$ is drawn from an unknown joint probability distribution $p(\bm{x},\bar{y})$ , where $\bar{y}\in\mathcal{Y}\setminus Y$ is a complementary label of $\bm{x}$ . $\bar{y}$ can be represented as a $K$ -dimensional vector $\bm{\bar{y}}=[\bar{y}^{1},\bar{y}^{2},\dots,\bar{y}^{K}]$ . If label $l_{j}$ is selected as the complementary label of $\bm{x}$ ( $\bar{y}=l_{j}$ ), then $\bar{y}^{j}$ is one and all other elements of $\bm{\bar{y}}$ are zero. We use $\widehat{Y}=\mathcal{Y}\setminus\{\bar{y}\}$ to denote the candidate label set of $\bm{x}$ . Let the $K$ -dimensional vector $\bm{\widehat{y}}=[\widehat{y}^{1},\widehat{y}^{2},\dots,\widehat{y}^{K}]$ be the vector representation of $\widehat{Y}$ , where all elements are one except the one corresponding to the complementary label, which is zero ( $\bm{\widehat{y}}=\bm{1}-\bm{\bar{y}}$ ).
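The two vector encodings described above can be illustrated with a minimal sketch. The function name `complementary_vectors` and the 0-based label index are illustrative assumptions, not notation from the paper.

```python
def complementary_vectors(j, K):
    """Build the one-hot complementary vector and the candidate vector.

    j is the 0-based index of the complementary label (an assumption of this
    sketch; the paper indexes labels from 1), K is the number of labels.
    """
    if not 0 <= j < K:
        raise ValueError("complementary label index out of range")
    y_bar = [1 if k == j else 0 for k in range(K)]  # one-hot complementary vector
    y_hat = [1 - v for v in y_bar]                  # candidate vector: 1 - y_bar
    return y_bar, y_hat


y_bar, y_hat = complementary_vectors(j=2, K=5)
print(y_bar)  # [0, 0, 1, 0, 0]
print(y_hat)  # [1, 1, 0, 1, 1]
```

Every label except the complementary one remains a candidate, which is why the candidate vector contains $K-1$ ones regardless of how many labels are truly relevant.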

Let $\bar{D}=\{(\bm{x}_{i},\bar{y}_{i})\}_{i=1}^{n}\stackrel{{\scriptstyle\text{ i.i.d. }}}{{\sim}}p(\bm{x},\bar{y})$ be the ML-CLL training set with $n$ instances. The expected risk of multi-labeled CLL is defined over $p(\bm{x},\bar{y})$ :

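$R_{\bar{L}}(\bm{f})=\mathbb{E}_{(\bm{x},\bar{y})\sim p(\bm{x},\bar{y})}\big[\bar{L}(\bm{f}(\bm{x}),\bm{\bar{y}})\big]$ , (2)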

where $\bar{L}$ denotes an ML-CLL loss, which is proposed later in this paper. Similarly, the corresponding empirical risk is $\widehat{R}_{\bar{L}}(\bm{f})=\frac{1}{n}\sum_{i=1}^{n}\bar{L}(\bm{f}(\bm{x}_{i}),\bm{\bar{y}}_{i})$ .

4 The Proposed Approach

In this section, we first introduce the definition of the transition matrix in MLL and analyze why the estimated transition matrix using multi-class techniques is unsuitable for ML-CLL. Then, we describe an advanced two-step way to estimate the transition matrix in the MLL case. Finally, we prove our approach is classifier-consistent with a mild assumption.

4.1 Transition Matrix for ML-CLL

In ML-CLL, we start by introducing a transition matrix $\mathbf{\tilde{T}}$ that summarizes the probabilities for a complementary label given a set of relevant labels. More specifically, the transition matrix $\mathbf{\tilde{T}}$ is defined as $\mathbf{\tilde{T}}_{kj}=p(\bar{y}^{j}=1|Y=C_{k})$ where $C_{k}\in\mathcal{Y}^{\prime}=\{2^{\mathcal{Y}}-\emptyset-\mathcal{Y}\}$ ( $k\in[2^{K}-2]$ ) is the $k$ -th label subset. If $l_{j}\in C_{k}$ , then $\mathbf{\tilde{T}}_{kj}=0$ because the label $l_{j}$ has no chance to be selected as the complementary label. In this paper, we employ the same class-dependent assumption as the multi-class CLL approach [8]: $p(\bar{y}|Y,\bm{x})=p(\bar{y}|Y)$ as $\bar{y}$ and $\bm{x}$ are conditionally independent given $Y$ . Then we can obtain the following equation:

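$p(\bar{y}^{j}=1|\bm{x})=\sum_{C\in\mathcal{Y}^{\prime}}p(\bar{y}^{j}=1|Y=C)\,p(Y=C|\bm{x})=\sum_{k=1}^{2^{K}-2}\mathbf{\tilde{T}}_{kj}\,p(Y=C_{k}|\bm{x})$ , (3)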

where we assume the label $l_{j}$ is a complementary label of $\bm{x}$ . Then, according to Eq.(3), $p(\bar{y}|\bm{x})$ can be approximated from $p(Y|\bm{x})$ when the transition matrix $\mathbf{\tilde{T}}$ is known. Letting $C$ range over all label subsets in $\mathcal{Y}^{\prime}$ , we have $\mathbf{\tilde{T}}\in\mathbb{R}^{(2^{K}-2)\times K}$ , i.e., the number of rows of $\mathbf{\tilde{T}}$ equals the size of $\mathcal{Y}^{\prime}$ , which is exponential in $K$ . In practice, such a matrix would be computationally prohibitive and even impossible to store, since $2^{K}-2$ is extremely large when the number of possible labels $K$ is large. To avoid this combinatorial explosion, we explore a more practical alternative: replacing the higher-dimensional transition matrix with a lower-dimensional one. We start investigating the feasibility of the lower-dimensional matrix from Theorem 1.
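To make the storage argument concrete, the following sketch counts the entries of a $(2^{K}-2)\times K$ matrix and of a $K\times K$ matrix. The helper name `transition_matrix_sizes` is hypothetical; the values $K=6$ and $K=14$ are the label counts of scene and yeast in Table I, and $K=30$ is an arbitrary larger value chosen only for illustration.

```python
def transition_matrix_sizes(K):
    """Return (#entries of the subset-indexed matrix, #entries of the K x K matrix)."""
    rows_full = 2 ** K - 2          # all label subsets except the empty set and the full set
    return rows_full * K, K * K


for K in (6, 14, 30):
    full, low = transition_matrix_sizes(K)
    print(f"K={K}: subset-indexed entries={full}, K x K entries={low}")
```

Already at $K=14$ the subset-indexed matrix has more than two hundred thousand entries, while the $K\times K$ matrix has 196.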

Theorem 1.

Given an instance $\bm{x}$ , suppose $Y$ is the relevant label set and the label $l_{j}$ is the complementary label which is randomly selected. Then the following equality holds:

[TABLE]

The second inequality holds because of the addition rule of probability. The detailed proof is in Appendix A. Theorem 1 shows that the approximation of $p(\bm{\bar{y}}|\bm{x})$ obtained with $\mathbf{T}$ is a lower bound of the one obtained with $\mathbf{\tilde{T}}$ . From Eq.(3), our main goal transforms from precisely predicting the relevant label set $Y$ of $\bm{x}$ to precisely predicting its complementary label $\bar{y}$ via the transition matrix $\mathbf{\tilde{T}}$ . This means that we need to maximize the predictive probability of the complementary label of $\bm{x}$ , i.e., maximize $p(\bar{y}|\bm{x})$ . From this point of view, Theorem 1 theoretically shows the feasibility of using a low-dimensional transition matrix to replace the high-dimensional $\mathbf{\tilde{T}}$ , because we then optimize by maximizing a lower bound of Eq.(3). Let $\mathbf{T}\in[0,1]^{K\times K}$ denote the lower-dimensional transition matrix, where the $(k,j)$ -th element of $\mathbf{T}$ is $\mathbf{T}_{kj}=p(\bar{y}^{j}=1|y^{k}=1)$ , and $\mathbf{T}_{kj}=0$ when $k=j$ . Thus, we adopt the $K\times K$ matrix $\mathbf{T}$ as the transition matrix in the remainder of the paper to avoid the computation and storage burden of the $(2^{K}-2)\times K$ matrix $\mathbf{\tilde{T}}$ .

4.2 Distortion in Estimating the Transition Matrix

Before exploring how the transition matrix estimated by multi-class CLL is distorted from that of ML-CLL, we first introduce the transition matrix estimated by multi-class CLL techniques. Let $\mathbf{Q}\in[0,1]^{K\times K}$ be the transition matrix estimated in multi-class CLL. The approach of [8] estimates the transition matrix under a special assumption: for each label $l_{k}$ , there exists an anchor set $\mathcal{S}_{\bm{x}|l_{k}}\subset\mathcal{X}$ such that $p(y^{k}=1|\bm{x})=1$ and $p(y^{k^{\prime}}=1|\bm{x})=0$ ( $l_{k^{\prime}}\in\mathcal{Y}\setminus\{l_{k}\}$ ). With this assumption and regardless of label correlations, the estimate of $\mathbf{Q}_{kj}$ is $p(\bar{y}^{j}=1|y^{k}=1)=p(\bar{y}^{j}=1|\bm{x})$ iff $\bm{x}$ is sampled from $\mathcal{S}_{\bm{x}|l_{k}}$ , where $\mathbf{Q}_{kj}$ is the element in the $k$ -th row and $j$ -th column of $\mathbf{Q}$ .

To measure the distortion between $\mathbf{T}$ calculated in ML-CLL and the estimated $\mathbf{Q}$ , we define their difference on the complementary label $l_{j}$ of $\bm{x}$ as follows

[TABLE]

A larger value of $\sum_{j=1}^{K}\ell_{j}$ indicates that $\mathbf{T}$ deviates further from $\mathbf{Q}$ . Label correlations and co-occurring labels are key properties of MLL. Because the correlations among labels are intricate, directly calculating $\mathbf{T}$ from all label correlations incurs a high computational cost. For convenience, we consider a simple case of MLL with label correlations, in which at most two labels can co-occur for an instance and the remaining labels are mutually exclusive. This case lets us calculate $\mathbf{T}$ from label correlations and explore the distortion between $\mathbf{T}$ and $\mathbf{Q}$ . We start from the definition of mutual exclusivity.

Definition 2.

If, for any $\bm{x}\in\mathcal{X}$ , only one label is relevant to $\bm{x}$ , i.e., $|Y|=1$ , then the labels are mutually exclusive.

For this simple MLL case, Theorem 3 states how to estimate $\mathbf{T}$ directly from label correlations and characterizes the distortion between $\mathbf{T}$ and $\mathbf{Q}$ .

Theorem 3.

Consider an MLL scenario in which the labels $l_{z_{1}},l_{z_{2}}\in\mathcal{Y}$ ( $z_{1},z_{2}\in[K],z_{1}\neq z_{2}$ ) are dependent, and the labels belonging to $\mathcal{Y}\setminus\{l_{z_{1}},l_{z_{2}}\}$ are mutually exclusive. For any $\bm{x}$ , its label set satisfies $Y\subseteq\{l_{z_{1}},l_{z_{2}}\}$ and $Y\neq\emptyset$ . Let the label $l_{j}$ ( $j\in[K],j\neq z_{1},z_{2}$ ) be the complementary label of $\bm{x}\in\mathcal{X}$ . Then $\mathbf{T}_{z_{1}j}$ and $\mathbf{T}_{z_{2}j}$ calculated from label correlations satisfy

[TABLE]

where $[K]$ denotes the integer set $\{1,2,\dots,K\}$ . The difference of $\mathbf{T}$ and $\mathbf{Q}$ on the complementary label $l_{j}$ is

[TABLE]

where $\xi=\max\{p(y^{z_{2}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\bm{x}),p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\bm{x})\}$ .

The proof is provided in Appendix B. From Theorem 3, we can see that calculating the transition matrix from label correlations is more complex than estimating one without label correlations, and that it requires the relevant label sets of instances to be known. Moreover, Theorem 3 shows that there is a distortion between $\mathbf{T}$ and $\mathbf{Q}$ , which is widespread in multi-labeled cases since each multi-label instance is relevant to multiple labels. The above learning scenario only considers pairwise label correlations, whereas more complex relationships among labels exist. Similarly, at a manageable computational cost, we construct another simple MLL scenario with more complex label relationships to explore the factors that affect $\ell_{j}$ in Corollary 4.

Corollary 4.

Consider an MLL scenario in which $m$ ( $m\geq 2$ ) labels $l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\in\mathcal{Y}$ $(z_{1},\dots,z_{m}\in[K])$ are dependent, while the labels belonging to $\mathcal{Y}\setminus\{l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\}$ are mutually exclusive. For any $\bm{x}\in\mathcal{X}$ , its relevant label set satisfies $Y\subseteq\{l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\}$ and $Y\neq\emptyset$ . Suppose the label $l_{j}$ is the complementary label of $\bm{x}$ . The difference $\ell_{j}$ between $\mathbf{T}$ and $\mathbf{Q}$ satisfies

[TABLE]

where $\xi=\mathrm{max}\{p(y^{z_{m}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\dots,y^{z_{m-1}}=1,\bm{x}),p(y^{z_{m-1}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\dots,y^{z_{m-2}}=1,y^{z_{m}}=1,\bm{x}),\dots,p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\dots,y^{z_{m}}=1,\bm{x})\}$ $(\xi\in(0,1])$ .

The proof is given in Appendix C. According to Corollary 4, the distortion of the transition matrix estimated by the multi-class CLL approach becomes more serious as $m$ increases, i.e., as label correlations become more complex. Meanwhile, it demonstrates that the ML-CLL problem cannot be solved by current techniques in multi-class CLL.

4.3 Estimating $\mathbf{T}$ with Label Correlations

As discussed above, calculating the transition matrix $\mathbf{T}$ from label correlations needs instances whose relevant label sets are known. Moreover, as the results for $\mathbf{T}$ in Theorem 3 and Corollary 4 show, calculating $\mathbf{T}$ becomes increasingly difficult as the relationships among labels become more complex. Because multi-labeled data are unavailable in our setting, we propose a two-step method to estimate $\mathbf{T}$ from candidate labels, which also reduces the complexity of calculating $\mathbf{T}$ from label correlations. This two-step method includes: (1) computing an initial transition matrix $\mathbf{S}\in[0,1]^{K\times K}$ from candidate labels by decomposing the multi-label problem into a series of binary classification problems; (2) obtaining the final estimate of $\mathbf{T}$ by using label correlations to correct $\mathbf{S}$ .

Computing an initial transition matrix $\mathbf{S}$ . Let $\mathbf{S}_{kj}=p(\bar{y}^{j}=1|\widehat{y}^{k}=1)$ be an initial transition probability, which is the $(k,j)$ -th element of $\mathbf{S}$ . We calculate $\mathbf{S}$ from the candidate labels of instances. The multiplication theorem of probability (footnote 1: $p(\bm{x},\bar{y}^{j}=1,\widehat{y}^{k}=1)=p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})p(\bm{x}|\widehat{y}^{k}=1)p(\widehat{y}^{k}=1)=p(\bar{y}^{j}=1|\widehat{y}^{k}=1)p(\bm{x}|\bar{y}^{j}=1,\widehat{y}^{k}=1)p(\widehat{y}^{k}=1)\Rightarrow p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})p(\bm{x}|\widehat{y}^{k}=1)=p(\bar{y}^{j}=1|\widehat{y}^{k}=1)p(\bm{x}|\bar{y}^{j}=1,\widehat{y}^{k}=1)$ ) is applied to calculate $\mathbf{S}_{kj}$ and ensures that the following equation holds:

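$\mathbf{S}_{kj}=p(\bar{y}^{j}=1|\widehat{y}^{k}=1)=\int p(\bar{y}^{j}=1|\widehat{y}^{k}=1)\,p(\bm{x}|\bar{y}^{j}=1,\widehat{y}^{k}=1)\,d\bm{x}=\int p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})\,p(\bm{x}|\widehat{y}^{k}=1)\,d\bm{x}=\mathbb{E}_{p(\bm{x}|\widehat{y}^{k}=1)}[p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})]$ ,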

where $j,k\in[K]$ and $j\neq k$ . In practice, $\mathbb{E}_{p(\bm{x}|\widehat{y}^{k}=1)}[p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})]$ can be approximated by the empirical mean of $p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})$ over instances drawn from the conditional distribution $p(\bm{x}|\widehat{y}^{k}=1)$ . Assuming that $\bar{y}$ and $\widehat{Y}$ are conditionally independent given $\bm{x}$ , we have $p(\bar{y}^{j}=1|\widehat{y}^{k}=1,\bm{x})=p(\bar{y}^{j}=1|\bm{x})$ . Intuitively, $p(\bar{y}^{j}=1|\bm{x})$ can be approximated by a classifier learned from $\bar{D}$ to predict the probability of complementary labels. Let $A_{k}$ denote the subset of instances $\bm{x}$ in $\bar{D}$ with $\widehat{y}^{k}=1$ , which follow the conditional distribution $p(\bm{x}|\widehat{y}^{k}=1)$ . Thus, $\mathbf{S}_{kj}$ can be estimated by

$\mathbf{S}_{kj}=\frac{1}{|A_{k}|}\sum_{\bm{x}\in A_{k}}p(\bar{y}^{j}=1|\bm{x})$ .
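The estimation step described above, averaging the predicted complementary-label probabilities over the instances in $A_{k}$ , can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `probs[i][j]` stands for the output of some classifier trained on $\bar{D}$ that approximates $p(\bar{y}^{j}=1|\bm{x}_{i})$ , `comp_labels[i]` is the 0-based complementary label of instance $i$, and the function name is hypothetical. The diagonal is left at zero because the text defines $\mathbf{S}_{kj}$ only for $j\neq k$ .

```python
def estimate_initial_matrix(probs, comp_labels, K):
    """Estimate S[k][j] as the mean of probs[i][j] over instances whose
    candidate set contains label k (i.e. whose complementary label is not k)."""
    S = [[0.0] * K for _ in range(K)]
    for k in range(K):
        # A_k: indices of instances with candidate-vector entry k equal to one
        A_k = [i for i, c in enumerate(comp_labels) if c != k]
        if not A_k:
            continue  # no instance has label k as a candidate; leave the row at zero
        for j in range(K):
            if j != k:
                S[k][j] = sum(probs[i][j] for i in A_k) / len(A_k)
    return S


# Toy classifier outputs for three instances and K = 3 labels (made-up numbers).
probs = [[0.2, 0.3, 0.5],
         [0.6, 0.1, 0.3],
         [0.1, 0.7, 0.2]]
S = estimate_initial_matrix(probs, comp_labels=[2, 0, 1], K=3)
for row in S:
    print(row)
```

Since each instance carries a single complementary label, every $A_{k}$ contains all instances except those whose complementary label is $l_{k}$ , so the rows of $\mathbf{S}$ are averages over heavily overlapping subsets.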

Estimating $\mathbf{T}$ with label correlations. The procedure for calculating $\mathbf{S}$ does not use exactly supervised data. From the transition probabilities of $\mathbf{T}$ calculated from label correlations in Subsection 4.2, we can see that they are affected by label correlations. Moreover, from the view of label correlations, a label that rarely co-occurs with the relevant labels is more likely to be selected as the complementary label. For example, when water is the relevant label, desert (a label that rarely co-occurs with it) has a larger chance of being selected as the complementary label than fish (a label that frequently co-occurs with it). Motivated by these findings, we use label correlations to correct the initial matrix $\mathbf{S}$ and thereby estimate $\mathbf{T}$ , incorporating the relationships among labels.

Let $\mathbf{C}\in[0,1]^{K\times K}$ be a label correlation matrix, where the element $\mathbf{C}_{kj}$ represents the correlation between labels $l_{k}$ and $l_{j}$ . The stronger the correlation between $l_{k}$ and $l_{j}$ , the larger the value of $\mathbf{C}_{kj}$ . Following [35, 40], we adopt the co-occurrence rate of two candidate labels as their correlation. Finally, the transition matrix $\mathbf{T}$ is estimated by $\mathbf{\widehat{T}}=\mathbf{S}\mathbf{C}^{T}$ , where $\mathbf{\widehat{T}}_{kj}=0$ if $k=j$ , and $\mathbf{\widehat{T}}$ is then normalized by row.

Fig. 2 gives an example of the refining procedure. As can be seen from Fig. 2, although the estimated initial probability $p(\bar{y}^{2}=1|\widehat{y}^{1}=1)$ is higher than $p(\bar{y}^{3}=1|\widehat{y}^{1}=1)$ in $\mathbf{S}$ , the value of $p(\bar{y}^{2}=1|y^{1}=1)$ is lower than $p(\bar{y}^{3}=1|y^{1}=1)$ in $\mathbf{\widehat{T}}$ . This is because the labels $l_{1}$ and $l_{2}$ have a strong correlation as shown in $\mathbf{C}$ , so the label $l_{2}$ has a lower chance of being selected as the complementary label for the label $l_{1}$ . The corrected transition matrix agrees with our expectation that labels with low co-occurrence tend to be selected as complementary labels preferentially. In practice, the estimation of $\mathbf{T}$ depends on $p(\bar{y}|\bm{x})$ , which the classifier should model accurately. When sufficient data equipped with complementary labels is available, the model is capable of modeling $p(\bar{y}|\bm{x})$ .
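The three operations named in this paragraph (the product $\mathbf{S}\mathbf{C}^{T}$ , zeroing the diagonal, and row normalization) can be sketched as below. The sketch reads $\mathbf{S}\mathbf{C}^{T}$ as an ordinary matrix product, which is an assumption about the notation, and it takes $\mathbf{C}$ as given rather than computing the co-occurrence rate, whose exact definition follows [35, 40] and is not reproduced here. The function name and the toy matrices are hypothetical.

```python
def correct_with_correlations(S, C):
    """Return the row-normalized matrix S C^T with a zero diagonal."""
    K = len(S)
    T_hat = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for j in range(K):
            if k != j:
                # (S C^T)[k][j] = sum over m of S[k][m] * C[j][m]
                T_hat[k][j] = sum(S[k][m] * C[j][m] for m in range(K))
        row_sum = sum(T_hat[k])
        if row_sum > 0:
            T_hat[k] = [v / row_sum for v in T_hat[k]]  # normalize by row
    return T_hat


# Toy inputs (made-up numbers): an initial matrix S and a symmetric correlation matrix C.
S = [[0.0, 0.5, 0.35],
     [0.4, 0.0, 0.4],
     [0.35, 0.4, 0.0]]
C = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
for row in correct_with_correlations(S, C):
    print(row)
```

After the correction each row is a probability distribution over the labels other than $l_{k}$ , matching the requirement that a label cannot be its own complementary label.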

4.4 A Classifier-Consistent Approach

According to the transition matrix $\mathbf{T}$ , we can derive the probability of complementary labels from the multi-label classifier. Let $\bm{\bar{f}}(\bm{x})\in\mathbb{R}^{K}$ be a complementary label classifier, which is defined as

$\bm{\bar{f}}(\bm{x})=\mathbf{T}^{T}\bm{f}(\bm{x})$ ,

where $\bm{\bar{f}}(\bm{x})$ is applied to approximate $p(\bm{\bar{y}}|\bm{x})$ , $\bar{f}^{j}(\bm{x})$ refers to the $j$ -th element of $\bm{\bar{f}}(\bm{x})$ . ML-CLL problems aim to recover a set of relevant labels per instance from a complementary label. Since training instances are associated with complementary labels, the common loss functions of MLL are unsuitable for ML-CLL. Therefore, we define a complementary loss function $\bar{L}$ as

[TABLE]

Denote by $\bm{f}_{CL}^{*}$ the minimizer of $R_{\bar{L}}(\bm{f})$ ; the minimizer $\widehat{\bm{f}}^{*}_{CL}$ of $\widehat{R}_{\bar{L}}(\bm{f})$ is used to approximate $\bm{f}_{CL}^{*}$ . Recall the definition of classifier-consistency: if the classifier learned by an approach converges to the optimal classifier $\bm{f}^{*}$ learned in MLL as the number of instances increases, then this approach is classifier-consistent [41, 42, 43]. We show that our proposal is classifier-consistent under a mild assumption:

Assumption 5.

Suppose the transition matrix $\mathbf{T}$ is invertible and can perfectly recover the relationship between relevant labels of $\bm{x}$ and its complementary label. Then, we have $\bm{\bar{y}}=\mathbf{T}^{T}\bm{y}$ .

With Assumption 5, our approach trained on $\bar{L}$ can be inferred to be classifier-consistent, which is stated in Theorem 6. Naturally, Theorem 6 guarantees that the optimal classifier learned from complementary labeled data converges to the optimal one learned from fully supervised MLL.

Theorem 6.

With Assumption 5, suppose the transition matrix $\mathbf{T}$ is invertible, then the ML-CLL optimal classifier $\bm{f}_{CL}^{*}$ converges to the MLL optimal classifier $\bm{f}^{*}$ , i.e., $\bm{f}_{CL}^{*}=\bm{f}^{*}$ .

The proof is presented in Appendix D. Since the BCE loss is a popular loss function in MLL, we adopt it as the base loss in this paper; $\bar{L}$ is then expressed as

[TABLE]

where $\bm{1}$ denotes a $K$ -dimensional vector with 1 for all elements.
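As a rough numerical illustration of a BCE-based complementary loss, the sketch below maps multi-label scores to complementary-label scores through the transposed transition matrix, in line with the relation $\bm{\bar{y}}=\mathbf{T}^{T}\bm{y}$ of Assumption 5, and then evaluates binary cross-entropy against the one-hot complementary vector. Both the mapping $\bm{\bar{f}}(\bm{x})=\mathbf{T}^{T}\bm{f}(\bm{x})$ and the exact form of the loss are assumptions of this sketch; the function names, the clipping constant `eps`, and the toy numbers are hypothetical.

```python
import math


def complementary_scores(T, f):
    """Compute f_bar[j] = sum_k T[k][j] * f[k], i.e. the vector T^T f."""
    K = len(f)
    return [sum(T[k][j] * f[k] for k in range(K)) for j in range(K)]


def complementary_bce(T, f, y_bar, eps=1e-7):
    """Binary cross-entropy between T^T f and the one-hot complementary vector."""
    loss = 0.0
    for p, t in zip(complementary_scores(T, f), y_bar):
        # Clip to (0, 1): T^T f may leave that range because the entries of f
        # need not sum to one in the multi-label case.
        p = min(max(p, eps), 1.0 - eps)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss


# Uniform transition matrix for K = 3 and scores saying labels 0 and 1 are relevant.
T = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
f = [0.9, 0.8, 0.1]
print(complementary_scores(T, f))
print(complementary_bce(T, f, y_bar=[0, 0, 1]))  # complementary label agrees with f
print(complementary_bce(T, f, y_bar=[1, 0, 0]))  # complementary label contradicts f
```

In this toy case the loss is smaller when the complementary label is the one that the multi-label scores consider irrelevant, which is the behavior a complementary loss is meant to reward.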

5 Regularization-Based Enhancement

In this section, we describe an MSE-based regularization of our approach. We also combine a small number of relevant labels with complementary labels to explore further possibilities of our proposal.

5.1 An MSE-Based Regularization

Previous works indicate that the CE loss makes the model focus on hard instances that are difficult to classify precisely, while the MSE loss and the Mean Absolute Error (MAE) loss are less sensitive to hard instances since they treat all instances equally [44, 13]. Owing to this property, the CE loss converges faster than the MSE and MAE losses, but it is also more prone to overfitting than they are when noisy labels are present in the training data [44, 13]. Ideally, an approach should converge quickly during training and also show good generalization ability and robustness on unseen instances [11].

The BCE loss has a similar property to the CE loss, which gives approaches based on it an excellent convergence rate. Meanwhile, approaches based on the BCE loss easily suffer from overfitting when learning from noisily labeled data. Since ML-CLL is a problem setting with dense noisy labels, the BCE loss may cause a model to overfit in ML-CLL. To cope with this problem, we introduce a regularizer based on the MSE loss (i.e., $\ell_{2}$ -norm regularization) to balance the robustness and convergence requirements of the proposed approach. The MSE-based regularizer is defined as:

[TABLE]

Finally, we combine the complementary loss and the MSE-based regularizer term, which leads to our target loss:

[TABLE]

where $\beta$ is the trade-off parameter, set to 1 (its selection is discussed in Section 6). The whole procedure of the proposed approach (called MLCL) is shown in Algorithm 1.

5.2 Incorporation of Relevant Labels

In many practical situations, both complementary labels and relevant labels are available and can be used together to learn more accurate classifiers. To this end, motivated by [6, 45], we design a combination of the losses derived from complementary labeled data and relevant labeled data:

[TABLE]

where $\bm{\tilde{y}}=[\tilde{y}^{1},\dots,\tilde{y}^{K}]\in\{0,1\}^{K}$ denotes a binary vector of the given relevant labels $\tilde{Y}$ of $\bm{x}$ , in which $\tilde{y}^{k}=1$ when the label $l_{k}\in\tilde{Y}$ . For greater practicality, we do not require the given relevant labels $\tilde{Y}$ to equal the full set of relevant labels $Y$ ; we only require $\tilde{Y}\subseteq Y$ and $\tilde{Y}\neq\emptyset$ .

As explained in the introduction, we can naturally collect data associated with complementary labels and relevant labels via crowdsourcing [14]. Our loss function Eq.(12) can leverage both kinds of labeled data to learn better classifiers. We will experimentally show the usefulness of this combination method in Section 6.

6 Experiments

In this section, we evaluate the effectiveness of MLCL using five common MLL criteria: ranking loss, hamming loss, one error, coverage and average precision. For the first four criteria, smaller values indicate better performance, while for average precision, larger values indicate better performance. The label set of $\bm{x}$ is predicted by $Y=\{l_{k}|f^{k}(\bm{x})>0.5,1\leq k\leq K\}$ . All experiments are implemented with PyTorch [46] and run on an NVIDIA Tesla K80 GPU. The code will be released after this paper has been accepted.
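The prediction rule above and one of the five criteria can be sketched in a few lines. The threshold of 0.5 comes from the text; the hamming loss shown is the standard per-instance definition (fraction of label positions on which prediction and ground truth disagree), and the function names and 0-based label indices are illustrative assumptions.

```python
def predict_label_set(scores, threshold=0.5):
    """Return the 0-based indices k with scores[k] strictly above the threshold."""
    return {k for k, s in enumerate(scores) if s > threshold}


def hamming_loss(pred, true, K):
    """Fraction of the K label positions on which the two label sets disagree."""
    return len(pred ^ true) / K  # symmetric difference counts the mismatches


pred = predict_label_set([0.9, 0.5, 0.2, 0.7])
print(pred)                             # {0, 3}
print(hamming_loss(pred, {0, 1}, K=4))  # 0.5
```

Note that a score of exactly 0.5 is not predicted as relevant, because the rule uses a strict inequality.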

6.1 Experimental Settings

Datasets. We use eight widely used MLL datasets, namely corel5k, corel16k, delicious, eurlex_dc, eurlex_sm, yeast, bookmarks and scene, in our experiments (footnote 2: publicly available at http://mulan.sourceforge.net/datasets). Following [35, 36], we adopt the same pre-processing. More specifically, for datasets with more than 15 class labels, rare class labels are filtered out so that the number of class labels is kept under 15. Accordingly, instances that are relevant to removed class labels are filtered out as well. Detailed characteristics of these datasets are shown in Table I.
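One possible reading of this pre-processing step is sketched below: keep the most frequent labels and drop every instance that carries a removed label. The exact cutoff and tie-breaking used in [35, 36] are not stated here, so the parameter `max_labels` (default 15), the frequency-based ranking, and the function name are assumptions of this sketch.

```python
from collections import Counter


def filter_rare_labels(label_sets, max_labels=15):
    """Keep the max_labels most frequent labels and the instances that use only those.

    Returns (indices of kept instances, set of kept labels).
    """
    counts = Counter(label for Y in label_sets for label in Y)
    if len(counts) <= max_labels:
        return list(range(len(label_sets))), set(counts)  # nothing to filter
    kept = {label for label, _ in counts.most_common(max_labels)}
    # An instance survives only if none of its relevant labels was removed.
    kept_idx = [i for i, Y in enumerate(label_sets) if set(Y) <= kept]
    return kept_idx, kept


toy = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
print(filter_rare_labels(toy, max_labels=2))
```

In the toy call the label "c" is the rarest, so it is removed together with the third instance, which is relevant to it.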

Base models. The linear model is used as the base model.

Baselines. Two typical MLL approaches, ML-KNN [21] and LIFT [47], are used as baselines; they handle ML-CLL by regarding all labels in the candidate label set of a training instance as relevant labels. Similarly, three recent PML approaches, PML-lc [35], fpml [38] and PML-LRS [37], are employed for comparison; they learn from training instances associated with candidate labels. In addition, we employ a multi-class CLL approach, L-UW [10], as a baseline, replacing its CE loss and softmax output layer with the BCE loss and a sigmoid output layer, respectively, to make L-UW suitable for multi-labeled data.

6.2 Comparison on Uniform Complementary Labels

Setup. Weight decay is set to $1e-4$ and the learning rate is selected from $\{1e-1,1e-2,1e-3\}$ for all datasets. We employ the Adam [48] optimization method, and set the batch size and the number of epochs to 256 and 200, respectively. L-UW uses the same model and hyper-parameters as ours. Here, we estimate $\mathbf{T}$ with a linear model. We use ten-fold cross-validation to evaluate all approaches, where the training data is associated with complementary labels generated by randomly selecting one of the possible labels other than the relevant labels (uniform complementary labels), and the test data is equipped with the set of relevant labels. The mean metric value and standard deviation (std) are reported as the final experimental results for all approaches.
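The generation of uniform complementary labels described in this setup amounts to drawing one label uniformly at random from the labels that are not relevant. A minimal sketch follows; the function name, the 0-based label indices and the optional `rng` argument are illustrative assumptions.

```python
import random


def sample_uniform_complementary(relevant, K, rng=random):
    """Draw one complementary label uniformly from the labels outside `relevant`."""
    candidates = [k for k in range(K) if k not in relevant]
    if not candidates:
        # Excluded in the problem setup, where the relevant set is never the full label set.
        raise ValueError("no irrelevant label to choose from")
    return rng.choice(candidates)


rng = random.Random(0)  # fixed seed so the draw is reproducible
print(sample_uniform_complementary({0, 2}, K=5, rng=rng))
```

Passing a seeded `random.Random` instance keeps the generated training labels identical across cross-validation runs.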

Results. Table II reports the experimental results of the various approaches on the eight datasets equipped with uniform complementary labels. $\uparrow/\downarrow$ indicates that larger/smaller values mean better performance.

According to the results in Table II, MLCL achieves superior or comparable performance against the baselines across the datasets on the five criteria. Our approach achieves the best performance in most cases. Specifically, the proposed approach outperforms LIFT on all eight datasets across all metrics. This is because our approach is better than fully supervised MLL algorithms at handling training data whose candidate labels contain relevant and irrelevant labels simultaneously. Furthermore, the results of PML-lc and PML-LRS are inferior to ours in most cases, which demonstrates that PML approaches are indeed inferior to our approach in cases of dense noisy labels. Similarly, based on the results of L-UW shown in Table II, our approach outperforms L-UW on almost all datasets and metrics, the exceptions being ranking loss and coverage on the delicious dataset. This reflects that label correlations are important for solving ML-CLL problems: the proposed approach, which takes label correlations into account, surpasses L-UW, which ignores them.

6.3 Comparison on Biased Complementary Labels

Setup. To evaluate the effectiveness of our approach in different situations, we use training data with biased complementary labels that are generated according to the co-occurrence rate of relevant labels. Specifically, we select a complementary label of an instance $\bm{x}$ from $\mathcal{Y}\setminus Y$ according to the following rule: a class label with a lower co-occurrence rate has a higher probability of being selected as the complementary label. We train the model on training data with biased complementary labels, while the test data is equipped with relevant label sets to evaluate the effectiveness of our approach. The other experimental settings are the same as in Subsection 6.2.
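The selection rule is stated only qualitatively, so the sketch below fixes one concrete choice purely for illustration: each irrelevant label is weighted by one minus its average co-occurrence rate with the relevant labels. This weighting, the small constant that keeps weights positive, the function name, and the toy co-occurrence matrix are all assumptions, not the authors' procedure.

```python
import random


def sample_biased_complementary(relevant, cooc, K, rng=random):
    """Draw a complementary label, favoring labels that rarely co-occur with `relevant`.

    cooc[k][j] is a co-occurrence rate in [0, 1] between labels k and j.
    `relevant` must be non-empty, as in the problem setup.
    """
    candidates = [k for k in range(K) if k not in relevant]
    weights = []
    for j in candidates:
        avg = sum(cooc[k][j] for k in relevant) / len(relevant)
        weights.append(1.0 - avg + 1e-6)  # lower co-occurrence gives a larger weight
    return rng.choices(candidates, weights=weights, k=1)[0]


# Label 1 always co-occurs with label 0, label 2 never does (made-up rates).
cooc = [[1.0, 1.0, 0.0],
        [1.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]
rng = random.Random(0)
draws = [sample_biased_complementary({0}, cooc, K=3, rng=rng) for _ in range(10)]
print(draws)
```

With these toy rates almost every draw returns label 2, the label that never co-occurs with the relevant label.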

Results. The mean and std of the results on the test data are shown in Table III. From Table III, we make the following observations: (1) MLCL achieves superior or comparable performance to LIFT, fpml, PML-lc, PML-LRS and L-UW on the different datasets, which indicates that the proposed approach can predict a proper label set for unseen instances from complementary labeled data; (2) although MLCL fails to achieve the best result on the scene dataset, it is better than the other baselines on the remaining datasets, which indicates that our approach deals with ML-CLL problems more effectively than the others. These observations demonstrate that the proposed method works for data with both uniform and biased complementary labels.

6.4 Additional Experiments

Ablation experiments. We then explore the effect of the different learning components on MLCL performance. Table IV summarizes the results of MLCL with individual components removed, trained on data with uniform complementary labels. In Table IV, without $\mathbf{C}$ means that MLCL directly uses the estimated initial transition matrix $\mathbf{S}$ for training, and without $\bar{L}_{mse}$ means that MLCL is optimized with Eq.(9) only.

From the results reported in Table IV, the full MLCL surpasses the variants without individual components in most cases, which shows that both components, the correction with label correlations and the MSE-based regularizer, help improve the performance of our approach. In particular, estimating $\mathbf{T}$ based on label correlations improves the performance significantly compared with the variant without $\mathbf{C}$ in most cases. Similarly, the MSE-based regularizer brings significant benefits, which suggests that it balances the robustness and convergence rate of the BCE loss. These results indicate that using label correlations to estimate the transition matrix $\mathbf{T}$ and adding an MSE-based regularizer are effective strategies for ML-CLL problems.

Trade-off parameter $\beta$ . Table V reports the performance of MLCL with varying values of $\beta$ , which trades off the complementary loss function $\bar{L}$ and the MSE-based regularization $\bar{L}_{mse}$ . Here, average precision is used as the criterion, and the training data has uniform complementary labels. $\beta$ is selected from the candidate list $\{0.1,0.3,0.5,0.8,1\}$ . We observe that the best results on most datasets are achieved at $\beta=1$ and that the performance drops when $\beta$ takes a smaller value. In general, a relatively large $\beta$ $(\beta\leq 1)$ usually leads to better performance than a small value. Therefore, we set $\beta=1$ for MLCL.

6.5 Combination of Complementary Labels and Relevant Labels

Setup. Finally, we demonstrate the effectiveness of combining relevant labeled data with complementary labeled data. The training data is associated with uniform complementary labels and relevant labels simultaneously. More specifically, an instance $\bm{x}$ is associated with a complementary label $\bar{y}$ and relevant labels $\tilde{Y}$ , where $\bar{y}$ is uniformly selected and $\tilde{Y}$ is randomly selected from the relevant label set $Y$ of $\bm{x}$ (i.e., $\tilde{Y}\subseteq Y$ ). Here, we set $|\tilde{Y}|=1$ , which means that each instance is associated with only one complementary label and one relevant label. The other experimental settings are the same as in Subsection 5.2.
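The data generation described in this setup can be sketched as follows. This is an illustration under two assumptions: "uniformly selected" is read as uniform over the labels that are not relevant to the instance, and the single relevant label is drawn uniformly from $Y$. The function name `sample_cl_and_rl` and the toy label matrix are hypothetical.

```python
import random

def sample_cl_and_rl(label_matrix, rng):
    """For each row of a binary relevance matrix, draw one complementary
    label (from the irrelevant labels) and one relevant label (from Y)."""
    pairs = []
    for row in label_matrix:
        relevant = [k for k, v in enumerate(row) if v == 1]
        irrelevant = [k for k, v in enumerate(row) if v == 0]
        if not relevant or not irrelevant:
            raise ValueError("need at least one relevant and one irrelevant label")
        # ASSUMPTION: both draws are uniform over their candidate sets.
        pairs.append((rng.choice(irrelevant), rng.choice(relevant)))
    return pairs

rng = random.Random(0)
# Toy data: 3 instances, 4 labels; 1 marks a relevant label.
Y = [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 1]]
pairs = sample_cl_and_rl(Y, rng)
for row, (comp, rel) in zip(Y, pairs):
    print(f"labels={row} complementary={comp} relevant={rel}")
```

Each returned pair holds the index of the complementary label $\bar{y}$ followed by the index of the single element of $\tilde{Y}$.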

Results. We compare three methods: (1) the “Fully supervised” method trains the linear model on fully supervised data, i.e., fully supervised MLL; (2) the “CL” method refers to MLCL trained on data with uniform complementary labels; (3) the combination (“CL & RL”) method trains the linear model with the loss function in Eq.(12), where the training data carries both complementary labels and relevant labels. Table VI reports the experimental results on five criteria. The “CL & RL” method is clearly superior to the “CL” method on all datasets in terms of hamming loss, ranking loss, one error, coverage and average precision; for example, “CL & RL” outperforms “CL” by a large margin on average precision (+0.436 on eurlex_dc and +0.425 on eurlex_sm). This suggests that ML-CLL can easily be applied to fully supervised MLL scenarios, MLL with missing labels [49, 50] or other MLL scenarios. Moreover, the “CL & RL” method achieves performance comparable to the “Fully supervised” method, which illustrates that ML-CLL can obtain excellent results with only a small amount of additional information. This is useful for real-world applications, because ML-CLL can achieve good performance with less expensive labeled data.

7 Conclusion

In this paper, we theoretically analyze why the estimated transition matrix in multi-class CLL is distorted in ML-CLL. Since the transition matrix cannot be calculated directly from complex label correlations when multi-labeled data is unavailable, we propose a two-step method to estimate the transition matrix $\mathbf{T}$ in ML-CLL, which uses label correlations to correct an initial transition matrix. Furthermore, we theoretically show that the proposed approach is classifier-consistent. Additionally, because the MSE loss is notably robust, an MSE-based regularizer is introduced to alleviate the tendency of the fast-converging BCE loss to overfit to noise. Finally, we show that the proposed ML-CLL can be easily combined with relevant labels and that the proposed method can achieve performance comparable to fully supervised MLL with a small amount of additional information.

Appendix A The Proof of Theorem 1

Theorem 1. Given an instance $\bm{x}$ , suppose $Y$ is the relevant label set and the label $l_{j}$ is the complementary label which is randomly selected. Then the following equality holds:

[TABLE]

Proof.

First, we recall the addition rule of probability: $p(AB)=p(A)+p(B)-p(A\cup B)$ , so we have $p(AB)\geq p(A)+p(B)$ . We now prove the above inequality. According to the assumption $p(\bar{y}|Y)=p(\bar{y}|Y,\bm{x})$ , we have

[TABLE]

By the addition rule of probability, we have

[TABLE]

∎

Appendix B The Proof of Theorem 3

Theorem 3. Under an MLL scenario: suppose the labels $l_{z_{1}},l_{z_{2}}\in\mathcal{Y}$ ( $z_{1},z_{2}\in[K],z_{1}\neq z_{2}$ ) are dependent, and the labels belonging to $\mathcal{Y}\setminus\{l_{z_{1}},l_{z_{2}}\}$ are mutually exclusive. For any $\bm{x}\in\mathcal{X}$ , its label set satisfies $Y\subseteq\{l_{z_{1}},l_{z_{2}}\}$ and $Y\neq\emptyset$ . Let the label $l_{j}$ ( $j\in[K],j\neq z_{1},z_{2}$ ) be the complementary label of $\bm{x}$ . Then $\mathbf{T}_{z_{1}j}$ and $\mathbf{T}_{z_{2}j}$ calculated from label correlations satisfy

[TABLE]

where $[K]$ denotes the integer set $\{1,2,\dots,K\}$ . The difference between $\mathbf{T}$ and $\mathbf{Q}$ on the complementary label $l_{j}$ is

[TABLE]

where $\xi=\max\{p(y^{z_{2}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\bm{x}),p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\bm{x})\}$ .

Proof.

We start calculating the difference $\ell_{j}$ by estimating the transition probabilities $\mathbf{T}_{z_{1}j}$ and $\mathbf{T}_{z_{2}j}$ . According to Definition 2 and the description of Theorem 3, we have

[TABLE]

Based on the assumption that $\bar{y}$ and $\bm{x}$ are conditionally independent given $Y$ , we have

[TABLE]

Since $p(\bar{y}^{j}=1|y^{z_{1}}=0)$ and $p(\bar{y}^{j}=1|y^{z_{2}}=0)$ do not arise under the definition of the transition matrix, we obtain

[TABLE]

Similarly, we can get

[TABLE]

Next, we calculate the difference $\ell_{j}$ . The remaining elements of $\mathbf{T}_{\cdot j}$ are the same as those estimated by multi-class CLL. According to the definition of $\ell_{j}$ , we have

[TABLE]

Because $0\leq p(y^{z_{1}}=1|\bm{x})\leq p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\bm{x})\leq 1$ and $0\leq p(y^{z_{2}}=1|\bm{x})\leq p(y^{z_{2}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\bm{x})\leq 1$ , and $\xi$ is defined as $\xi=\mathrm{max}\{p(y^{z_{2}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\bm{x}),p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\bm{x})\}$ , the above inequality holds. ∎
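As a small numerical illustration of the quantity bounded in this proof, the sketch below evaluates the column-wise difference $\ell_{j}=\sum_{k=1}^{K}|\mathbf{T}_{kj}-\mathbf{Q}_{kj}|$ for two given matrices. The function name `column_distortion` and the matrix entries are hypothetical toy values, not estimates from any dataset.

```python
def column_distortion(T, Q, j):
    """Return l_j = sum_k |T[k][j] - Q[k][j]| for matrices given as lists of rows."""
    if len(T) != len(Q):
        raise ValueError("T and Q must have the same number of rows")
    return sum(abs(t_row[j] - q_row[j]) for t_row, q_row in zip(T, Q))

# Toy matrices (illustrative only): Q plays the role of the multi-class
# estimate, T the correlation-aware one; they differ in rows 0 and 1.
Q = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
T = [[0.0, 0.3, 0.7],
     [0.3, 0.0, 0.7],
     [0.5, 0.5, 0.0]]

for j in range(3):
    print(f"l_{j} = {column_distortion(T, Q, j):.2f}")
```

Here only the entries for the two dependent labels contribute to $\ell_{2}$ , which mirrors the observation that the remaining elements of $\mathbf{T}_{\cdot j}$ coincide with the multi-class estimate.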

Appendix C The Proof of Corollary 4

Corollary 4. Under an MLL scenario: there are $m$ ( $m\geq 2$ ) labels $l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\in\mathcal{Y}$ $(z_{1},\dots,z_{m}\in[K])$ that are dependent, while the labels belonging to $\mathcal{Y}\setminus\{l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\}$ are mutually exclusive. For any $\bm{x}\in\mathcal{X}$ , its relevant label set satisfies $Y\subseteq\{l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\}$ and $Y\neq\emptyset$ . Suppose the label $l_{j}$ is the complementary label of $\bm{x}$ . The difference $\ell_{j}$ between $\mathbf{T}$ and $\mathbf{Q}$ satisfies

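$\ell_{j}=\sum_{k=1}^{K}|\mathbf{T}_{kj}-\mathbf{Q}_{kj}|\geq m(\frac{1}{\xi^{m}}-1)p(\bar{y}^{j}=1|\bm{x}),$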

where $\xi=\mathrm{max}\{p(y^{z_{m}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\dots,y^{z_{m-1}}=1,\bm{x}),p(y^{z_{m-1}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\dots,y^{z_{m-2}}=1,y^{z_{m}}=1,\bm{x}),\dots,p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\dots,y^{z_{m}}=1,\bm{x})\}$ $(\xi\in(0,1])$ .

Proof.

Here, we apply induction to obtain the difference as $m$ increases. We start by computing the difference in the case of $m=3$ . Suppose class labels $l_{z_{1}},l_{z_{2}},l_{z_{3}}\in\mathcal{Y}$ are dependent, while the rest of the labels in the label space are mutually exclusive. $\bm{x}$ is associated with $Y\subseteq\{l_{z_{1}},l_{z_{2}},l_{z_{3}}\}$ and $Y\neq\emptyset$ . Then we calculate the transition probabilities in $\mathbf{T}$ from label correlations according to Theorem 3 as:

[TABLE]

$\mathbf{T}_{z_{2}j}$ and $\mathbf{T}_{z_{3}j}$ are estimated in the same way. Since $0\leq p(y^{z_{1}}=1|\bm{x})\leq p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\bm{x})\leq p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,y^{z_{3}}=1,\bm{x})\leq 1$ , letting $\xi=\mathrm{max}\{p(y^{z_{3}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,y^{z_{2}}=1,\bm{x}),p(y^{z_{2}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,y^{z_{3}}=1,\bm{x}),p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,y^{z_{3}}=1,\bm{x})\}$ , we obtain

[TABLE]

Similarly, we can compute $\mathbf{T}_{z_{2}j},\mathbf{T}_{z_{3}j}\geq\frac{1}{\xi^{3}}p(\bar{y}^{j}=1|\bm{x})$ . Then the difference $\ell_{j}$ is

[TABLE]

Similarly, for any $m$ $(0<m<K)$ , suppose class labels $l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\in\mathcal{Y}$ are strongly dependent, while the rest of the labels in the label space are mutually exclusive. $\bm{x}$ is associated with $Y\subseteq\{l_{z_{1}},l_{z_{2}},\dots,l_{z_{m}}\}$ and $Y\neq\emptyset$ . Then we calculate the transition probabilities from label correlations:

[TABLE]

As discussed above, $\mathbf{T}_{z_{1}j}\geq\frac{1}{\xi^{m}}p(\bar{y}^{j}=1|\bm{x})$ since $\xi=\mathrm{max}\{p(y^{z_{m}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\dots,y^{z_{m-1}}=1,\bm{x}),p(y^{z_{m-1}}=1|\bar{y}^{j}=1,y^{z_{1}}=1,\dots,y^{z_{m-2}}=1,y^{z_{m}}=1,\bm{x}),\dots,p(y^{z_{1}}=1|\bar{y}^{j}=1,y^{z_{2}}=1,\dots,y^{z_{m}}=1,\bm{x})\}$ $(\xi\in(0,1])$ . By the same calculation, we obtain $\mathbf{T}_{z_{2}j},\dots,\mathbf{T}_{z_{m}j}\geq\frac{1}{\xi^{m}}p(\bar{y}^{j}=1|\bm{x})$ . By induction, we conclude that the difference satisfies $\ell_{j}=\sum_{k=1}^{K}|\mathbf{T}_{kj}-\mathbf{Q}_{kj}|\geq m(\frac{1}{\xi^{m}}-1)p(\bar{y}^{j}=1|\bm{x})$ . ∎
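To get a feel for how this lower bound behaves, the sketch below evaluates $m(\frac{1}{\xi^{m}}-1)p(\bar{y}^{j}=1|\bm{x})$ for a few values of $m$ . The function name `distortion_lower_bound` and the chosen values of $\xi$ and $p(\bar{y}^{j}=1|\bm{x})$ are hypothetical and picked only for illustration.

```python
def distortion_lower_bound(m, xi, p_bar_j):
    """Evaluate m * (1 / xi**m - 1) * p_bar_j, the lower bound on l_j."""
    if m < 2:
        raise ValueError("the corollary assumes m >= 2 dependent labels")
    if not 0.0 < xi <= 1.0:
        raise ValueError("xi must lie in (0, 1]")
    return m * (1.0 / xi ** m - 1.0) * p_bar_j

# Illustrative values only: xi = 0.8 and p(ybar^j = 1 | x) = 0.1.
for m in (2, 3, 4, 5):
    print(f"m={m}: bound={distortion_lower_bound(m, 0.8, 0.1):.4f}")
```

For a fixed $\xi<1$ the bound grows with $m$ , and it vanishes when $\xi=1$ .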

Appendix D The Proof of Theorem 6

Theorem 6. With Assumption 5, suppose the transition matrix $\mathbf{T}$ is invertible, then the ML-CLL optimal classifier $\bm{f}_{CL}^{*}$ converges to the MLL optimal classifier $\bm{f}^{*}$ , i.e., $\bm{f}_{CL}^{*}=\bm{f}^{*}$ .

Proof.

We prove $\bm{f}^{*}$ is also the optimal classifier for ML-CLL via substituting $\bm{f}^{*}$ into the ML-CLL risk:

[TABLE]

According to the proof of [8], $\bm{f}_{CL}^{*}=\mathbf{T}^{T}\bm{f}^{*}$ . So we find the optimal $\bm{f}^{*}$ ensuring $\bm{f}_{CL}^{*}=\bm{f}^{*}$ when the transition matrix $\mathbf{T}$ is invertible and Assumption 5 is satisfied. ∎
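The role of invertibility in this argument can be illustrated numerically. The sketch below takes the relation $\mathbf{T}^{T}\bm{f}$ from the proof at face value, applies it to a toy output vector, and then recovers that vector by solving the linear system, which has a unique solution only when $\mathbf{T}$ is invertible. The matrix entries, the vector `f`, and the helper names are hypothetical.

```python
def transpose_matvec(T, f):
    """Return T^T f for a square matrix T given as a list of rows."""
    K = len(T)
    return [sum(T[k][j] * f[k] for k in range(K)) for j in range(K)]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Toy invertible transition matrix (illustrative values only).
T = [[0.0, 0.6, 0.4],
     [0.5, 0.0, 0.5],
     [0.3, 0.7, 0.0]]
f = [0.9, 0.6, 0.1]            # toy per-label outputs of a classifier
q = transpose_matvec(T, f)     # outputs on the complementary-label side
T_t = [[T[k][j] for k in range(3)] for j in range(3)]
recovered = solve(T_t, q)      # unique because T is invertible
print(q, recovered)
```

If $\mathbf{T}$ were singular, several different vectors would map to the same `q`, and `solve` would raise an error instead of returning a unique answer.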

References

[1] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, 2014.
[2] M.-L. Zhang and L. Wu, “Lift: Multi-label learning with label-specific features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 107–120, 2015.
[3] T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers, “Statistical topic models for multi-label document classification,” Mach. Learn., vol. 88, no. 1-2, pp. 157–208, 2012.
[4] P.-J. Tang, M. Jiang, B. N. Xia, J. W. Pitera, J. Welser, and N. V. Chawla, “Multi-label patent categorization with non-local attention-based graph convolutional network,” in Proceedings of the 34th Conference on Artificial Intelligence, New York, NY, 2020, pp. 9024–9031.
[5] A. Lambrecht and C. Tucker, “When does retargeting work? Information specificity in online advertising,” Journal of Marketing Research, vol. 50, no. 5, pp. 561–576, 2013.
[6] T. Ishida, G. Niu, W.-H. Hu, and M. Sugiyama, “Learning from complementary labels,” in Advances in Neural Information Processing Systems 30, Long Beach, CA, 2017, pp. 5639–5649.
[7] T. Ishida, G. Niu, A. K. Menon, and M. Sugiyama, “Complementary-label learning for arbitrary losses and models,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, Long Beach, CA, 2019, pp. 2971–2980.
[8] X.-Y. Yu, T.-L. Liu, M.-M. Gong, and D.-C. Tao, “Learning with biased complementary labels,” in Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018, pp. 69–85.
