Complementary to Multiple Labels: A Correlation-Aware Correction Approach
Yi Gao, Miao Xu, Min-Ling Zhang

TL;DR
This paper introduces a correlation-aware correction method for multi-labeled complementary label learning, addressing the challenge of estimating transition matrices without multi-labeled data and improving multi-label classification accuracy.
Contribution
It proposes a novel two-step transition matrix estimation approach that incorporates label correlations, enhancing multi-label complementary label learning performance.
Findings
The proposed method outperforms existing approaches in experiments.
The correction of transition matrices improves multi-label classification accuracy.
The approach is classifier-consistent and mitigates noise overfitting.
Abstract
Complementary label learning (CLL) requires annotators to give irrelevant labels instead of relevant labels for instances. Currently, CLL has shown its promising performance on multi-class data by estimating a transition matrix. However, current multi-class CLL techniques cannot work well on multi-labeled data since they assume each instance is associated with one label while each multi-labeled instance is relevant to multiple labels. Here, we show theoretically how the estimated transition matrix in multi-class CLL could be distorted in multi-labeled cases as they ignore co-existing relevant labels. Moreover, theoretical findings reveal that calculating a transition matrix from label correlations in multi-labeled CLL (ML-CLL) needs multi-labeled data, while this is unavailable for ML-CLL. To solve this issue, we propose a two-step method to estimate the…
| Datasets | # Instances | # Features | # Labels | Avg. relevant labels per instance |
|---|---|---|---|---|
| scene | 2407 | 294 | 6 | 1.07 |
| yeast | 2417 | 103 | 14 | 4.23 |
| eurlex_dc | 8636 | 5000 | 15 | 1.02 |
| eurlex_sm | 13270 | 5000 | 15 | 1.74 |
| corel5k | 4194 | 499 | 15 | 1.70 |
| corel16k | 11103 | 120 | 15 | 1.77 |
| bookmark | 38912 | 2150 | 15 | 1.25 |
| delicious | 14784 | 500 | 15 | 4.32 |

| Methods (uniform complementary labels) | ML-KNN | LIFT | fpml | PML-lc | PML-LRS | L-UW | MLCL |
|---|---|---|---|---|---|---|---|
| Ranking loss | |||||||
| scene | .340±.032 | .289±.020 | .504±.025 | .490±.025 | .258±.007 | .372±.028 | .259±.030 |
| yeast | .247±.012 | .298±.012 | .233±.013 | .251±.015 | .464±.019 | .214±.011 | .211±.013 |
| eurlex_dc | .303±.016 | .286±.016 | .488±.033 | .347±.025 | .316±.011 | .598±.024 | .229±.026 |
| eurlex_sm | .336±.010 | .346±.012 | .488±.006 | .436±.011 | .332±.009 | .646±.015 | .312±.014 |
| corel5k | .379±.034 | .433±.037 | .444±.026 | .406±.075 | .334±.009 | .367±.031 | .349±.035 |
| corel16k | .328±.047 | .392±.027 | .420±.033 | .457±.046 | .303±.005 | .303±.035 | .289±.042 |
| bookmark | .384±.006 | .310±.007 | .469±.019 | .454±.036 | .260±.004 | .303±.010 | .252±.013 |
| delicious | .398±.004 | .383±.003 | .438±.008 | .445±.015 | .305±.002 | .302±.006 | .310±.003 |
| One Error | |||||||
| scene | .692±.030 | .605±.023 | .815±.027 | .717±.021 | .540±.023 | .609±.041 | .427±.018 |
| yeast | .297±.029 | .284±.028 | .251±.025 | .583±.026 | .738±.102 | .251±.025 | .251±.023 |
| eurlex_dc | .776±.031 | .670±.013 | .925±.016 | .774±.015 | .847±.010 | .837±.034 | .594±.035 |
| eurlex_sm | .689±.012 | .679±.009 | .872±.011 | .662±.012 | .731±.005 | .696±.008 | .656±.029 |
| corel5k | .815±.048 | .842±.056 | .854±.035 | .811±.062 | .756±.010 | .769±.034 | .736±.065 |
| corel16k | .736±.056 | .789±.046 | .816±.025 | .946±.028 | .730±.000 | .693±.057 | .690±.056 |
| bookmark | .801±.006 | .649±.016 | .885±.020 | .798±.005 | .584±.005 | .590±.022 | .509±.012 |
| delicious | .592±.018 | .533±.015 | .618±.017 | .679±.011 | .452±.007 | .467±.023 | .448±.016 |
| Hamming loss | |||||||
| scene | .820±.002 | .820±.003 | .819±.002 | .251±.007 | .814±.000 | .518±.042 | .264±.027 |
| yeast | .697±.012 | .697±.013 | .697±.013 | .268±.010 | .316±.000 | .243±.010 | .235±.008 |
| eurlex_dc | .932±.000 | .932±.000 | .118±.006 | .104±.002 | .890±.039 | .806±.015 | .092±.005 |
| eurlex_sm | .883±.001 | .883±.001 | .148±.005 | .138±.002 | .825±.027 | .773±.008 | .139±.005 |
| corel5k | .886±.007 | .887±.007 | .887±.007 | .155±.004 | .869±.002 | .463±.018 | .229±.068 |
| corel16k | .882±.009 | .882±.009 | .882±.009 | .177±.011 | .862±.001 | .423±.033 | .202±.067 |
| bookmark | .917±.001 | .916±.001 | .420±.009 | .123±.001 | .813±.001 | .409±.014 | .140±.004 |
| delicious | .711±.003 | .711±.003 | .711±.003 | .394±.011 | .459±.002 | .369±.027 | .289±.004 |
| Coverage | |||||||
| scene | .299±.026 | .256±.017 | .434±.021 | .420±.021 | .230±.006 | .328±.022 | .234±.025 |
| yeast | .579±.018 | .649±.020 | .553±.033 | .506±.023 | .742±.027 | .525±.017 | .525±.021 |
| eurlex_dc | .285±.014 | .269±.015 | .458±.031 | .326±.023 | .298±.010 | .334±.017 | .204±.023 |
| eurlex_sm | .416±.010 | .427±.013 | .569±.010 | .509±.013 | .419±.010 | .519±.008 | .365±.014 |
| corel5k | .473±.034 | .516±.035 | .529±.028 | .492±.072 | .429±.008 | .457±.038 | .445±.048 |
| corel16k | .430±.044 | .488±.027 | .513±.035 | .537±.051 | .405±.008 | .407±.033 | .393±.042 |
| bookmark | .359±.007 | .328±.008 | .475±.019 | .458±.035 | .280±.004 | .292±.011 | .279±.011 |
| delicious | .712±.006 | .703±.004 | .726±.009 | .695±.009 | .609±.003 | .613±.006 | .632±.007 |
| Average Precision | |||||||
| scene | .543±.024 | .600±.017 | .417±.021 | .465±.018 | .637±.011 | .568±.026 | .699±.017 |
| yeast | .677±.019 | .636±.017 | .688±.017 | .610±.016 | .459±.032 | .712±.020 | .718±.019 |
| eurlex_dc | .412±.018 | .471±.012 | .232±.022 | .373±.015 | .346±.009 | .250±.031 | .549±.025 |
| eurlex_sm | .419±.010 | .421±.010 | .273±.006 | .367±.009 | .402±.005 | .285±.009 | .474±.017 |
| corel5k | .355±.035 | .307±.038 | .297±.023 | .330±.044 | .397±.010 | .371±.028 | .391±.037 |
| corel16k | .405±.050 | .350±.035 | .325±.022 | .248±.026 | .424±.006 | .437±.044 | .449±.049 |
| bookmark | .383±.007 | .480±.010 | .267±.019 | .329±.016 | .534±.004 | .506±.014 | .584±.013 |
| delicious | .487±.006 | .511±.004 | .457±.006 | .446±.010 | .580±.002 | .570±.009 | .572±.005 |

| Methods (biased complementary labels) | ML-KNN | LIFT | fpml | PML-lc | PML-LRS | L-UW | MLCL |
|---|---|---|---|---|---|---|---|
| Ranking loss | |||||||
| scene | .086±.015 | .319±.025 | .486±.027 | .492±.019 | .258±.013 | .368±.025 | .326±.050 |
| yeast | .240±.014 | .297±.016 | .227±.013 | .248±.012 | .454±.024 | .202±.012 | .199±.012 |
| eurlex_dc | .668±.009 | .636±.021 | .537±.015 | .349±.028 | .326±.009 | .586±.036 | .308±.034 |
| eurlex_sm | .364±.020 | .392±.014 | .499±.019 | .447±.012 | .333±.009 | .641±.015 | .316±.016 |
| corel5k | .324±.038 | .431±.030 | .474±.028 | .386±.047 | .357±.012 | .382±.033 | .358±.039 |
| corel16k | .413±.063 | .431±.041 | .454±.033 | .471±.068 | .375±.015 | .373±.029 | .357±.040 |
| bookmark | .567±.007 | .449±.042 | .552±.018 | .491±.016 | .244±.003 | .326±.008 | .211±.011 |
| delicious | .430±.005 | .413±.005 | .452±.008 | .433±.011 | .314±.003 | .349±.012 | .360±.008 |
| One Error | |||||||
| scene | .228±.032 | .669±.043 | .803±.038 | .720±.018 | .613±.017 | .696±.025 | .553±.054 |
| yeast | .330±.032 | .280±.025 | .254±.028 | .583±.027 | .546±.097 | .256±.025 | .254±.024 |
| eurlex_dc | .977±.005 | .959±.014 | .947±.008 | .774±.015 | .822±.004 | .822±.038 | .695±.074 |
| eurlex_sm | .699±.016 | .753±.036 | .886±.024 | .664±.014 | .737±.011 | .704±.012 | .650±.045 |
| corel5k | .738±.067 | .851±.038 | .861±.034 | .828±.059 | .747±.016 | .792±.039 | .752±.037 |
| corel16k | .780±.061 | .827±.049 | .837±.025 | .952±.021 | .730±.000 | .731±.053 | .707±.063 |
| bookmark | .906±.007 | .804±.037 | .925±.008 | .792±.004 | .576±.003 | .635±.022 | .502±.008 |
| delicious | .585±.012 | .557±.013 | .617±.025 | .681±.012 | .434±.006 | .485±.016 | .463±.017 |
| Hamming loss | |||||||
| scene | .088±.009 | .819±.002 | .820±.002 | .252±.006 | .814±.000 | .523±.048 | .290±.029 |
| yeast | .697±.012 | .697±.013 | .697±.013 | .268±.010 | .316±.000 | .253±.017 | .239±.008 |
| eurlex_dc | .932±.000 | .932±.000 | .118±.007 | .104±.002 | .889±.039 | .799±.035 | .109±.011 |
| eurlex_sm | .883±.001 | .883±.001 | .148±.005 | .139±.002 | .825±.027 | .772±.009 | .138±.007 |
| corel5k | .114±.008 | .887±.007 | .887±.007 | .157±.003 | .869±.002 | .498±.012 | .208±.033 |
| corel16k | .882±.009 | .882±.009 | .882±.009 | .178±.010 | .862±.001 | .481±.028 | .207±.086 |
| bookmark | .917±.001 | .916±.001 | .419±.009 | .122±.001 | .813±.003 | .549±.046 | .146±.003 |
| delicious | .711±.003 | .711±.003 | .711±.003 | .388±.013 | .459±.002 | .453±.015 | .304±.005 |
| Coverage | |||||||
| scene | .086±.013 | .280±.020 | .420±.023 | .420±.016 | .229±.011 | .321±.021 | .286±.041 |
| yeast | .551±.017 | .638±.028 | .533±.012 | .493±.025 | .723±.040 | .500±.018 | .498±.021 |
| eurlex_dc | .626±.008 | .596±.019 | .504±.014 | .328±.026 | .306±.009 | .333±.018 | .274±.030 |
| eurlex_sm | .432±.018 | .456±.014 | .579±.015 | .520±.015 | .418±.009 | .512±.009 | .362±.016 |
| corel5k | .419±.055 | .515±.024 | .555±.031 | .480±.041 | .451±.013 | .470±.036 | .449±.038 |
| corel16k | .498±.052 | .521±.038 | .542±.035 | .533±.066 | .454±.018 | .468±.030 | .453±.039 |
| bookmark | .565±.006 | .455±.039 | .553±.017 | .492±.014 | .265±.003 | .308±.013 | .231±.011 |
| delicious | .736±.004 | .723±.005 | .737±.008 | .691±.009 | .625±.003 | .671±.012 | .688±.006 |
| Average Precision | |||||||
| scene | .860±.020 | .559±.028 | .428±.026 | .462±.014 | .608±.013 | .529±.020 | .618±.046 |
| yeast | .670±.023 | .634±.016 | .691±.022 | .614±.015 | .500±.026 | .719±.020 | .726±.018 |
| eurlex_dc | .145±.005 | .166±.016 | .201±.009 | .371±.020 | .357±.005 | .266±.031 | .456±.061 |
| eurlex_sm | .405±.013 | .373±.016 | .262±.016 | .366±.010 | .400±.007 | .282±.011 | .482±.025 |
| corel5k | .409±.040 | .300±.030 | .282±.017 | .325±.048 | .392±.017 | .352±.032 | .380±.037 |
| corel16k | .355±.054 | .318±.033 | .301±.024 | .240±.030 | .393±.054 | .384±.036 | .407±.047 |
| bookmark | .219±.004 | .320±.037 | .212±.007 | .320±.004 | .544±.003 | .469±.014 | .599±.008 |
| delicious | .473±.006 | .490±.006 | .450±.008 | .449±.010 | .581±.002 | .544±.010 | .544±.009 |

| Methods | Uniform: scene | Uniform: yeast | Uniform: eurlex_dc | Uniform: corel5k | Biased: scene | Biased: yeast | Biased: eurlex_dc | Biased: corel5k |
|---|---|---|---|---|---|---|---|---|
| Hamming loss | ||||||||
| MLCL | .264±.027 | .235±.008 | .092±.005 | .229±.068 | .290±.029 | .239±.008 | .109±.011 | .208±.033 |
| Without | .290±.039 | .421±.011 | .109±.018 | .466±.025 | .294±.029 | .409±.012 | .088±.004 | .444±.031 |
| Without | .510±.044 | .229±.007 | .509±.043 | .461±.053 | .481±.047 | .230±.009 | .512±.046 | .489±.036 |
| Ranking loss | ||||||||
| MLCL | .259±.030 | .211±.013 | .229±.026 | .349±.035 | .326±.050 | .199±.012 | .308±.034 | .358±.039 |
| Without | .282±.063 | .419±.018 | .277±.041 | .487±.021 | .348±.046 | .406±.016 | .268±.024 | .467±.026 |
| Without | .379±.024 | .216±.010 | .303±.028 | .362±.030 | .353±.018 | .204±.011 | .320±.025 | .387±.027 |
| One error | ||||||||
| MLCL | .427±.018 | .251±.023 | .594±.035 | .736±.065 | .553±.054 | .254±.024 | .695±.074 | .752±.037 |
| Without | .474±.047 | .633±.043 | .708±.106 | .866±.019 | .560±.042 | .612±.051 | .564±.029 | .855±.027 |
| Without | .607±.037 | .250±.025 | .740±.048 | .734±.058 | .686±.013 | .256±.025 | .753±.044 | .773±.068 |
| Coverage | ||||||||
| MLCL | .234±.025 | .525±.021 | .204±.023 | .445±.048 | .286±.041 | .498±.021 | .274±.030 | .449±.038 |
| Without | .255±.055 | .683±.029 | .247±.035 | .565±.032 | .306±.039 | .660±.023 | .240±.023 | .547±.031 |
| Without | .334±.020 | .527±.011 | .249±.024 | .451±.035 | .310±.015 | .501±.015 | .265±.022 | .473±.023 |
| Average precision | ||||||||
| MLCL | .699±.017 | .718±.019 | .549±.025 | .391±.037 | .618±.046 | .726±.018 | .456±.061 | .380±.037 |
| Without | .671±.045 | .472±.018 | .469±.085 | .274±.014 | .611±.038 | .489±.015 | .447±.021 | .289±.022 |
| Without | .566±.023 | .711±.019 | .426±.040 | .389±.041 | .541±.013 | .717±.020 | .411±.034 | .359±.050 |

| | scene | yeast | eurlex_dc | eurlex_sm | corel5k | corel16k | bookmark | delicious |
|---|---|---|---|---|---|---|---|---|
| 0.1 | .678±.017 | .714±.019 | .545±.019 | .451±.025 | .374±.033 | .444±.046 | .565±.007 | .554±.005 |
| 0.3 | .683±.015 | .716±.018 | .549±.021 | .460±.021 | .378±.032 | .447±.047 | .579±.011 | .565±.005 |
| 0.5 | .687±.016 | .718±.018 | .547±.022 | .463±.016 | .385±.031 | .447±.048 | .583±.008 | .575±.005 |
| 0.8 | .693±.016 | .718±.018 | .541±.022 | .469±.018 | .387±.037 | .448±.048 | .582±.007 | .572±.006 |
| 1 | .699±.017 | .718±.019 | .549±.025 | .474±.017 | .391±.037 | .449±.049 | .584±.013 | .572±.005 |

| Datasets | scene | yeast | eurlex_dc | eurlex_sm | corel5k | corel16k | bookmark | delicious |
|---|---|---|---|---|---|---|---|---|
| Hamming loss | ||||||||
| Fully supervised | .120±.013 | .208±.009 | .004±.000 | .033±.001 | .198±.012 | .196±.012 | .098±.004 | .276±.006 |
| CL | .264±.027 | .235±.008 | .092±.005 | .139±.005 | .229±.068 | .202±.067 | .140±.004 | .289±.004 |
| CL & RL | .124±.008 | .225±.010 | .005±.001 | .053±.002 | .178±.012 | .172±.010 | .085±.002 | .285±.004 |
| Ranking loss | ||||||||
| Fully supervised | .075±.009 | .169±.009 | .003±.001 | .019±.001 | .258±.029 | .222±.029 | .090±.005 | .226±.004 |
| CL | .259±.030 | .211±.013 | .229±.026 | .312±.014 | .349±.035 | .289±.042 | .252±.013 | .310±.003 |
| CL & RL | .082±.011 | .191±.011 | .005±.001 | .044±.002 | .268±.031 | .227±.021 | .102±.004 | .267±.004 |
| One Error | ||||||||
| Fully supervised | .222±.032 | .223±.023 | .019±.004 | .069±.005 | .627±.038 | .588±.056 | .313±.009 | .340±.012 |
| CL | .427±.018 | .251±.023 | .594±.035 | .656±.029 | .736±.065 | .690±.056 | .509±.012 | .448±.016 |
| CL & RL | .229±.033 | .255±.032 | .022±.005 | .098±.007 | .639±.040 | .600±.044 | .324±.007 | .398±.017 |
| Coverage | ||||||||
| Fully supervised | .077±.009 | .451±.019 | .004±.000 | .074±.002 | .347±.044 | .315±.024 | .112±.005 | .527±.007 |
| CL | .234±.025 | .525±.021 | .204±.023 | .365±.014 | .445±.048 | .393±.042 | .279±.011 | .632±.007 |
| CL & RL | .084±.010 | .474±.021 | .006±.001 | .113±.004 | .363±.048 | .326±.020 | .125±.004 | .564±.006 |
| Average Precision | ||||||||
| Fully supervised | .868±.018 | .760±.015 | .988±.003 | .943±.004 | .494±.024 | .530±.038 | .766±.007 | .662±.005 |
| CL | .699±.017 | .718±.019 | .549±.025 | .474±.017 | .391±.037 | .449±.049 | .584±.013 | .572±.005 |
| CL & RL | .860±.019 | .734±.018 | .985±.004 | .899±.004 | .485±.028 | .523±.030 | .753±.006 | .618±.005 |
Yi Gao, Miao Xu, and Min-Ling Zhang
Yi Gao is with the School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China, and the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. Miao Xu is with The University of Queensland, Australia. Min-Ling Zhang (corresponding author) is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, and the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. Manuscript received April 19, 2005; revised August 26, 2015.
Abstract
Complementary label learning (CLL) requires annotators to give irrelevant labels instead of relevant labels for instances. Currently, CLL has shown its promising performance on multi-class data by estimating a transition matrix. However, current multi-class CLL techniques cannot work well on multi-labeled data since they assume each instance is associated with one label while each multi-labeled instance is relevant to multiple labels. Here, we show theoretically how the estimated transition matrix in multi-class CLL could be distorted in multi-labeled cases as they ignore co-existing relevant labels. Moreover, theoretical findings reveal that calculating a transition matrix from label correlations in multi-labeled CLL (ML-CLL) needs multi-labeled data, while this is unavailable for ML-CLL. To solve this issue, we propose a two-step method to estimate the transition matrix from candidate labels. Specifically, we first estimate an initial transition matrix by decomposing the multi-label problem into a series of binary classification problems, then the initial transition matrix is corrected by label correlations to enforce the addition of relationships among labels. We further show that the proposal is classifier-consistent, and additionally introduce an MSE-based regularizer to alleviate the tendency of BCE loss overfitting to noises. Experimental results have demonstrated the effectiveness of the proposed method.
Index Terms:
Complementary label learning, multi-label learning, transition matrix, label correlations.
1 Introduction
In multi-label learning (MLL), each instance is associated with a set of relevant labels, and the learned classifier aims to predict all relevant labels of unseen instances [1, 2]. MLL is widely used in real-world applications such as text categorization [3, 4] and image retrieval [5]. However, collecting precisely annotated multi-labeled data is laborious, because the number of relevant labels per instance is unknown and some labels are semantically complex. For the example image in Fig. 1, besides the label Architecture, there are other relevant labels whose accurate annotation requires checking the whole label space one by one; in addition, annotators need specialized geographical and cultural knowledge to label the image accurately as Paris.
To relieve the burden of annotating multi-labeled data, we explore the problem setting of multi-labeled CLL (ML-CLL), where each instance is associated with a single complementary label (a label that is irrelevant to the instance) instead of multiple relevant labels. Such weak supervision eases labeling in a large label space, because selecting one complementary label is cheap and requires less domain knowledge than selecting all relevant labels. An example of ML-CLL is given in Fig. 1, where desert is selected as the complementary label. Given the complementary label, the goal of ML-CLL is the same as that of fully supervised MLL, i.e., learning a model that accurately predicts the multiple relevant labels of unseen instances.
CLL was initially studied for multi-class learning [6, 7, 8, 9, 10, 11, 12]. Previous multi-class CLL approaches rely on an estimated transition matrix that summarizes the probability of a label being selected as a complementary label [6, 7, 8]. Although they achieve promising performance on multi-class data, they are restricted to the case where an instance has only one relevant label. They therefore consider only the mutually exclusive relationship among labels and ignore the other relationships that labels can bear in the multi-labeled case, especially label co-occurrence. In fact, relationships among labels are crucial to solving ML-CLL problems, since the selection of a complementary label for an instance in MLL is the combined result against multiple relevant labels rather than against a single relevant label. Applying a technique designed for a single relevant label to the case of multiple relevant labels results in a wrongly estimated transition matrix.
In this paper, we first theoretically analyze how the transition matrix estimated by current multi-class CLL techniques is distorted in multi-labeled cases. Based on these findings, we observe that estimating the transition matrix in ML-CLL from label correlations requires the relevant labels of instances, which are unavailable. To overcome this obstacle, we propose a two-step method that estimates the transition matrix in ML-CLL from candidate labels, i.e., the complement of the complementary labels. Our strategy is to (1) estimate an initial transition matrix by decomposing the multi-label problem into binary classification problems, and (2) use label correlations to correct the initial transition matrix by enforcing the addition of relationships among labels. The fast convergence of the cross-entropy (CE) loss comes from its focus on instances that are difficult to classify, which may cause it to overfit noisy labeled data. As a type of CE loss, the binary CE (BCE) loss has the same problem. The study in [13] indicates that the mean squared error (MSE) loss is less sensitive to noisy labels than the CE loss. Since the BCE loss is the basic loss of our approach, we further introduce an MSE-based regularizer to alleviate its tendency to overfit noise.
In addition, we show that the proposed ML-CLL method can be easily combined with learning from relevant labels, which significantly extends its application scenarios. This combination is particularly useful, e.g., when labels are collected via crowdsourcing [14] and crowdworkers are asked to randomly select a complementary label and one or more relevant labels for an instance. Experimental results on various datasets demonstrate the effectiveness of the proposed approach. In particular, when each instance is equipped with only one complementary label and one relevant label, our proposal performs well, even comparably to training on fully supervised data. Our main contributions are summarized as follows:
- We theoretically analyze the distortion of the transition matrix estimated by multi-class CLL in multi-labeled cases, which arises because multi-class CLL techniques ignore the co-existence of relevant labels. Our theoretical findings reveal that multi-labeled data is indispensable for calculating the transition matrix from label correlations.
- To solve the problem of unavailable multi-labeled data, we propose a two-step method to estimate the transition matrix from candidate labels. Moreover, we show theoretically that the proposed approach is classifier-consistent under a mild assumption.
- We introduce a practical strategy, MSE-based regularization, to alleviate the overfitting tendency of the BCE loss. Our empirical study shows that the proposal performs comparably with state-of-the-art baselines, which demonstrates the effectiveness of our approach.
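The motivation for an MSE-based regularizer can be illustrated with a small numerical comparison. The sketch below is not the regularizer proposed in this paper; the function names and the example values are illustrative assumptions. It only shows the general property that the per-label BCE penalty grows without bound when a confident prediction disagrees with a (possibly noisy) target, whereas the squared error never exceeds 1.

```python
import math

def bce(p: float, y: int) -> float:
    """Binary cross-entropy for one label; p is the predicted probability that y = 1."""
    eps = 1e-12  # guards against log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mse(p: float, y: int) -> float:
    """Squared error for one label."""
    return (p - y) ** 2

# Hypothetical case: the model is confident the label is irrelevant (p = 0.001),
# but the possibly noisy target marks it as relevant (y = 1).
noisy_bce = bce(0.001, 1)   # equals -log(0.001), about 6.9
noisy_mse = mse(0.001, 1)   # bounded above by 1
print(f"BCE = {noisy_bce:.3f}, MSE = {noisy_mse:.3f}")
```

Because a single mislabeled target can dominate a BCE objective in this way, adding a bounded squared-error term is one plausible way to soften its influence.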
The rest of this paper is organized as follows. Section 2 briefly reviews work related to ML-CLL. We formalize the ML-CLL problem in Section 3, then analyze it theoretically and describe our approach in Section 4. In Section 5, we introduce an MSE-based regularization and show how to adapt our method to exploit a small additional amount of relevant labels. The experimental results are given in Section 6, and we conclude in Section 7.
2 Related Work
In this section, we briefly review work related to ML-CLL, including MLL, partial multi-label learning (PML), and multi-class CLL.
2.1 Multi-Label Learning
MLL aims to train a classifier that predicts a set of relevant labels for an unseen instance, where each training instance is associated with multiple relevant labels simultaneously. According to the order of label correlations they exploit, previous studies can be grouped into three categories [15, 16, 17, 18]: first-order approaches [19, 20, 21], second-order approaches [22, 23], and high-order approaches [24, 25]. First-order approaches decompose an MLL problem into a set of binary classification problems [19, 20]. However, they ignore label correlations, which play a crucial role in MLL [15]. After the importance of label correlations was recognized, more and more studies attempted to exploit them to improve MLL performance. Among them, second-order approaches consider pairwise label correlations, i.e., the relationship between two labels. These approaches generally transform MLL problems into bipartite ranking problems by enforcing that relevant labels are ranked higher than irrelevant labels [26, 27, 23]. Beyond second-order relationships, more complex relationships among labels exist in many real-world scenarios. Therefore, many recent approaches exploit high-order label correlations to handle MLL problems [28, 24, 29, 30]. For example, Zhao et al. [30] leverage a variational autoencoder to facilitate learning by exploiting high-order correlations among labels, while Wang et al. and Xun et al. [31, 32] both design special neural network blocks that automatically extract label correlations to improve label prediction. Although high-order approaches model label correlations more strongly, they may suffer from a high computational cost compared to first- and second-order approaches [33].
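The bipartite ranking view used by second-order approaches can be made concrete with the per-instance ranking loss, i.e., the fraction of (relevant, irrelevant) label pairs that are ordered incorrectly. The following is a minimal sketch with NumPy; the function name, the tie-handling convention (ties count as errors), and the example values are assumptions made for illustration rather than the definition used by any cited method.

```python
import numpy as np

def ranking_loss(scores: np.ndarray, relevant: np.ndarray) -> float:
    """Fraction of (relevant, irrelevant) label pairs ordered incorrectly for one instance.

    scores:   shape (K,), real-valued label scores.
    relevant: shape (K,), binary indicator of relevant labels.
    """
    pos = scores[relevant == 1]
    neg = scores[relevant == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one relevant and one irrelevant label")
    # A pair is counted as wrong when the irrelevant label scores at least as high
    # as the relevant one (assumed tie convention).
    wrong = (pos[:, None] <= neg[None, :]).sum()
    return float(wrong) / (pos.size * neg.size)

# Hypothetical scores for K = 4 labels; labels 0 and 3 are relevant.
scores = np.array([0.9, 0.2, 0.6, 0.4])
relevant = np.array([1, 0, 0, 1])
loss = ranking_loss(scores, relevant)
print(loss)
```

In this example only the pair (label 3, label 2) is misordered out of four pairs.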
2.2 Partial Multi-Label Learning
Because fully supervised data is difficult to collect, many researchers explore weakly supervised forms of data to alleviate the heavy burden of collecting labeled data [34]. PML is a recently emerging weakly supervised setting first proposed by Xie et al. [35]. In PML, each training instance is associated with a set of candidate labels consisting of relevant labels and irrelevant (noisy) labels, and the goal is to learn a classifier that accurately assigns a set of labels to unseen instances.
At first glance, ML-CLL seems to be an extreme case of PML, so that all PML methods would also be applicable to ML-CLL. However, existing PML methods assume that noisy labels make up only a small portion of the candidate labels [36, 37, 38, 33]; accordingly, many approaches [37, 38, 33] adopt matrix factorization to tackle PML problems, decomposing the candidate label matrix into a low-rank multi-label matrix and a sparse noisy label matrix. In contrast, the ML-CLL problem studied in this paper provides only one complementary label per instance, resulting in a high-noise PML problem to which existing approaches are not applicable. We demonstrate the performance difference in the experimental part.
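A short calculation shows why a single complementary label yields a high-noise candidate set. The sketch below is illustrative only: the function name and the example label counts are assumptions, and it simply counts how many of the remaining candidate labels are irrelevant once one complementary label has been removed.

```python
def candidate_noise_fraction(num_labels: int, num_relevant: int) -> float:
    """Fraction of irrelevant labels in the candidate set when exactly one
    complementary label is removed from a label space of size num_labels."""
    if not 0 < num_relevant < num_labels:
        raise ValueError("num_relevant must be between 1 and num_labels - 1")
    candidate_size = num_labels - 1  # all labels except the complementary one
    return (candidate_size - num_relevant) / candidate_size

# Hypothetical instance: 15 labels in total, 2 of them relevant.
frac = candidate_noise_fraction(15, 2)
print(f"{frac:.3f} of the candidate labels are noise")
```

With 15 labels and 2 relevant ones, 12 of the 14 candidates are noise, which is far from the sparse-noise regime that low-rank plus sparse factorization relies on.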
2.3 Multi-Class Complementary Label Learning
Currently, the CLL problem has been considered only in multi-class learning, where the goal is to precisely predict a single relevant label per instance from complementary labeled data. Previous approaches can be roughly grouped into two categories: (1) modeling the generative relationship between the complementary label and the relevant label [6, 12, 7, 8, 39]; (2) directly modeling the probability of complementary labels from the learned discriminative classifier [10, 9, 11].
The first multi-class CLL method belongs to the first category. It models the generative relationship between complementary labels and relevant labels, and uses this generative process to rewrite the one-versus-all and pairwise comparison loss functions and derive an unbiased risk estimator [6]. Ishida et al. [7] observe that the method of [6] is restricted to particular loss functions and propose a new method that can use arbitrary losses and models. A typical way to use the modeled generative process is through a transition matrix, which summarizes the probabilities of labels being complementary labels given the relevant labels. Such approaches then apply the transition matrix to recover relevant labels from complementary labels [8, 7, 39]. Compared with [6, 7], transition matrix-based methods can model more complex generative relationships than the uniform one. Therefore, we design a transition matrix-based method to solve the ML-CLL problem, with a different estimation procedure.
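The role of a transition matrix in multi-class CLL can be sketched for the simplest case, the uniform generative relationship, in which every label other than the true one is equally likely to be chosen as the complementary label. The code below is a minimal illustration with NumPy; the function names and the example posterior are assumptions and are not taken from the cited methods.

```python
import numpy as np

def uniform_transition_matrix(num_classes: int) -> np.ndarray:
    """T[j, k] = p(complementary = k | true = j) under the uniform assumption:
    zero on the diagonal, 1 / (K - 1) elsewhere."""
    T = np.full((num_classes, num_classes), 1.0 / (num_classes - 1))
    np.fill_diagonal(T, 0.0)
    return T

def complementary_posterior(class_posterior: np.ndarray, T: np.ndarray) -> np.ndarray:
    """p(complementary = k | x) = sum_j T[j, k] * p(true = j | x)."""
    return class_posterior @ T

T = uniform_transition_matrix(4)
p_true = np.array([0.7, 0.1, 0.1, 0.1])   # hypothetical class posterior for one instance
p_comp = complementary_posterior(p_true, T)
print(p_comp)
```

The most likely true class receives the smallest complementary probability, which is what allows a model trained to fit complementary labels through the matrix to recover the relevant label.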
Unlike the first category, approaches in the second category directly model the probabilities of complementary labels from the learned classifier, without the generative relationship [9, 10, 11]. Chou et al. propose a surrogate complementary loss framework in which complementary labels provide negative feedback during training [9]. Although its losses do not yield an unbiased risk estimator, it achieves good performance on multi-class CLL. Exploiting the property that the predictive probability of the complementary label is expected to approach zero, [10] and [11] propose discriminative solutions that directly model the probabilities of complementary labels from the learned classifier and avoid the generative assumption. Because multi-class CLL approaches are designed for the case of a single relevant label, they are not suitable for ML-CLL, where an instance is associated with multiple labels simultaneously. We demonstrate this in the experimental part.
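The idea that the predicted probability of the complementary label should approach zero can be expressed as a simple negative-feedback loss. The sketch below uses the negative log of one minus the predicted probability of the complementary label; this is one simple loss with the stated property, chosen for illustration, and is not claimed to be the exact loss of [9], [10], or [11]. The function name and the example probabilities are assumptions.

```python
import math

def complementary_nll(probs: list[float], comp_label: int) -> float:
    """Penalize probability mass placed on the complementary label:
    loss = -log(1 - p[comp_label]), which is zero when that probability is zero."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(1.0 - probs[comp_label], eps))

# Hypothetical softmax outputs for a 3-class problem; label 0 is the complementary label.
low = complementary_nll([0.05, 0.60, 0.35], comp_label=0)   # little mass on label 0
high = complementary_nll([0.80, 0.15, 0.05], comp_label=0)  # most mass on label 0
print(f"{low:.3f} < {high:.3f}")
```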
3 Problem Setup
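In MLL, let $\mathcal{X}$ be the feature space and $\mathcal{Y} = \{1, 2, \ldots, K\}$ be the finite label space with $K$ possible class labels. A multi-label instance $x \in \mathcal{X}$ is equipped with a set of relevant labels $Y \subset \mathcal{Y}$, and $(x, Y)$ is independently sampled from an unknown joint probability distribution $p(x, Y)$. Here we exclude the special cases $Y = \emptyset$ and $Y = \mathcal{Y}$ to ensure that relevant labels and complementary labels both exist. For convenience, we use a binary vector $\mathbf{y} = [y_1, \ldots, y_K] \in \{0, 1\}^K$ to denote $Y$, where $y_j = 1$ indicates that label $j$ is relevant to $x$ and $y_j = 0$ otherwise. Suppose $D = \{(x_i, Y_i)\}_{i=1}^{n}$ is the training set with $n$ instances. The goal of MLL is to learn a multi-label classifier $h: \mathcal{X} \to 2^{\mathcal{Y}}$ that can predict a set of relevant labels for any unseen instance. Instead of learning $h$ directly, most MLL methods learn a real-valued decision function $f: \mathcal{X} \to \mathbb{R}^K$ by minimizing the expected risk

$$R(f) = \mathbb{E}_{(x, Y) \sim p(x, Y)}\left[\mathcal{L}(f(x), \mathbf{y})\right], \tag{1}$$

where $\mathcal{L}$ is a proper MLL loss function [30], such as the BCE loss. $f(x)$ is usually interpreted as a probability vector: $f_j(x)$ is the $j$-th entry of $f(x)$ and gives the confidence score that label $j$ is relevant to $x$, i.e., if properly normalized, $f_j(x) = p(y_j = 1 \mid x)$. Since $p(x, Y)$ is unknown, the expected risk is usually approximated by the empirical risk $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(f(x_i), \mathbf{y}_i)$. We denote the optimal classifier under the expected risk by $f^{*} = \arg\min_{f} R(f)$ and the classifier learned by minimizing the empirical risk by $\hat{f} = \arg\min_{f} \hat{R}(f)$.
In the ML-CLL setting studied in this paper, each training instance is equipped with a single complementary label. The complementary labeled instance $(x, \bar{y})$ is drawn from an unknown joint probability distribution $p(x, \bar{y})$, where $\bar{y} \in \mathcal{Y}$ is a complementary label of $x$. $\bar{y}$ can be represented as a $K$-dimensional vector $\bar{\mathbf{y}}$. If label $k$ is selected as the complementary label of $x$ ($\bar{y} = k$), then the $k$-th element of $\bar{\mathbf{y}}$ is one and all other elements are zero. We use $S = \mathcal{Y} \setminus \{\bar{y}\}$ to denote the candidate label set of $x$. Let the $K$-dimensional vector $\mathbf{s}$ be the vector representation of the subset $S$, where all elements are one except the one corresponding to the complementary label, which is zero ($\mathbf{s} = \mathbf{1} - \bar{\mathbf{y}}$).
Let $\bar{D} = \{(x_i, \bar{y}_i)\}_{i=1}^{n}$ be the ML-CLL training set with $n$ instances. The expected risk of ML-CLL is defined over $p(x, \bar{y})$:

$$\bar{R}(f) = \mathbb{E}_{(x, \bar{y}) \sim p(x, \bar{y})}\left[\bar{\mathcal{L}}(f(x), \bar{y})\right], \tag{2}$$

where $\bar{\mathcal{L}}$ denotes an ML-CLL loss, which will be proposed later in this paper. Similarly, the corresponding empirical risk is $\hat{\bar{R}}(f) = \frac{1}{n} \sum_{i=1}^{n} \bar{\mathcal{L}}(f(x_i), \bar{y}_i)$.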
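The two vector encodings described in this section, the one-hot complementary-label vector and the candidate-label vector that is one everywhere except at the complementary label, can be sketched as follows. The function name and the 0-based label indexing are assumptions made for this illustration.

```python
import numpy as np

def encode_complementary(comp_label: int, num_labels: int) -> tuple[np.ndarray, np.ndarray]:
    """Return (complementary one-hot vector, candidate-label vector).

    Labels are indexed from 0 here (an assumption of this sketch).
    """
    if not 0 <= comp_label < num_labels:
        raise ValueError("comp_label out of range")
    comp_vec = np.zeros(num_labels, dtype=int)
    comp_vec[comp_label] = 1      # one only at the complementary label
    cand_vec = 1 - comp_vec       # candidates: every label except the complementary one
    return comp_vec, cand_vec

comp_vec, cand_vec = encode_complementary(comp_label=2, num_labels=5)
print(comp_vec, cand_vec)
```

The candidate vector is the form in which ML-CLL data resembles PML data: the relevant labels are hidden among the ones.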
4 The Proposed Approach
In this section, we first define the transition matrix in MLL and analyze why the transition matrix estimated with multi-class techniques is unsuitable for ML-CLL. Then, we describe a two-step procedure for estimating the transition matrix in the MLL case. Finally, we prove that our approach is classifier-consistent under a mild assumption.
4.1 Transition Matrix for ML-CLL
In ML-CLL, we start by introducing a transition matrix that summarizes the probabilities for a complementary label given a set of relevant labels. More specifically, the transition matrix is defined as where () is the -th label subset. If , then because the label has no chance to be selected as the complementary label. In this paper, we employ the same class-dependent assumption as the multi-class CLL approach [8]: as and are conditionally independent given . Then we can obtain the following equation:
[TABLE]
where we assume the label is a complementary label of . Then, according to Eq.(3), can be approximated by when the transition matrix is known. If considering all possible label subsets of as , we have , i.e., the size of depends on the size of the power set of . Practically, the power set of would be computationally prohibitive and even impossible to store, since is an extremely large number when the number of possible labels is large. To solve this combinatorial explosion problem, we explore a more practical way to use an alternative lower-dimensional transition matrix to replace the higher-dimensional one. We start investigating the feasibility of the alternative lower-dimensional matrix from Theorem 1.
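To make the combinatorial explosion concrete, the following sketch compares the number of entries in a transition matrix with one row per label subset against one with one row per single label. It assumes the high-dimensional matrix has one row for each of the 2^K subsets and K columns; the exact shape used in the paper may differ (for example, by excluding the empty set). The label counts 6, 14 and 15 are taken from the dataset table.

```python
def transition_matrix_entries(num_labels):
    """Entry counts for the subset-indexed and label-indexed matrices.

    Assumption: the subset-indexed matrix has 2**K rows (one per label
    subset) and K columns; the label-indexed one is K x K.
    """
    full = (2 ** num_labels) * num_labels
    reduced = num_labels * num_labels
    return full, reduced


for k in (6, 14, 15):
    full, reduced = transition_matrix_entries(k)
    print(f"K={k}: subset-indexed={full}, label-indexed={reduced}")
```

Even at 15 labels the subset-indexed matrix is already about three orders of magnitude larger than the label-indexed one, and the gap doubles with every additional label.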
Theorem 1.
Given an instance , suppose is the relevant label set and the label is the complementary label which is randomly selected. Then the following inequality holds:
[TABLE]
The second inequality holds because of the addition rule of probability. The detailed proof is in Appendix A. Theorem 1 shows that using to approximate is a lower bound of using to approximate . From Eq.(3), we observe that our main goal transforms from precisely predicting the relevant label set of to precisely predicting its complementary label via the transition matrix . This means that we need to maximize the predictive probability of the complementary label of , i.e., maximizing . From this point of view, Theorem 1 theoretically shows the feasibility of using a low-dimensional transition matrix to replace the high-dimensional , because we optimize by maximizing the lower bound of Eq.(3). Let denote the lower-dimensional transition matrix, where the -th element of is , and when . Thus, we adopt the matrix as the transition matrix in the remainder of the paper to avoid the computation and storage burden imposed by the matrix .
4.2 Distortion in Estimating the Transition Matrix
Before exploring how the transition matrix estimated by multi-class CLL is distorted from that of ML-CLL, we first introduce the transition matrix estimated by multi-class CLL techniques. Let be the transition matrix estimated in multi-class CLL. Recall that the approach of [8] estimates the transition matrix under a special assumption: for each label , there exists an anchor set such that and (). With this assumption and regardless of label correlations, the estimation of is iff is sampled from , where is the -th row and -th column element of .
To measure the distortion between calculated in ML-CLL and the estimated , we define their difference on the complementary label of as follows:
[TABLE]
A larger value of indicates that deviates further from . Label correlations and co-occurring multiple labels are key properties of MLL. Because the correlations among labels are intricate, directly calculating from all label correlations incurs a high computational cost. For convenience, we consider a simple case of MLL with label correlations (at most two labels can co-occur for an instance, and the rest of the labels are mutually exclusive) to facilitate calculating from label correlations and to explore the distortion between and . We begin with the definition of mutually exclusive labels.
Definition 2.
For any , if only one label is relevant to , i.e. , then the labels are mutually exclusive.
Under the simple case in MLL, in Theorem 3, we state how to estimate directly from label correlations, and the distortion of and .
Theorem 3.
Under an MLL scenario: suppose the labels () are dependent, and the labels belonging to are mutually exclusive. For any , its label set and . Let the label () be the complementary label of . and calculated from label correlations satisfy
[TABLE]
where denotes the integer set . The difference of and on the complementary label is
[TABLE]
where .
The proof is provided in Appendix B. From Theorem 3, we can see that calculating the transition matrix from label correlations is more complex than estimating one without label correlations, and that the relevant label sets of instances need to be known. Moreover, Theorem 3 shows that there is a distortion between and , which widely exists in multi-labeled cases since each multi-label instance is relevant to multiple labels. The above learning scenario only considers pairwise label correlations, while more complex relationships among labels exist in practice. Similarly, at a manageable computational cost, we construct another simple MLL scenario with more complex label relationships to explore the factors that affect in Corollary 4.
Corollary 4.
Under an MLL scenario: there are () labels that are dependent, while the labels belonging to are mutually exclusive. For any , its relevant set and . Suppose the label is the complementary label of . The difference between and satisfies
[TABLE]
where .
The proof is shown in Appendix C. According to Corollary 4, when label correlations are more complex, the distortion of the transition matrix estimated by the multi-class CLL approach is more serious as increases. Meanwhile, it demonstrates that the ML-CLL problem cannot be solved by current techniques in multi-class CLL.
4.3 Estimation with Label Correlations
As discussed above, calculating the transition matrix from label correlations needs instances whose relevant label sets are known. Moreover, as the results of in Theorem 3 and Corollary 4 show, calculating becomes increasingly difficult as the relationships among labels become more complex. Because multi-labeled data are unavailable in our setting, we propose a two-step method to estimate from candidate labels, which reduces the complexity of calculating from label correlations. This two-step method includes: (1) computing an initial transition matrix from candidate labels by decomposing the multi-label problem into a series of binary classification problems; (2) obtaining the final estimation of by using label correlations to correct .
Computing an initial transition matrix . Let be an initial transition probability, which is the -th element of . We calculate from the candidate labels of instances. The multiplication theorem of probability is applied to calculate and ensure that the following equation holds:
[TABLE]
where and . In practice, can be approximated by the expectation of over the conditional distribution . Assuming that and are conditionally independent given , we have . Intuitively, can be approximated by the classifier learned from to predict the probability of complementary labels. Let denote the subset of in with , which satisfies the conditional distribution . Thus, can be estimated by
[TABLE]
Estimating with label correlations. The procedure for calculating lacks exactly supervised data. From the transition probabilities of calculated from label correlations in Subsection 4.2, we can see that they are affected by label correlations. Moreover, from the viewpoint of label correlations, a label that rarely co-occurs with the relevant labels is more likely to be selected as the complementary label. For example, consider water as the relevant label; in this case, desert (a rarely co-occurring label) has a larger chance of being selected as the complementary label than fish (a frequently co-occurring label). Motivated by these findings, we use label correlations to correct the initial matrix to estimate by incorporating the relationships among labels.
Let be a label correlation matrix, where the element represents the correlation between labels and . The value of is larger when the correlation of labels and is stronger. Following [35, 40], we adopt the co-occurrence rate of two candidate labels as their correlation. Finally, the transition matrix can be estimated by , where if , followed by row normalization.
Fig. 2 shows an example of the refining procedure. As can be seen from Fig. 2, though the estimated initial probability of is higher than in , the value of is lower than in . This is because the labels and have a strong correlation as shown in , so the label has a lower chance of being selected as the complementary label for the label . The corrected transition matrix agrees with our expectation that rarely co-occurring labels tend to be selected as complementary labels preferentially. In practice, the estimation of depends on , where the classifier should perfectly model the probability of complementary labels. When the data equipped with complementary labels are sufficient, the perfect model is capable of modeling .
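The two-step estimation can be sketched as follows. The exact formulas are not reproduced here, so this is an illustration under stated assumptions rather than the authors' procedure: step 1 averages a classifier's predicted complementary-label probabilities over the instances whose candidate set contains a given label, and step 2 applies a hypothetical multiplicative correction `q_init * (1 - correlation)`, zeroes the diagonal and normalizes each row. All function names are illustrative.

```python
def initial_transition(probs, candidates):
    """Step 1 (assumed form): row k averages the predicted complementary
    probabilities over instances whose candidate set contains label k."""
    k = len(candidates[0])
    q = []
    for row_label in range(k):
        members = [p for p, c in zip(probs, candidates) if c[row_label] == 1]
        if members:
            q.append([sum(p[j] for p in members) / len(members) for j in range(k)])
        else:
            q.append([0.0] * k)
    return q


def cooccurrence(candidates):
    """Co-occurrence rate of every pair of candidate labels."""
    n, k = len(candidates), len(candidates[0])
    return [[sum(c[a] * c[b] for c in candidates) / n for b in range(k)]
            for a in range(k)]


def correct(q_init, corr):
    """Step 2 (hypothetical rule): down-weight strongly correlated pairs,
    force a zero diagonal, then normalize each row to sum to one."""
    k = len(q_init)
    out = []
    for a in range(k):
        row = [0.0 if a == b else q_init[a][b] * (1.0 - corr[a][b])
               for b in range(k)]
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else row)
    return out


# Toy data: three instances with complementary labels 0, 1 and 2.
candidates = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
q_hat = correct(initial_transition(probs, candidates), cooccurrence(candidates))
```

The zero diagonal reflects the constraint stated earlier that a relevant label cannot be its own complementary label.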
4.4 A Classifier-Consistent Approach
According to the transition matrix , we can derive the probability of complementary labels from the multi-label classifier. Let be a complementary label classifier, which is defined as
[TABLE]
where is applied to approximate , refers to the -th element of . ML-CLL problems aim to recover a set of relevant labels per instance from a complementary label. Since training instances are associated with complementary labels, the common loss functions of MLL are unsuitable for ML-CLL. Therefore, we define a complementary loss function as
[TABLE]
Let denote the minimizer of ; the minimizer of is used to approximate . Recall the definition of classifier consistency: if a classifier learned by an approach converges to the optimal classifier learned in MLL as the number of instances increases, then this approach is classifier-consistent [41, 42, 43]. We show that our proposal is classifier-consistent under a mild assumption:
Assumption 5.
Suppose the transition matrix is invertible and can perfectly recover the relationship between relevant labels of and its complementary label. Then, we have .
With Assumption 5, our approach trained on can be inferred to be classifier-consistent, which is stated in Theorem 6. Naturally, Theorem 6 guarantees that the optimal classifier learned from complementary labeled data converges to the optimal one learned from fully supervised MLL.
Theorem 6.
With Assumption 5, suppose the transition matrix is invertible, then the ML-CLL optimal classifier converges to the MLL optimal classifier , i.e., .
The proof is presented in Appendix D. Since BCE loss is a popular loss function in MLL, we adopt BCE loss as the base loss in this paper; then is expressed as
[TABLE]
where denotes a -dimensional vector with 1 for all elements.
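A minimal sketch of a forward-corrected BCE loss is given below. It assumes the complementary-label classifier is obtained by multiplying the transposed transition matrix with the classifier output, and that BCE is taken against the one-hot complementary vector; since the display equations are not reproduced here, the precise form in the paper may differ. Clipping with `eps` is an implementation assumption to keep the logarithms finite.

```python
import math


def complementary_scores(f_x, q):
    """Assumed form: score for complementary label j is sum_i q[i][j] * f_x[i]."""
    k = len(f_x)
    return [sum(q[i][j] * f_x[i] for i in range(k)) for j in range(k)]


def complementary_bce(f_x, q, comp_vec, eps=1e-7):
    """BCE between the complementary scores and the one-hot complementary vector."""
    loss = 0.0
    for g_j, y_j in zip(complementary_scores(f_x, q), comp_vec):
        g_j = min(max(g_j, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= y_j * math.log(g_j) + (1 - y_j) * math.log(1.0 - g_j)
    return loss


# Uniform transition matrix over three labels with a zero diagonal.
q = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
f_x = [0.9, 0.1, 0.0]  # the classifier believes label 0 is relevant

g = complementary_scores(f_x, q)
loss_plausible = complementary_bce(f_x, q, [0, 0, 1])    # label 2 given as complementary
loss_implausible = complementary_bce(f_x, q, [1, 0, 0])  # label 0 given as complementary
```

The loss is smaller when the observed complementary label is one the classifier already considers irrelevant, which is the behavior the correction is meant to encourage.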
5 Regularization-Based Enhancement
In this section, we describe an MSE-based regularization of our approach. We also explore combining a small number of relevant labels with complementary labels to extend the applicability of our proposal.
5.1 An MSE-Based Regularization
Previous works indicate that CE loss makes the model focus on hard instances that are difficult to classify precisely, while MSE loss and Mean Absolute Error (MAE) loss are less sensitive to hard instances since they treat every instance equally [44, 13]. Owing to this property, the convergence rate of CE loss is superior to that of MSE loss and MAE loss, but the same property makes CE loss more prone to overfitting than MSE loss and MAE loss when noisy labels are present in the training data [44, 13]. Ideally, an approach should converge quickly during training and show good generalization ability and robustness on unseen instances [11].
BCE loss has a similar property to CE loss, which results in an excellent convergence rate. Meanwhile, approaches based on BCE loss easily suffer from overfitting when learning from noisy labeled data. Since ML-CLL is a problem setting with dense noisy labels, BCE loss may cause a model to overfit in ML-CLL. To cope with this problem, we introduce a regularizer based on MSE loss (i.e. -norm regularization) to balance the robustness and convergence requirements of the proposed approach. The MSE-based regularizer is defined as:
[TABLE]
Finally, we combine the complementary loss and the MSE-based regularizer term, which leads to our target loss:
[TABLE]
where is the trade-off parameter and is set to 1 (the selection is shown in Section 6). The whole procedure of the proposed approach (called MLCL) is shown in Algorithm 1.
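The combination of the complementary loss and the regularizer can be sketched as below. The regularizer is assumed here to be the mean squared error between the complementary scores and the one-hot complementary vector; the paper's exact definition is not reproduced in this text, so treat the form as an assumption. The function names and the `lam` parameter are illustrative, with `lam=1.0` matching the value reported in the text.

```python
import math


def bce(scores, target, eps=1e-7):
    """Binary cross-entropy summed over labels, with clipping for stability."""
    loss = 0.0
    for s, y in zip(scores, target):
        s = min(max(s, eps), 1.0 - eps)
        loss -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return loss


def mse(scores, target):
    """Mean squared error between scores and target (assumed regularizer form)."""
    return sum((s - y) ** 2 for s, y in zip(scores, target)) / len(target)


def target_loss(scores, target, lam=1.0):
    """Complementary BCE loss plus lam times the MSE-based regularizer."""
    return bce(scores, target) + lam * mse(scores, target)


comp_scores = [0.05, 0.45, 0.5]  # complementary-label scores for one instance
comp_vec = [0, 0, 1]             # label 2 is the complementary label

plain = target_loss(comp_scores, comp_vec, lam=0.0)
regularized = target_loss(comp_scores, comp_vec, lam=1.0)
```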
5.2 Incorporation of Relevant Labels
In many practical situations, both complementary labels and relevant labels can be used to learn more accurate classifiers. To this end, motivated by [6, 45], we design a combination of the losses derived from complementary labeled data and relevant labeled data:
[TABLE]
where denotes a binary vector of relevant labels of , in which when the label . For greater practicality, we do not require the given relevant labels to be equal to the full set of relevant labels , which means and .
As explained in the introduction, we can naturally collect data associated with complementary labels and relevant labels via crowdsourcing [14]. Our loss function Eq.(12) can leverage both kinds of labeled data to learn better classifiers. We experimentally show the usefulness of this combination method in Section 6.
6 Experiments
In this section, we evaluate the effectiveness of MLCL using five common MLL criteria: ranking loss, hamming loss, one error, coverage and average precision. For the first four criteria, smaller values indicate better performance, while for average precision, larger values indicate better performance. The label set of is predicted by . All experiments are implemented in PyTorch [46] and run on an NVIDIA TESLA K80 GPU. The code will be released after this paper has been accepted.
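For reference, four of the five criteria can be sketched with their standard definitions from the MLL literature. The 0.5 decision threshold used for hamming loss is an assumption, since the prediction rule is not reproduced in this text, and the sketch assumes every instance has at least one relevant and one irrelevant label. Average precision is omitted for brevity.

```python
def hamming_loss(scores, truth, threshold=0.5):
    """Fraction of instance-label slots that are predicted incorrectly."""
    errors = total = 0
    for s_row, t_row in zip(scores, truth):
        for s, t in zip(s_row, t_row):
            errors += int((s >= threshold) != bool(t))
            total += 1
    return errors / total


def one_error(scores, truth):
    """Fraction of instances whose top-ranked label is not relevant."""
    misses = 0
    for s_row, t_row in zip(scores, truth):
        top = max(range(len(s_row)), key=lambda j: s_row[j])
        misses += int(t_row[top] == 0)
    return misses / len(scores)


def coverage(scores, truth):
    """Average number of extra ranking steps needed to cover all relevant labels."""
    total = 0
    for s_row, t_row in zip(scores, truth):
        order = sorted(range(len(s_row)), key=lambda j: -s_row[j])
        rank = {label: pos + 1 for pos, label in enumerate(order)}
        total += max(rank[j] for j, t in enumerate(t_row) if t) - 1
    return total / len(scores)


def ranking_loss(scores, truth):
    """Average fraction of (relevant, irrelevant) pairs that are mis-ordered."""
    total = 0.0
    for s_row, t_row in zip(scores, truth):
        rel = [j for j, t in enumerate(t_row) if t]
        irr = [j for j, t in enumerate(t_row) if not t]
        bad = sum(1 for a in rel for b in irr if s_row[a] <= s_row[b])
        total += bad / (len(rel) * len(irr))
    return total / len(scores)


scores = [[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]]
truth = [[1, 0, 1], [0, 0, 1]]
results = {
    "hamming_loss": hamming_loss(scores, truth),
    "one_error": one_error(scores, truth),
    "coverage": coverage(scores, truth),
    "ranking_loss": ranking_loss(scores, truth),
}
```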
6.1 Experimental Settings
Datasets. We use eight widely-used MLL datasets in our experiments, namely corel5k, corel16k, delicious, eurlex_dc, eurlex_sm, yeast, bookmarks and scene (publicly available at http://mulan.sourceforge.net/datasets). Following [35, 36], we adopt the same pre-processing for the datasets. More specifically, for datasets with more than 15 class labels, rare class labels are filtered out so that at most 15 class labels are kept. Accordingly, instances that are relevant to the removed class labels are filtered out as well. Detailed characteristics of these datasets are shown in Table I.
Base models. The linear model is used as the base model.
Baselines. Two typical MLL approaches, ML-KNN [21] and LIFT [47], are used as baselines; they handle ML-CLL by regarding all labels in the candidate label set as relevant labels for a training instance. Similarly, three recent PML approaches are employed as comparing approaches, including PML-lc [35], fpml [38] and PML-LRS [37], which learn from training instances associated with candidate labels. In addition, we employ a multi-class CLL approach, called L-UW [10], as a baseline, in which BCE loss and a sigmoid output layer replace CE loss and a softmax output layer, respectively, to make L-UW suitable for multi-labeled data.
6.2 Comparison on Uniform Complementary Labels
Setup. Weight-decay is set as and learning rate is selected from for all data sets. We employ the Adam [48] optimization method, and set the batch size and the number of epochs to 256 and 200, respectively. L-UW applies the same model and hyper-parameters as ours. Here, we estimate with a linear model. We use ten-fold cross-validation to evaluate experiments, where training data is associated with complementary labels that are generated by randomly selecting one of the possible labels other than the relevant labels (uniform complementary labels), and test data is equipped with the set of relevant labels. The mean metric value and standard deviation (std) are reported as the final experimental results for all approaches.
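The generation of uniform complementary labels described above can be sketched as follows. The function name, the 0-based label indices and the use of a seeded `random.Random` instance are illustrative assumptions.

```python
import random


def sample_uniform_complementary(relevant, num_labels, rng):
    """Pick one label uniformly at random among the labels that are not relevant."""
    irrelevant = [j for j in range(num_labels) if j not in relevant]
    if not irrelevant:
        raise ValueError("every label is relevant; no complementary label exists")
    return rng.choice(irrelevant)


rng = random.Random(0)   # seeded for reproducibility
relevant_set = {1, 4}    # hypothetical relevant labels of one instance
draws = [sample_uniform_complementary(relevant_set, 6, rng) for _ in range(100)]
```

Every draw is guaranteed to fall outside the relevant set, so the resulting label is a valid complementary label by construction.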
Results. Table II reports the experimental results of the various approaches on eight data sets equipped with uniform complementary labels. indicates the larger/smaller the value, the better the performance.
According to the results in Table II, MLCL achieves superior or comparable performance against the baselines across the data sets on the five criteria. Our approach achieves the best performance in most cases. Specifically, the proposed approach outperforms LIFT on all eight datasets across all metrics. This is because our approach is better than fully supervised MLL algorithms at handling training data in which the candidate labels contain relevant and irrelevant labels simultaneously. Furthermore, the results of PML-lc and PML-LRS are inferior to ours in most cases, which demonstrates that PML approaches are less effective than our approach in cases of dense noisy labels. Similarly, based on the results of L-UW shown in Table II, our approach outperforms L-UW on almost all datasets and metrics, except for ranking loss and coverage on the delicious dataset. This indicates that label correlations are important for solving ML-CLL problems: the proposed approach, which takes label correlations into account, surpasses L-UW, which ignores them.
6.3 Comparison on Biased Complementary Labels
Setup. To evaluate the effectiveness of our approach in different situations, we use training data with biased complementary labels that are generated via the co-occurrence rate of relevant labels. Specifically, we select a complementary label of an instance from , and the selection rule is as follows: a class label with a lower co-occurrence rate has a higher probability of being selected as a complementary label. We train the model on data with biased complementary labels, while test data is equipped with relevant label sets to evaluate the effectiveness of our approach. The other experimental settings are the same as in Subsection 6.2.
Results. The mean and std of the results on test data are shown in Table III. From Table III, we make the following observations: (1) MLCL achieves superior or comparable performance to LIFT, fpml, PML-lc, PML-LRS and L-UW on different data sets, which shows that the proposed approach can predict the set of proper labels for unseen instances from complementary labeled data; (2) although MLCL fails to achieve the best result on the scene dataset, our approach is better than the other baselines on the rest of the datasets, which indicates that our approach deals with ML-CLL problems more effectively than the others. These observations demonstrate that the proposed method is effective for data with both uniform and biased complementary labels.
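The biased selection rule can be sketched as below. The text only states that labels with lower co-occurrence rates are more likely to be chosen; the specific weighting used here, one minus the mean co-occurrence rate with the relevant labels, is a hypothetical choice for illustration, as are the function names, the `floor` parameter and the toy co-occurrence matrix (labels 0, 1, 2 standing for water, fish, desert). The relevant set is assumed to be non-empty.

```python
import random


def biased_weights(relevant, cooc, num_labels, floor=1e-6):
    """Hypothetical weights: 1 - mean co-occurrence rate with the relevant labels."""
    weights = {}
    for j in range(num_labels):
        if j in relevant:
            continue  # a relevant label can never be complementary
        avg = sum(cooc[r][j] for r in relevant) / len(relevant)
        weights[j] = max(1.0 - avg, floor)  # floor keeps the total weight positive
    return weights


def sample_biased_complementary(relevant, cooc, num_labels, rng):
    """Draw one complementary label with probability proportional to its weight."""
    weights = biased_weights(relevant, cooc, num_labels)
    labels = list(weights)
    return rng.choices(labels, weights=[weights[j] for j in labels], k=1)[0]


# Toy co-occurrence rates: water-fish is high, water-desert is low.
cooc = [[1.0, 0.6, 0.05],
        [0.6, 1.0, 0.0],
        [0.05, 0.0, 1.0]]
rng = random.Random(0)
weights = biased_weights({0}, cooc, 3)
draws = [sample_biased_complementary({0}, cooc, 3, rng) for _ in range(200)]
```

With water as the relevant label, desert receives a larger weight than fish, which matches the intuition given in Subsection 4.3.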
6.4 Additional Experiments
Ablation experiments. We then explore the effect of different learning components on MLCL performance. Table IV summarizes the results of MLCL without each component, trained on data with uniform complementary labels. In Table IV, without refers to MLCL directly using the estimated initial transition matrix for training, and without indicates that MLCL is optimized using only Eq.(9).
From the results reported in Table IV, the performance of MLCL surpasses that of the variants without individual components in most cases, which shows that both components, using label correlations to correct and the MSE-based regularizer, help improve the performance of our approach. In particular, estimating based on label correlations improves the performance of the proposed approach significantly compared with the variant without in most cases. Similarly, the MSE-based regularizer brings significant benefits to our approach, which demonstrates that it balances the robustness and convergence rate of BCE loss. These results indicate that using label correlations to estimate the transition matrix and adding an MSE-based regularizer are effective strategies for ML-CLL problems.
Trade-off parameter . Table V reports the performance of MLCL with varying values of , which trades off the complementary loss function and the MSE-based regularization . Here, average precision is used as the criterion, and the training data has uniform complementary labels. is selected from the candidate value list . We can observe that the best results on most datasets are achieved at and that the performance drops when takes a smaller value. In general, a relatively large usually leads to better performance than a small value. Therefore, we set for MLCL.
6.5 Combination of Complementary Labels and Relevant Labels
Setup. Finally, we demonstrate the effectiveness of combining relevant labeled data and complementary labeled data. The training data is associated with uniform complementary labels and relevant labels simultaneously. More specifically, an instance is associated with a complementary label and relevant labels , where is uniformly selected and is randomly selected from the relevant label set of (i.e., ). Here, we set , which means each instance is associated with only one complementary label and one relevant label. The other experimental settings are the same as in Subsection 6.2.
Results. We compare three methods: (1) the “Fully supervised” method trains the linear model on fully supervised data, i.e., fully supervised MLL; (2) the “CL” method refers to MLCL trained on the uniform complementary-label data; (3) the combination (“CL & RL”) method trains the linear model with the loss function Eq.(12), where the training data is equipped with both complementary labels and relevant labels. Table VI reports the experimental results on five criteria. We can see that the “CL & RL” method is much superior to the “CL” method on all datasets over hamming loss, ranking loss, one error, coverage and average precision; for example, the “CL & RL” method outperforms the “CL” method by a large margin on average precision (+0.436 on eurlex_dc and +0.425 on eurlex_sm). This demonstrates that ML-CLL can be easily applied to fully supervised MLL scenarios, MLL with missing labels [49, 50] or other MLL scenarios. Moreover, the “CL & RL” method achieves performance comparable to the “Fully supervised” method, which illustrates that ML-CLL can obtain excellent results with only a small amount of additional information. This is useful for real-world applications, because ML-CLL can obtain good performance from less expensive labeled data.
7 Conclusion
In this paper, we theoretically analyze why the transition matrix estimated in multi-class CLL is distorted in ML-CLL. To avoid directly calculating the transition matrix from complex label correlations when multi-labeled data are unavailable, we propose a two-step method to estimate the transition matrix in ML-CLL, which uses label correlations to correct an initial transition matrix. Furthermore, we theoretically show that the proposed approach is classifier-consistent. Additionally, because MSE loss is robust to noise, an MSE-based regularizer is introduced to alleviate the tendency of the fast-converging BCE loss to overfit to noise. Finally, we show that our proposed ML-CLL can be easily combined with relevant labels and that the proposed method can achieve performance comparable to fully supervised MLL with a small amount of additional information.
Appendix A The Proof of Theorem 1
Theorem 1. Given an instance , suppose is the relevant label set and the label is the complementary label which is randomly selected. Then the following inequality holds:
[TABLE]
Proof.
First, we introduce the addition rule of probability: , so we have . We now prove the above inequality. According to the assumption: , we have
[TABLE]
According to the addition rule of probability, we have
[TABLE]
∎
Appendix B The Proof of Theorem 3
Theorem 3. Under an MLL scenario: suppose the labels () are dependent, and the labels belonging to are mutually exclusive. For any , its label set and . Let the label () be the complementary label of . and calculated from label correlations satisfy
[TABLE]
where denotes the integer set . The difference of and on the complementary label is
[TABLE]
where .
Proof.
We start calculating the difference from estimating the transition probabilities and . According to Definition 2 and the description of Theorem 3, we have
[TABLE]
Based on the assumption that and are conditionally independent given , we have
[TABLE]
Since and do not hold according to the definition of the transition matrix, and then we can obtain
[TABLE]
[TABLE]
Similarly, we can get
[TABLE]
Next, we calculate the difference . The remaining elements of are the same as those estimated by multi-class CLL. According to the definition of , we have
[TABLE]
Because and , is defined as , the above inequality holds. ∎
Appendix C The Proof of Corollary 4
Corollary 4. Under an MLL scenario: there are () labels that are dependent, while the labels belonging to are mutually exclusive. For any , its relevant set and . Suppose the label is the complementary label of . The difference between and satisfies
[TABLE]
where .
Proof.
Here, we apply induction to get the difference as increases. We start by computing the difference in the case of . Suppose class labels are dependent, while the rest of labels in the label space are mutually exclusive. is associated with and . Then we calculate transition probabilities in from label correlations according to Theorem 3 as:
[TABLE]
[TABLE]
and use the same way to estimate. Due to , let , we can obtain
[TABLE]
Similarly, we can compute . Then the difference is
[TABLE]
Similarly, for any , suppose class labels are strongly dependent, while the rest of labels in the label space are mutually exclusive. is associated with and . Then we calculate transition probabilities from label correlations:
[TABLE]
[TABLE]
As discussed above, since . By the same calculation, we can obtain . By induction, we can summarize the difference . ∎
Appendix D The Proof of Theorem 6
Theorem 6. With Assumption 5, suppose the transition matrix is invertible, then the ML-CLL optimal classifier converges to the MLL optimal classifier , i.e., .
Proof.
We prove is also the optimal classifier for ML-CLL via substituting into the ML-CLL risk:
[TABLE]
According to the proof of [8], . So we find the optimal ensuring when the transition matrix is invertible and Assumption 5 is satisfied. ∎
- [1] M.-L. Zhang and Z.-H. Zhou, “A review on multi-label learning algorithms,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, 2014.
- [2] M.-L. Zhang and L. Wu, “Lift: Multi-label learning with label-specific features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 107–120, 2015.
- [3] T. N. Rubin, A. Chambers, P. Smyth, and M. Steyvers, “Statistical topic models for multi-label document classification,” Mach. Learn., vol. 88, no. 1-2, pp. 157–208, 2012.
- [4] P.-J. Tang, M. Jiang, B. N. Xia, J. W. Pitera, J. Welser, and N. V. Chawla, “Multi-label patent categorization with non-local attention-based graph convolutional network,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020, pp. 9024–9031.
- [5] A. Lambrecht and C. Tucker, “When does retargeting work? information specificity in online advertising,” Journal of Marketing Research, vol. 50, no. 5, pp. 561–576, 2013.
- [6] T. Ishida, G. Niu, W.-H. Hu, and M. Sugiyama, “Learning from complementary labels,” in Advances in Neural Information Processing Systems 30, Long Beach, CA, 2017, pp. 5639–5649.
- [7] T. Ishida, G. Niu, A. K. Menon, and M. Sugiyama, “Complementary-label learning for arbitrary losses and models,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, Long Beach, CA, 2019, pp. 2971–2980.
- [8] X.-Y. Yu, T.-L. Liu, M.-M. Gong, and D.-C. Tao, “Learning with biased complementary labels,” in Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 2018, pp. 69–85.
