Correlated Logistic Model With Elastic Net Regularization for Multilabel Image Classification
Qiang Li, Bo Xie, Jane You, Wei Bian, Dacheng Tao

TL;DR
This paper introduces CorrLog, a correlated logistic model with elastic net regularization for multilabel image classification, explicitly modeling label correlations and improving performance on benchmark datasets.
Contribution
The paper proposes CorrLog, a novel multilabel classification model that incorporates label correlations and uses elastic net regularization for feature and label sparsity.
Findings
CorrLog achieves competitive results on benchmark datasets.
The model's generalization bound is independent of label count.
Elastic net regularization enhances feature and label selection.
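| Notation | Description |
|---|---|
| $\mathcal{D} = \{x^{(l)}, y^{(l)}\}_{l=1}^{n}$ | training dataset with $n$ examples |
| $\mathcal{D}^{k}$ | modified training data set by replacing the $k$-th example of $\mathcal{D}$ with an independent example |
| $\mathcal{D}^{\setminus k}$ | modified training data set by discarding the $k$-th example of $\mathcal{D}$ |
| $\tilde{L}(\Theta)$ | negative log pseudo likelihood over training dataset $\mathcal{D}$ |
| $\tilde{L}_r(\Theta)$ | regularized negative log pseudo likelihood over training dataset $\mathcal{D}$ |
| $R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon)$ | elastic net regularization with weights $\lambda_1$, $\lambda_2$ and parameter $\epsilon$ |
| $\Theta = \{\beta, \alpha\}$ | model parameters of CorrLog |
| $\hat{\Theta}$ | empirical learned model parameters by maximum pseudo likelihood estimation over $\mathcal{D}$ |
| $\hat{\Theta}^{k}$ | empirical learned model parameters over $\mathcal{D}^{k}$ |
| $\hat{\Theta}^{\setminus k}$ | empirical learned model parameters over $\mathcal{D}^{\setminus k}$ |
| $\hat{R}(\hat{\Theta})$ | empirical error of the empirical model over training set $\mathcal{D}$ |
| $R(\hat{\Theta})$ | generalization error of the empirical model |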
| Datasets | #images | #features | #labels |
|---|---|---|---|
| MULANscene | 2047 | 294 | 6 |
| MITscene-PHOW | 2688 | 3600 | 8 |
| MITscene-CNN | 2688 | 4096 | 8 |
| PASCAL07-PHOW | 9963 | 3600 | 20 |
| PASCAL07-CNN | 9963 | 4096 | 20 |
| PASCAL12-PHOW | 11540 | 3600 | 20 |
| PASCAL12-CNN | 11540 | 4096 | 20 |
[Figure: example MITscene images and their tags (coast, forest, highway, insidecity, mountain, opencountry, street, tallbuilding), together with the learned CorrLog label graphs under $\ell_2$ regularization and elastic net regularization.]
| Datasets | Methods | Hamming loss | 0-1 loss | Accuracy | F1-Score | Macro-F1 | Micro-F1 |
|---|---|---|---|---|---|---|---|
| MULANscene | CorrLog | 0.095±0.007 | 0.341±0.020 | 0.710±0.018 | 0.728±0.017 | 0.745±0.016 | 0.734±0.017 |
| | ILRs | 0.117±0.006 | 0.495±0.022 | 0.592±0.016 | 0.622±0.014 | 0.677±0.016 | 0.669±0.014 |
| | IBLR | 0.085±0.004 | 0.358±0.016 | 0.677±0.018 | 0.689±0.019 | 0.747±0.010 | 0.738±0.014 |
| | MLkNN | 0.086±0.003 | 0.374±0.015 | 0.668±0.018 | 0.682±0.019 | 0.742±0.013 | 0.734±0.012 |
| | CC | 0.104±0.005 | 0.346±0.015 | 0.696±0.015 | 0.710±0.015 | 0.716±0.018 | 0.706±0.014 |
| | MMOC | 0.126±0.017 | 0.401±0.046 | 0.629±0.049 | 0.639±0.050 | 0.680±0.031 | 0.638±0.049 |
| Datasets | Methods | Hamming loss | 0-1 loss | Accuracy | F1-Score | Macro-F1 | Micro-F1 |
|---|---|---|---|---|---|---|---|
| MITscene-PHOW | CorrLog | 0.045±0.006 | 0.196±0.017 | 0.884±0.012 | 0.914±0.010 | 0.883±0.017 | 0.915±0.011 |
| | ILRs | 0.071±0.002 | 0.358±0.015 | 0.825±0.007 | 0.877±0.005 | 0.833±0.007 | 0.872±0.003 |
| | IBLR | 0.060±0.003 | 0.243±0.021 | 0.845±0.012 | 0.879±0.008 | 0.848±0.009 | 0.886±0.006 |
| | MLkNN | 0.069±0.002 | 0.326±0.022 | 0.810±0.009 | 0.857±0.006 | 0.827±0.009 | 0.869±0.004 |
| | CC | 0.047±0.005 | 0.200±0.021 | 0.883±0.012 | 0.913±0.008 | 0.883±0.015 | 0.913±0.009 |
| | MMOC | 0.062±0.010 | 0.274±0.035 | 0.845±0.017 | 0.885±0.014 | 0.846±0.024 | 0.885±0.017 |
| MITscene-CNN | CorrLog | 0.017±0.004 | 0.088±0.015 | 0.953±0.008 | 0.966±0.006 | 0.957±0.011 | 0.968±0.006 |
| | ILRs | 0.020±0.002 | 0.102±0.015 | 0.947±0.006 | 0.962±0.004 | 0.951±0.007 | 0.963±0.005 |
| | IBLR | 0.022±0.001 | 0.090±0.009 | 0.944±0.004 | 0.957±0.003 | 0.944±0.004 | 0.958±0.003 |
| | MLkNN | 0.024±0.002 | 0.104±0.005 | 0.939±0.003 | 0.954±0.003 | 0.941±0.002 | 0.955±0.004 |
| | CC | 0.021±0.003 | 0.075±0.008 | 0.951±0.005 | 0.962±0.004 | 0.948±0.007 | 0.961±0.005 |
| | MMOC | 0.018±0.002 | 0.062±0.005 | 0.959±0.003 | 0.967±0.003 | 0.955±0.005 | 0.967±0.004 |
| Datasets | Methods | Hamming loss | 0-1 loss | Accuracy | F1-Score | Macro-F1 | Micro-F1 |
|---|---|---|---|---|---|---|---|
| PASCAL07-PHOW | CorrLog | 0.068±0.001 | 0.776±0.007 | 0.370±0.010 | 0.423±0.012 | 0.367±0.011 | 0.480±0.008 |
| | ILRs | 0.093±0.001 | 0.878±0.007 | 0.294±0.008 | 0.360±0.009 | 0.332±0.008 | 0.404±0.007 |
| | IBLR | 0.066±0.001 | 0.832±0.003 | 0.270±0.005 | 0.308±0.006 | 0.258±0.007 | 0.408±0.009 |
| | MLkNN | 0.066±0.001 | 0.839±0.006 | 0.256±0.007 | 0.291±0.008 | 0.235±0.006 | 0.392±0.007 |
| | CC | 0.091±0.000 | 0.845±0.010 | 0.318±0.005 | 0.379±0.003 | 0.348±0.004 | 0.417±0.001 |
| | MMOC | 0.065±0.001 | 0.850±0.003 | 0.259±0.009 | 0.299±0.011 | 0.206±0.007 | 0.392±0.012 |
| PASCAL07-CNN | CorrLog | 0.038±0.001 | 0.516±0.010 | 0.642±0.010 | 0.696±0.010 | 0.674±0.002 | 0.724±0.006 |
| | ILRs | 0.046±0.001 | 0.574±0.011 | 0.610±0.010 | 0.673±0.009 | 0.651±0.004 | 0.688±0.007 |
| | IBLR | 0.043±0.001 | 0.554±0.011 | 0.597±0.014 | 0.649±0.015 | 0.621±0.007 | 0.682±0.010 |
| | MLkNN | 0.043±0.001 | 0.557±0.010 | 0.585±0.014 | 0.635±0.015 | 0.613±0.006 | 0.668±0.011 |
| | CC | 0.051±0.001 | 0.586±0.008 | 0.602±0.008 | 0.668±0.008 | 0.635±0.009 | 0.669±0.008 |
| | MMOC | 0.037±0.000 | 0.512±0.008 | 0.634±0.009 | 0.684±0.009 | 0.663±0.005 | 0.719±0.004 |
| Datasets | Methods | Hamming loss | 0-1 loss | Accuracy | F1-Score | Macro-F1 | Micro-F1 |
|---|---|---|---|---|---|---|---|
| PASCAL12-PHOW | CorrLog | 0.070±0.001 | 0.790±0.009 | 0.344±0.009 | 0.393±0.010 | 0.369±0.014 | 0.449±0.006 |
| | ILRs | 0.100±0.001 | 0.891±0.009 | 0.269±0.007 | 0.333±0.008 | 0.324±0.008 | 0.370±0.005 |
| | IBLR | 0.068±0.001 | 0.869±0.009 | 0.219±0.005 | 0.252±0.003 | 0.253±0.007 | 0.345±0.005 |
| | MLkNN | 0.069±0.001 | 0.883±0.008 | 0.191±0.006 | 0.218±0.005 | 0.213±0.007 | 0.306±0.006 |
| | CC | 0.097±0.001 | 0.862±0.012 | 0.291±0.010 | 0.350±0.010 | 0.340±0.007 | 0.380±0.006 |
| | MMOC | 0.067±0.001 | 0.865±0.003 | 0.227±0.005 | 0.262±0.007 | 0.200±0.007 | 0.346±0.004 |
| PASCAL12-CNN | CorrLog | 0.040±0.001 | 0.526±0.010 | 0.639±0.007 | 0.695±0.007 | 0.674±0.006 | 0.708±0.006 |
| | ILRs | 0.051±0.001 | 0.613±0.002 | 0.581±0.005 | 0.649±0.006 | 0.638±0.005 | 0.658±0.005 |
| | IBLR | 0.045±0.001 | 0.574±0.006 | 0.575±0.009 | 0.627±0.010 | 0.613±0.008 | 0.657±0.006 |
| | MLkNN | 0.045±0.002 | 0.575±0.012 | 0.566±0.015 | 0.616±0.017 | 0.604±0.011 | 0.645±0.013 |
| | CC | 0.055±0.001 | 0.615±0.010 | 0.579±0.009 | 0.647±0.010 | 0.623±0.005 | 0.643±0.007 |
| | MMOC | 0.039±0.001 | 0.525±0.005 | 0.619±0.006 | 0.669±0.007 | 0.659±0.004 | 0.699±0.005 |
| Methods | Train | Test per image |
|---|---|---|
| CorrLog | ||
| ILRs | ||
| IBLR | ||
| MLkNN | ||
| CC | ||
| MMOC |
| Methods | MULANscene (Train) | MULANscene (Test) | MITscene-PHOW (Train) | MITscene-PHOW (Test) | MITscene-CNN (Train) | MITscene-CNN (Test) | PASCAL07-PHOW (Train) | PASCAL07-PHOW (Test) | PASCAL07-CNN (Train) | PASCAL07-CNN (Test) | PASCAL12-PHOW (Train) | PASCAL12-PHOW (Test) | PASCAL12-CNN (Train) | PASCAL12-CNN (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CorrLog | 0.09 | 1.74 | 2.80 | 2.12 | 2.46 | 2.08 | 8.94 | 10.68 | 8.35 | 11.08 | 9.67 | 12.58 | 8.62 | 13.06 |
| ILRs | 2.54 | 0.02 | 39.50 | 0.37 | 7.50 | 0.15 | 872.77 | 4.79 | 122.73 | 1.56 | 1183.45 | 5.59 | 161.71 | 1.83 |
| IBLR | 12.01 | 2.63 | 218.98 | 53.31 | 215.28 | 52.19 | 3132.18 | 779.15 | 2833.94 | 688.53 | 4142.75 | 1034.86 | 3824.06 | 947.90 |
| MLkNN | 10.29 | 2.36 | 188.08 | 45.87 | 176.52 | 42.78 | 2507.51 | 628.61 | 2232.19 | 551.26 | 3442.21 | 863.17 | 3020.29 | 779.50 |
| CC | 5.48 | 0.06 | 40.71 | 0.55 | 26.64 | 0.65 | 74315.65 | 7.48 | 8746.82 | 8.15 | 137818.99 | 8.38 | 15926.62 | 9.57 |
| MMOC | 851.98 | 0.51 | 2952.77 | 0.70 | 2162.13 | 0.48 | 86714.47 | 33.08 | 38403.54 | 17.75 | 97856.16 | 31.43 | 45541.01 | 20.66 |
Qiang Li, Bo Xie, Jane You, Wei Bian, and Dacheng Tao

This research was supported in part by Australian Research Council Projects FT-130101457, DP-140102164 and LE-140100061. The funding support from the Hong Kong government under its General Research Fund (GRF) scheme (Ref. no. 152202/14E) and the Hong Kong Polytechnic University Central Research Grant is greatly appreciated.

Q. Li is with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia, and also with Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: [email protected]).

W. Bian and D. Tao are with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia (e-mail: [email protected], [email protected]).

B. Xie is with College of Computing, Georgia Institute of Technology, Atlanta, GA 30345, USA (email: [email protected]).

J. You is with Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong (e-mail: [email protected]).

ⓒ20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
In this paper, we present correlated logistic model (CorrLog) for multilabel image classification. CorrLog extends conventional Logistic Regression model into multilabel cases, via explicitly modelling the pairwise correlation between labels. In addition, we propose to learn model parameters of CorrLog with Elastic Net regularization, which helps exploit the sparsity in feature selection and label correlations and thus further boost the performance of multilabel classification. CorrLog can be efficiently learned, though approximately, by regularized maximum pseudo likelihood estimation (MPLE), and it enjoys a satisfying generalization bound that is independent of the number of labels. CorrLog performs competitively for multilabel image classification on benchmark datasets MULAN scene, MIT outdoor scene, PASCAL VOC 2007 and PASCAL VOC 2012, compared to the state-of-the-art multilabel classification algorithms.
Index Terms:
Correlated logistic model, elastic net, multilabel classification.
I Introduction
Multilabel classification (MLC) extends conventional single label classification (SLC) by allowing an instance to be assigned to multiple labels from a label set. It arises naturally in a wide range of practical problems, such as document categorization, image classification, music annotation, webpage classification and bioinformatics applications, where each instance can be simultaneously described by several class labels out of a candidate label set. MLC is also closely related to many other research areas, such as subspace learning [1], nonnegative matrix factorization [2], multi-view learning [3] and multi-task learning [4]. Because of its great generality and wide applications, MLC has received increasing attention in recent years from the machine learning, data mining and computer vision communities, and has developed rapidly with both algorithmic and theoretical achievements [5, 6, 7, 8, 9, 10].
The key feature of MLC that makes it distinct from SLC is label correlation, without which classifiers can be trained independently for each individual label and MLC degenerates to SLC. The correlation between different labels can be verified by calculating statistics of their distributions, e.g., the $\chi^2$ test and Pearson's correlation coefficient. According to [11], there are two types of label correlations (or dependence), i.e., the conditional correlations and the unconditional correlations, wherein the former describes the label correlations conditioned on a given instance while the latter summarizes the global label correlations of the label distribution alone by marginalizing out the instance. From a classification point of view, modelling of conditional label correlations is preferable since they are directly related to prediction; however, proper utilization of unconditional correlations is also helpful, but in an average sense because of the marginalization. Accordingly, quite a number of MLC algorithms have been proposed in the past few years by exploiting either of the two types of label correlations. (Footnote 1: Studies on MLC from perspectives other than label correlations also exist in the literature, e.g., by defining different loss functions, dimension reduction and classifier ensemble methods, but are not in the scope of this paper.) Below, we give a brief review of the representative ones. As the literature is very large, we cannot cover all the algorithms. The recent surveys [8, 9] contain many references omitted from this paper.
- By exploiting unconditional label correlations: A large class of MLC algorithms that utilize unconditional label correlations are built upon label transformation. The key idea is to find a new representation for the label vector (one dimension corresponds to an individual label), so that the transformed labels or responses are uncorrelated and thus can be predicted independently. The original label vector needs to be recovered after the prediction. MLC algorithms using label transformation include [12], which utilizes low-dimensional embedding, and [7] and [13], which use random projections. Another strategy of using unconditional label correlations, e.g., used in the stacking method [6] and the "Curds" & "Whey" procedure [14], is first to predict each individual label independently and then correct/adjust the prediction by proper post-processing. Algorithms are also proposed based on co-occurrence or structure information extracted from the label set, which include random $k$-label sets (RAKEL) [15], pruned problem transformation (PPT) [16], hierarchical binary relevance (HBR) [17] and hierarchy of multilabel classifiers (HOMER) [8]. Regression-based models, including reduced-rank regression and multitask learning, can also be used for MLC, with an interpretation of utilizing unconditional label correlations [11].
- By exploiting conditional label correlations: MLC algorithms in this category are diverse and often developed by specific heuristics. For example, multilabel $k$-nearest neighbour (MLkNN) [5] extends KNN to the multilabel situation, and applies maximum a posteriori (MAP) label prediction by obtaining the prior label distribution within the $k$ nearest neighbours of an instance. Instance-based logistic regression (IBLR) [6] is also a localized algorithm, which modifies logistic regression by using label information from the neighbourhood as features. Classifier chain (CC) [18], as well as its ensemble and probabilistic variants [19], incorporates label correlations into a chain of binary classifiers, where the prediction of a label uses previous labels as features. Channel coding based MLC techniques such as principal label space transformation (PLST) [20] and maximum margin output coding (MMOC) [21] propose to select codes that exploit conditional label correlations. Graphical models, e.g., conditional random fields (CRFs) [22], are also applied to MLC, and provide a richer framework to handle conditional label correlations.
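As a small illustration of the unconditional label correlations discussed above, the following sketch computes Pearson's correlation coefficient between label columns. The synthetic labels, the 10% flip rate and the function name `label_correlation` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def label_correlation(Y):
    """Pearson correlation matrix between the label columns of Y (shape (n, m))."""
    return np.corrcoef(Y, rowvar=False)

# Hypothetical synthetic labels in {-1, +1}, for illustration only.
rng = np.random.default_rng(0)
n = 1000
y1 = rng.choice([-1, 1], size=n)
flip = rng.random(n) < 0.1
y2 = np.where(flip, -y1, y1)          # agrees with y1 about 90% of the time
y3 = rng.choice([-1, 1], size=n)      # drawn independently of y1 and y2
Y = np.column_stack([y1, y2, y3])

C = label_correlation(Y)
print(np.round(C, 2))
```

A large off-diagonal entry suggests that the corresponding pair of labels should not be predicted independently.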
I-A Multilabel Image Classification
Multilabel image classification belongs to the generic scope of MLC, but handles the specific problem of predicting the presence or absence of multiple object categories in an image. Like many related high-level vision tasks such as object recognition [23, 24], visual tracking [25], image annotation [26, 27, 28] and scene classification [29, 30, 31], multilabel image classification [32, 33, 34, 35, 36, 37] is very challenging due to large intra-class variation. In general, the variation is caused by viewpoint, scale, occlusion, illumination, semantic context, etc.
On the one hand, many effective image representation schemes have been developed to handle this high-level vision task. Most of the classical approaches derive from handcrafted image features, such as GIST [38], dense SIFT [39], VLAD [40], and object bank [41]. In contrast, the very recent deep learning techniques have also been developed for image feature learning, such as deep CNN features [42, 43]. These techniques are more powerful than classical methods when learning from a very large amount of unlabeled images.
On the other hand, label correlations have also been exploited to significantly improve image classification performance. Most of the current multilabel image classification algorithms are motivated by considering label correlations conditioned on image features, and thus intrinsically fall into the CRFs framework. For example, the probabilistic label enhancement model (PLEM) [44] is designed to exploit image label co-occurrence pairs based on a maximum spanning tree construction, and a piecewise procedure is utilized to train the pairwise CRFs model. More recently, the clique generating machine (CGM) [45] was proposed to learn the image label graph structure and parameters by iteratively activating a set of cliques. It also belongs to the CRFs framework, but the labels are not constrained to be all connected, which may result in isolated cliques.
I-B Motivation and Organization
Correlated logistic model (CorrLog) provides a more principled way to handle conditional label correlations, and enjoys several favourable properties: 1) built upon independent logistic regressions (ILRs), it offers an explicit way to model the pairwise (second order) label correlations; 2) by using the pseudo likelihood technique, the parameters of CorrLog can be learned approximately with a computational complexity linear with respect to label number; 3) the learning of CorrLog is stable, and the empirically learned model enjoys a generalization error bound that is independent of label number. In addition, the results presented in this paper extend our previous study [46] in following aspects: 1) we introduce elastic net regularization to CorrLog, which facilitates the utilization of the sparsity in both feature selection and label correlations; 2) a learning algorithm for CorrLog based on soft thresholding is derived to handle the nonsmoothness of the elastic net regularization; 3) the proof of generalization bound is also extended for the new regularization; 4) we apply CorrLog to multilabel image classification, and achieve competitive results with the state-of-the-art methods of this area.
To ease the presentation, we first summarize the important notations in Table I. The rest of this paper is organized as follows. Section II introduces the model CorrLog with elastic net regularization. Section III presents algorithms for learning CorrLog by regularized maximum pseudo likelihood estimation, and for prediction with CorrLog by message passing. A generalization analysis of CorrLog based on the concept of algorithm stability is presented in Section IV. Section V to Section VII report results of empirical evaluations, including experiments on synthetic dataset and on benchmark multilabel image classification datasets.
II Correlated Logistic Model
We study the problem of learning a joint prediction $y = d(x): \mathcal{X} \mapsto \mathcal{Y}$, where the instance space $\mathcal{X} = \{x : \|x\| \le 1, x \in \mathbb{R}^D\}$ and the label space $\mathcal{Y} = \{-1, 1\}^m$. By assuming the conditional independence among labels, we can model MLC by a set of independent logistic regressions (ILRs). Specifically, the conditional probability $p_{lr}(y|x)$ of ILRs is given by
$$p_{lr}(y|x) = \prod_{i=1}^{m} p_{lr}(y_i|x) = \prod_{i=1}^{m} \frac{\exp(y_i \beta_i^T x)}{\exp(\beta_i^T x) + \exp(-\beta_i^T x)}, \tag{1}$$
where $\beta_i \in \mathbb{R}^D$ is the coefficient vector of the $i$-th logistic regression (LR) in ILRs. For convenience of expression, the bias of the standard LR is omitted here, which is equivalent to augmenting the feature vector $x$ with a constant.
Clearly, ILRs (1) enjoy several merits: they can be learned efficiently, in particular with a computational complexity linear in the label number $m$, and their probabilistic formulation inherently helps deal with the imbalance of positive and negative examples for each label, which is a common problem encountered by MLC. However, ILRs entirely ignore the potential correlation among labels and thus tend to under-fit the true posterior $p(y|x)$, especially when the label number $m$ is large.
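The ILR probability can be sketched in a few lines. The sketch assumes labels coded in {-1, +1} and the standard logistic form for each label, with one coefficient row per label. The name `ilr_log_prob` and the random parameters are hypothetical.

```python
import itertools
import numpy as np

def ilr_log_prob(y, x, beta):
    """Log-probability of a label vector y under independent logistic regressions.

    y: (m,) labels in {-1, +1}; x: (D,) features; beta: (m, D) coefficients.
    """
    z = beta @ x                                   # per-label scores beta_i^T x
    # exp(y z) / (exp(z) + exp(-z)) equals 1 / (1 + exp(-2 y z))
    return -np.sum(np.logaddexp(0.0, -2.0 * y * z))

rng = np.random.default_rng(0)
m, D = 3, 4
beta = rng.normal(size=(m, D))
x = rng.normal(size=D)
x /= np.linalg.norm(x)                             # keep the instance inside the unit ball

# The probabilities of all 2^m label vectors should sum to one.
total = sum(np.exp(ilr_log_prob(np.array(v), x, beta))
            for v in itertools.product([-1, 1], repeat=m))
print(total)
```

Because the joint factorizes over labels, the cost of evaluating it grows only linearly with the number of labels.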
II-A Correlated Logistic Regressions
CorrLog tries to extend ILRs with as little effort as possible, so that the correlation among labels is explicitly modelled while the advantages of ILRs are preserved. To achieve this, we propose to augment (1) with a simple function $q(y)$ and reformulate the posterior probability as
$$p(y|x) \propto p_{lr}(y|x)\, q(y). \tag{2}$$
As long as $q(y)$ cannot be decomposed into independent product terms for individual labels, it introduces label correlations into $p(y|x)$. It is worth noticing that we assumed $q(y)$ to be independent of $x$. Therefore, (2) models label correlations in an average sense. This is similar to the concept of "marginal correlations" in MLC [11]. However, they are intrinsically different, because (2) integrates the correlation into the posterior probability, which directly aims at prediction. In addition, the idea used in (2) for correlation modelling is also distinct from the "Curds and Whey" procedure in [14], which corrects outputs of multivariate linear regression by reconsidering their correlations to the true responses.
In this paper, we choose $q(y)$ to be the following quadratic form,
$$q(y) = \exp\Big\{ \sum_{i<j} \alpha_{ij}\, y_i y_j \Big\}. \tag{3}$$
It means that $y_i$ and $y_j$ are positively correlated given $\alpha_{ij} > 0$ and negatively correlated given $\alpha_{ij} < 0$. It is also possible to define $\alpha_{ij}$ as functions of $x$, but this will drastically increase the number of model parameters, e.g., by $m(m-1)D/2$ if linear functions are used.
By substituting (3) into (2), we obtain the conditional probability for CorrLog
$$p(y|x; \Theta) \propto \exp\Big\{ \sum_{i=1}^{m} y_i \beta_i^T x + \sum_{i<j} \alpha_{ij}\, y_i y_j \Big\}, \tag{4}$$
where the model parameter $\Theta = \{\beta, \alpha\}$ contains $\beta = [\beta_1, \dots, \beta_m]$ and $\alpha = [\alpha_{12}, \dots, \alpha_{(m-1)m}]^T$. It can be seen that CorrLog is a simple modification of (1), using a quadratic term to adjust the joint prediction so that hidden label correlations can be exploited. In addition, CorrLog is closely related to popular statistical models for joint modelling of binary variables. For example, conditional on $x$, (4) is exactly an Ising model [47] for $y$. It can also be treated as a special instance of CRFs [22], by defining features $\phi_i(x, y) = y_i x$ and $\psi_{ij}(y) = y_i y_j$. Moreover, the classical multivariate probit (MP) model [48] also models pairwise correlations in $y$. However, it utilizes Gaussian latent variables for correlation modelling, which is essentially different from CorrLog.
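For a handful of labels, the CorrLog posterior can be evaluated exactly by enumerating all label vectors, which makes the effect of the pairwise term easy to inspect. The sketch assumes labels in {-1, +1} and stores the pairwise coefficients in the strict upper triangle of a matrix. The function names are hypothetical, and the enumeration is only feasible for small label counts.

```python
import itertools
import numpy as np

def corrlog_score(y, x, beta, alpha):
    """Unnormalized log-probability: sum_i y_i beta_i^T x + sum_{i<j} alpha_ij y_i y_j.

    alpha is an (m, m) array; only its strict upper triangle is used.
    """
    upper = np.triu(alpha, k=1)
    return y @ (beta @ x) + y @ upper @ y

def corrlog_posterior(x, beta, alpha):
    """Exact posterior over all 2^m label vectors by enumeration (small m only)."""
    m = beta.shape[0]
    labels = np.array(list(itertools.product([-1, 1], repeat=m)))
    scores = np.array([corrlog_score(y, x, beta, alpha) for y in labels])
    log_partition = np.logaddexp.reduce(scores)    # log of the normalizing constant
    return labels, np.exp(scores - log_partition)

# Two labels, no feature signal, and a positive coupling between them.
x = np.array([1.0, 0.0])
beta = np.zeros((2, 2))
alpha = np.array([[0.0, 1.0], [0.0, 0.0]])
labels, probs = corrlog_posterior(x, beta, alpha)
p_agree = probs[labels[:, 0] == labels[:, 1]].sum()
print(p_agree)
```

With all pairwise coefficients set to zero the posterior reduces to the product of independent logistic probabilities, and a positive coefficient shifts mass toward label vectors in which the two labels agree.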
II-B Elastic Net Regularization
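Given a set of training data $\mathcal{D} = \{x^{(l)}, y^{(l)}\}_{l=1}^{n}$, CorrLog can be learned by regularized maximum log likelihood estimation (MLE), i.e.,
$$\hat{\Theta} = \arg\min_{\Theta}\; L(\Theta) + R(\Theta), \tag{5}$$
where $L(\Theta)$ is the negative log likelihood
$$L(\Theta) = -\frac{1}{n} \sum_{l=1}^{n} \log p(y^{(l)}|x^{(l)}; \Theta), \tag{6}$$
and $R(\Theta)$ is a properly chosen regularization.
A possible choice for $R(\Theta)$ is the $\ell_2$ regularizer,
$$R_2(\Theta; \lambda_1, \lambda_2) = \lambda_1 \sum_{i=1}^{m} \|\beta_i\|_2^2 + \lambda_2 \sum_{i<j} |\alpha_{ij}|^2, \tag{7}$$
with $\lambda_1, \lambda_2 > 0$ being the weighting parameters. The $\ell_2$ regularization enjoys the merits of computational flexibility and learning stability. However, it is unable to exploit any sparsity that the problem at hand may possess. For example, for MLC, it is likely that the prediction of each label only depends on a subset of the $D$ features of $x$, which implies the sparsity of $\beta_i$. Besides, $\alpha$ can also be sparse since not all labels in $y$ are correlated with each other. The $\ell_1$ regularizer is another choice for $R(\Theta)$, especially regarding model sparsity. Nevertheless, several studies have noticed that $\ell_1$ regularized algorithms are inherently unstable, that is, a slight change of the training data set can lead to substantially different prediction models. Based on the above consideration, we propose to use the elastic net regularizer [49], which is a combination of the $\ell_2$ and $\ell_1$ regularizers and inherits their individual advantages, i.e., learning stability and model sparsity,
$$R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon) = \lambda_1 \sum_{i=1}^{m} \big( \|\beta_i\|_2^2 + \epsilon \|\beta_i\|_1 \big) + \lambda_2 \sum_{i<j} \big( |\alpha_{ij}|^2 + \epsilon |\alpha_{ij}| \big), \tag{8}$$
where $\epsilon \ge 0$ controls the trade-off between the $\ell_1$ regularization and the $\ell_2$ regularization, and large $\epsilon$ encourages a high level of sparsity.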
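The elastic net penalty can be written as a short function. The sketch assumes that the penalty takes the form of a squared $\ell_2$ term plus a trade-off parameter times an $\ell_1$ term, weighted separately for the feature coefficients and the pairwise coefficients. The argument names `lam1`, `lam2` and `eps` are chosen for this example.

```python
import numpy as np

def elastic_net(beta, alpha_pairs, lam1, lam2, eps):
    """Elastic net penalty on feature coefficients and pairwise label coefficients.

    beta: (m, D) feature coefficients; alpha_pairs: 1-D array with one entry per label pair.
    eps = 0 gives a pure squared-l2 penalty; larger eps puts more weight on the l1 part.
    """
    l2_part = lam1 * np.sum(beta ** 2) + lam2 * np.sum(alpha_pairs ** 2)
    l1_part = lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(np.abs(alpha_pairs))
    return l2_part + eps * l1_part

beta = np.array([[1.0, -2.0]])
alpha_pairs = np.array([3.0])
value = elastic_net(beta, alpha_pairs, lam1=0.5, lam2=0.1, eps=2.0)
print(value)
```

The squared part is smooth and stabilizes learning, while the absolute-value part is what drives individual coefficients exactly to zero.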
III Algorithms
In this section, we derive algorithms for learning and prediction with CorrLog. The exponentially large size of the label space $\mathcal{Y} = \{-1, 1\}^m$ makes exact algorithms for CorrLog computationally intractable, since the conditional probability (4) needs to be normalized by the partition function
$$A(\Theta) = \sum_{y \in \mathcal{Y}} \exp\Big\{ \sum_{i=1}^{m} y_i \beta_i^T x + \sum_{i<j} \alpha_{ij}\, y_i y_j \Big\}, \tag{9}$$
which is a summation over an exponential number of terms. Thus, we turn to approximate learning and prediction algorithms, by exploiting the pseudo likelihood and the message passing techniques.
III-A Approximate Learning via Pseudo Likelihood
Maximum pseudo likelihood estimation (MPLE) [50] provides an alternative approach for estimating model parameters, especially when the partition function of the likelihood cannot be evaluated efficiently. It was developed in the field of spatial dependence analysis and has been widely applied to the estimation of various statistical models, from the Ising model [47] to the CRFs [51]. Here, we apply MPLE to the learning of parameter in CorrLog.
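The pseudo likelihood of a model over $m$ jointly distributed random variables is defined as the product of the conditional probabilities of each individual random variable conditioned on all the rest. For CorrLog (4), its pseudo likelihood is given by
$$\tilde{p}(y|x; \Theta) = \prod_{i=1}^{m} p(y_i | y_{-i}, x; \Theta), \tag{10}$$
where $y_{-i} = [y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_m]$ and the conditional probability $p(y_i | y_{-i}, x; \Theta)$ can be directly obtained from (4),
$$p(y_i | y_{-i}, x; \Theta) = \frac{1}{1 + \exp\Big\{ -2 y_i \Big( \beta_i^T x + \sum_{j=i+1}^{m} \alpha_{ij} y_j + \sum_{j=1}^{i-1} \alpha_{ji} y_j \Big) \Big\}}. \tag{11}$$
Accordingly, the negative log pseudo likelihood over the training data $\mathcal{D}$ is given by
$$\tilde{L}(\Theta) = -\frac{1}{n} \sum_{l=1}^{n} \sum_{i=1}^{m} \log p(y_i^{(l)} | y_{-i}^{(l)}, x^{(l)}; \Theta). \tag{12}$$
The optimal model parameter $\hat{\Theta} = \{\hat{\beta}, \hat{\alpha}\}$ of CorrLog can then be learned approximately by the elastic net regularized MPLE,
$$\hat{\Theta} = \arg\min_{\Theta}\; \tilde{L}_r(\Theta) = \arg\min_{\Theta}\; \tilde{L}(\Theta) + R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon), \tag{13}$$
where $\lambda_1$, $\lambda_2$ and $\epsilon$ are tuning parameters.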
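The negative log pseudo likelihood avoids the partition function because each conditional involves only a single binary label. The sketch below assumes labels in {-1, +1} and pairwise coefficients stored in the strict upper triangle of a matrix. It then checks the vectorized value against conditionals computed by flipping one label at a time. The function name and the random parameters are hypothetical.

```python
import numpy as np

def neg_log_pseudo_likelihood(X, Y, beta, alpha):
    """Average negative log pseudo likelihood.

    X: (n, D) features; Y: (n, m) labels in {-1, +1};
    beta: (m, D); alpha: (m, m), only the strict upper triangle is used.
    """
    A = np.triu(alpha, k=1)
    A = A + A.T                          # symmetric couplings with zero diagonal
    Z = X @ beta.T + Y @ A               # Z[l, i] = beta_i^T x + sum_{j != i} alpha_ij y_j
    # -log p(y_i | rest) = log(1 + exp(-2 y_i Z_i))
    return np.mean(np.sum(np.logaddexp(0.0, -2.0 * Y * Z), axis=1))

rng = np.random.default_rng(1)
m, D = 3, 2
beta = rng.normal(size=(m, D))
alpha = rng.normal(size=(m, m))
x = rng.normal(size=D)
y = np.array([1, -1, 1])

def joint_score(v):
    # Unnormalized log-probability of a full label vector.
    return v @ (beta @ x) + v @ np.triu(alpha, k=1) @ v

# Reference: each conditional follows from the joint by flipping a single label.
reference = 0.0
for i in range(m):
    flipped = y.copy()
    flipped[i] = -flipped[i]
    reference += np.logaddexp(0.0, joint_score(flipped) - joint_score(y))

value = neg_log_pseudo_likelihood(x[None, :], y[None, :], beta, alpha)
print(value, reference)
```

The cost of one evaluation is a pair of matrix products, so no enumeration of label vectors is needed during learning.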
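A First-Order Method by Soft Thresholding: The elastic net regularized MPLE problem above is a convex optimization problem, thanks to the convexity of the logarithmic loss function and the elastic net regularization, and thus has a unique optimal solution. However, the elastic net regularization is non-smooth due to the $\ell_1$ norm regularizer, which makes direct gradient-based algorithms inapplicable. The main idea of our algorithm is to divide the objective function into smooth and non-smooth parts, and then apply the soft thresholding technique to deal with the non-smoothness.
Denoting by $J(\Theta)$ the smooth part of $\tilde{L}_r(\Theta)$, i.e.,
$$J(\Theta) = \tilde{L}(\Theta) + \lambda_1 \sum_{i=1}^{m} \|\beta_i\|_2^2 + \lambda_2 \sum_{i<j} |\alpha_{ij}|^2, \tag{14}$$
its gradient at the $k$-th iteration $\Theta^{(k)}$ is given by
$$\nabla J_{\beta_i}^{(k)} = \frac{1}{n} \sum_{l=1}^{n} \xi_{li}\, x^{(l)} + 2\lambda_1 \beta_i^{(k)}, \qquad \nabla J_{\alpha_{ij}}^{(k)} = \frac{1}{n} \sum_{l=1}^{n} \big( \xi_{li}\, y_j^{(l)} + \xi_{lj}\, y_i^{(l)} \big) + 2\lambda_2 \alpha_{ij}^{(k)}, \tag{15}$$
with
$$\xi_{li} = \frac{-2 y_i^{(l)}}{1 + \exp\Big\{ 2 y_i^{(l)} \Big( \beta_i^{(k)T} x^{(l)} + \sum_{j=i+1}^{m} \alpha_{ij}^{(k)} y_j^{(l)} + \sum_{j=1}^{i-1} \alpha_{ji}^{(k)} y_j^{(l)} \Big) \Big\}}. \tag{16}$$
Then, a surrogate of the objective function can be obtained by using $\nabla J^{(k)}$, i.e.,
$$Q(\Theta; \Theta^{(k)}) = J(\Theta^{(k)}) + \langle \nabla J^{(k)}, \Theta - \Theta^{(k)} \rangle + \frac{1}{2\eta} \|\Theta - \Theta^{(k)}\|_2^2 + \epsilon \Big( \lambda_1 \sum_{i=1}^{m} \|\beta_i\|_1 + \lambda_2 \sum_{i<j} |\alpha_{ij}| \Big). \tag{17}$$
The parameter $\eta$ in (17) serves a similar role to the step size in gradient descent methods, and it is set such that $1/\eta$ is larger than the Lipschitz constant of $\nabla J$. For such $\eta$, it can be shown that $Q(\Theta; \Theta^{(k)}) \ge \tilde{L}_r(\Theta)$ and $Q(\Theta^{(k)}; \Theta^{(k)}) = \tilde{L}_r(\Theta^{(k)})$. Therefore, the update of $\Theta$ can be realized by the minimization
$$\Theta^{(k+1)} = \arg\min_{\Theta}\; Q(\Theta; \Theta^{(k)}), \tag{18}$$
which is solved by the soft thresholding function $S(\cdot)$, i.e.,
$$\beta_i^{(k+1)} = S\big( \beta_i^{(k)} - \eta \nabla J_{\beta_i}^{(k)};\; \eta \lambda_1 \epsilon \big), \qquad \alpha_{ij}^{(k+1)} = S\big( \alpha_{ij}^{(k)} - \eta \nabla J_{\alpha_{ij}}^{(k)};\; \eta \lambda_2 \epsilon \big), \tag{19}$$
where $S$ is applied elementwise,
$$S(u; \rho) = \mathrm{sign}(u) \max(|u| - \rho, 0). \tag{20}$$
Iteratively applying (19) until convergence provides a first-order method for solving the regularized MPLE problem. Algorithm 1 presents the pseudo code for this procedure.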
Remark 1. From the above derivation, especially equations (15) and (19), the computational complexity of our learning algorithm is linear with respect to the label number $m$. Therefore, learning CorrLog is no more expensive than learning independent logistic regressions, which makes CorrLog scalable to the case of large label numbers.
Remark 2. It is possible to further speed up the learning algorithm. As stated, the convergence of Algorithm 1 is usually as slow as that of standard gradient descent methods. However, Algorithm 1 can be modified to have the optimal convergence rate in the sense of Nemirovsky and Yudin [52], i.e., $O(1/k^2)$, where $k$ is the number of iterations. To do so, we only need to replace the current variable $\Theta^{(k)}$ in the surrogate by a weighted combination of the variables from previous iterations. As such a modification is a direct application of the fast iterative shrinkage thresholding algorithm [53], we do not present the details here but refer readers to the reference.
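The soft thresholding procedure described in this subsection can be sketched as a plain proximal gradient loop. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1: it uses a fixed step size `eta` (assumed small enough relative to the Lipschitz constant of the smooth part), a fixed iteration count instead of a convergence test, and a symmetric matrix `A` to hold the pairwise coefficients. All function and variable names are hypothetical.

```python
import numpy as np

def soft_threshold(u, rho):
    """Elementwise soft thresholding: shrink u toward zero by rho."""
    return np.sign(u) * np.maximum(np.abs(u) - rho, 0.0)

def objective(X, Y, beta, A, lam1, lam2, eps):
    """Elastic net regularized negative log pseudo likelihood."""
    Z = X @ beta.T + Y @ A
    loss = np.mean(np.sum(np.logaddexp(0.0, -2.0 * Y * Z), axis=1))
    a = A[np.triu_indices_from(A, k=1)]            # count each pair once
    reg = (lam1 * (np.sum(beta ** 2) + eps * np.sum(np.abs(beta)))
           + lam2 * (np.sum(a ** 2) + eps * np.sum(np.abs(a))))
    return loss + reg

def fit_corrlog(X, Y, lam1=0.01, lam2=0.01, eps=1.0, eta=0.05, n_iter=300):
    """Proximal gradient descent on the regularized pseudo likelihood."""
    n, D = X.shape
    m = Y.shape[1]
    beta = np.zeros((m, D))
    A = np.zeros((m, m))                           # symmetric, zero diagonal
    off_diag = ~np.eye(m, dtype=bool)
    for _ in range(n_iter):
        Z = X @ beta.T + Y @ A
        # Derivative of log(1 + exp(-2 y z)) w.r.t. z, written stably with tanh.
        Xi = -Y * (1.0 - np.tanh(Y * Z))
        grad_beta = Xi.T @ X / n + 2.0 * lam1 * beta
        G = Xi.T @ Y / n                           # G[i, j] = mean_l xi_li * y_lj
        grad_A = G + G.T + 2.0 * lam2 * A
        beta = soft_threshold(beta - eta * grad_beta, eta * lam1 * eps)
        A = soft_threshold(A - eta * grad_A, eta * lam2 * eps) * off_diag
    return beta, A

# Hypothetical synthetic data: labels 0 and 1 always agree, label 2 is random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y0 = np.where(X[:, 0] > 0, 1, -1)
Y = np.column_stack([y0, y0, rng.choice([-1, 1], size=200)])
beta, A = fit_corrlog(X, Y)
print(np.round(A, 2))
```

In this toy setting the learned coefficient between the two identical labels should be positive. The accelerated variant mentioned in Remark 2 would replace the current iterate in the gradient step with a weighted combination of previous iterates.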
III-B Joint Prediction by Message Passing
For MLC, as the labels are not independent in general, the prediction task is actually a joint maximum a posteriori (MAP) estimation over $p(y|x)$. In the case of CorrLog, supposing the model parameter $\hat{\Theta}$ is learned by the regularized MPLE from the last subsection, the prediction $\hat{y}$ for a new instance $x$ can be obtained by
$$\hat{y} = \arg\max_{y \in \mathcal{Y}}\; p(y|x; \hat{\Theta}) = \arg\max_{y \in \mathcal{Y}}\; \exp\Big\{ \sum_{i=1}^{m} y_i \hat{\beta}_i^T x + \sum_{i<j} \hat{\alpha}_{ij}\, y_i y_j \Big\}.$$
We use belief propagation (BP) to solve this problem [54]. Specifically, we run the max-product algorithm with uniformly initialized messages and an early stopping criterion of 50 iterations. Since the graphical model defined by $\hat{\alpha}$ has loops, we cannot guarantee the convergence of the algorithm. However, we found that it works well on all experiments in this paper.
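The joint prediction step can be sketched with a log-domain max-product routine on a pairwise binary model, together with a brute-force search that serves as a reference for small label counts. This is a sketch under stated assumptions: messages start at zero (uniform), updates are synchronous, and the loop runs for a fixed number of iterations. On graphs with loops the result is not guaranteed to be the exact MAP, and the function names are hypothetical.

```python
import itertools
import numpy as np

STATES = np.array([-1.0, 1.0])

def map_max_product(h, A, n_iter=50):
    """Approximate MAP labels by max-product message passing in the log domain.

    h: (m,) node scores (beta_i^T x); A: (m, m) symmetric couplings, zero diagonal.
    """
    m = len(h)
    msg = np.zeros((m, m, 2))      # msg[i, j, s]: message from node i to node j about y_j = STATES[s]
    pair = np.outer(STATES, STATES)                 # pair[s_i, s_j] = y_i * y_j
    for _ in range(n_iter):
        new = np.zeros_like(msg)
        for i in range(m):
            incoming = msg[:, i, :].sum(axis=0)     # all messages into node i
            for j in range(m):
                if i == j:
                    continue
                # Exclude the message that j itself sent to i.
                local = h[i] * STATES + incoming - msg[j, i, :]
                cand = local[:, None] + A[i, j] * pair
                new[i, j, :] = cand.max(axis=0)     # maximize over y_i
                new[i, j, :] -= new[i, j, :].max()  # normalize for stability
        msg = new
    belief = h[:, None] * STATES[None, :] + msg.sum(axis=0)
    return STATES[belief.argmax(axis=1)]

def map_brute_force(h, A):
    """Exact MAP by enumerating all 2^m label vectors (small m only)."""
    best, best_score = None, -np.inf
    for y in itertools.product(STATES, repeat=len(h)):
        y = np.array(y)
        score = h @ y + 0.5 * y @ A @ y             # symmetric A counts each pair twice
        if score > best_score:
            best, best_score = y, score
    return best

# Chain-structured couplings, where max-product is exact.
rng = np.random.default_rng(0)
m = 4
h = rng.normal(size=m)
A = np.zeros((m, m))
for i in range(m - 1):
    A[i, i + 1] = A[i + 1, i] = rng.normal()
print(map_max_product(h, A), map_brute_force(h, A))
```

On the chain used here the two routines should agree, since message passing is exact on trees. For densely connected label graphs, the brute-force routine offers a way to sanity-check the approximate result when the number of labels is small.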
IV Generalization Analysis
An important issue in designing a machine learning algorithm is generalization, i.e., how the algorithm will perform on the test data compared to the training data. In this section, we present a generalization analysis for CorrLog, by using the concept of algorithmic stability [55]. Our analysis follows two steps. First, we show that the learning of CorrLog by MPLE is stable, i.e., the learned model parameter $\hat{\Theta}$ does not vary much given a slight change of the training data set $\mathcal{D}$. Then, we prove that the generalization error of CorrLog can be bounded by the empirical error, plus a term related to the stability but independent of the label number $m$.
IV-A The Stability of MPLE
The stability of a learning algorithm indicates how much the learned model changes according to a small change of the training data set. Denote by $\mathcal{D}^k$ a modified training data set that is the same as $\mathcal{D}$ except that the $k$-th training example $(x^{(k)}, y^{(k)})$ is replaced by another independent example $(x', y')$. Suppose $\hat{\Theta}$ and $\hat{\Theta}^k$ are the model parameters learned by the elastic net regularized MPLE on $\mathcal{D}$ and $\mathcal{D}^k$, respectively. We intend to show that the difference between these two models, defined as
[TABLE]
is bounded by an order of $O(1/n)$, so that the learning is stable for large $n$.
First, we need the following auxiliary model $\hat{\Theta}^{\setminus k}$ learned on $\mathcal{D}^{\setminus k}$, which is the same as $\mathcal{D}$ but without the $k$-th example
[TABLE]
where
[TABLE]
The following Lemma provides an upper bound of the difference .
Lemma 1.
Given and defined in (III-A), and defined in (26), it holds for ,
[TABLE]
Proof.
Denote by RHS the righthand side of (1), we have
[TABLE]
Furthermore, the definition of implies . Combining these two we have (1). This completes the proof. ∎
Next, we show a lower bound of the difference .
Lemma 2.
Given and defined in (III-A), and defined in (26), it holds for ,
[TABLE]
Proof.
We define the following function
[TABLE]
Then, for (29), it is sufficient to show that . By using (III-A), we have
[TABLE]
It is straightforward to verify that and in (III-A) have the same subgradient at , i.e.,
[TABLE]
Since minimizes , we have and thus , which implies also minimizes . Therefore . ∎
In addition, by checking the Lipschitz continuous property of , we have the following Lemma 3.
Lemma 3.
Given defined in (III-A) and defined in (26), it holds for and
[TABLE]
Proof.
First, we have
[TABLE]
and
[TABLE]
That is, is Lipschitz continuous with respect to and , with constants and , respectively. Therefore, (3) holds. ∎
By combining the above three Lemmas, we have the following Theorem 1 that shows the stability of CorrLog.
Theorem 1.
Given model parameters and learned on training datasets and , respectively, both by (III-A), it holds that
[TABLE]
Proof.
By combining (1), (29) and (3), we have
[TABLE]
Further, by using
[TABLE]
we have
[TABLE]
Since and differ from each other in only the -th training example, the same argument gives
[TABLE]
Then, (33) is obtained immediately. This completes the proof. ∎
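The replace-one notion of stability used above can be illustrated numerically. The following is a minimal sketch, not the CorrLog learner: it assumes a plain binary logistic regression with an $\ell_2$ penalty of the form (lam/2)·||w||², fitted by gradient descent, and all names (`fit_l2_logistic`, `replace_one_gap`, `textbook_bound`) are hypothetical. The bound it checks is the standard one for a strongly convex objective with Lipschitz losses, so its constant is not that of Theorem 1; only the inverse dependence on the number of training examples is meant to mirror the text.

```python
import math
import random

def fit_l2_logistic(X, y, lam, steps=400, lr=1.0):
    """Gradient descent on mean logistic loss + (lam / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin))
            coef = -yi / (1.0 + math.exp(margin)) / n
            for j in range(d):
                grad[j] += coef * xi[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def replace_one_gap(n, lam=0.1, seed=0):
    """Distance between models fitted before and after replacing one example."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0] for _ in range(n)]
    y = [1 if x[0] + x[1] + rng.gauss(0, 0.3) > 0 else -1 for x in X]
    w = fit_l2_logistic(X, y, lam)
    X2, y2 = list(X), list(y)
    X2[0] = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]  # independent draw
    y2[0] = rng.choice([-1, 1])
    w2 = fit_l2_logistic(X2, y2, lam)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w2)))

def textbook_bound(n, lam=0.1, radius=math.sqrt(3.0)):
    """2 * rho / (lam * n), with rho the bound on ||x|| (here sqrt(3))."""
    return 2.0 * radius / (lam * n)
```

Calling `replace_one_gap(n)` for increasing `n` and comparing it with `textbook_bound(n)` gives a quick sanity check that the fitted parameters move by at most a quantity shrinking like one over the sample size.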
IV-B Generalization Bound
We first define a loss function to measure the generalization error. Considering that CorrLog predicts labels by MAP estimation, we define the loss function by using the log probability
[TABLE]
where the constant and
[TABLE]
The loss function (38) is defined analogously to the loss function used in binary classification, where is replaced with the margin if a linear classifier is used. Besides, (38) gives a 0 loss only if all dimensions of are correctly predicted, which emphasizes the joint prediction in MLC. By using this loss function, the generalization error and the empirical error are given by
[TABLE]
and
[TABLE]
According to [55], an exponential bound exists for if CorrLog has a uniform stability with respect to the loss function (38). The following Theorem 2 shows this condition holds.
Theorem 2.
Given model parameters and learned on training datasets and , respectively, both by (III-A), it holds for ,
[TABLE]
Proof.
First, we have the following inequality from (38)
[TABLE]
Then, by introducing notation
[TABLE]
and rewriting
[TABLE]
we have
[TABLE]
Due to the fact that for any functions and it holds (footnote 2: Suppose and maximize and respectively, and without loss of generality , we have .)
[TABLE]
we have
[TABLE]
Then, the proof is completed by applying Theorem 1. ∎
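The elementary fact about maxima invoked in this proof can be written out in one chain of inequalities. The notation below is a placeholder of ours, since the original symbols are not recoverable here: $f$ and $g$ are real-valued functions on a common domain, $u_f$ and $u_g$ are maximizers of $f$ and $g$, and we assume without loss of generality that $f(u_f) \ge g(u_g)$.

```latex
% Placeholder notation (not the paper's):
% u_f \in \arg\max_u f(u), \quad u_g \in \arg\max_u g(u), \quad f(u_f) \ge g(u_g).
\begin{align*}
\Bigl| \max_u f(u) - \max_u g(u) \Bigr|
  &= f(u_f) - g(u_g) \\
  &\le f(u_f) - g(u_f) && \text{since } g(u_f) \le g(u_g) \\
  &\le \max_u \bigl| f(u) - g(u) \bigr| .
\end{align*}
```

The symmetric case $g(u_g) > f(u_f)$ follows by exchanging the roles of $f$ and $g$.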
Now, we are ready to present the main theorem on the generalization ability of CorrLog.
Theorem 3.
Given the model parameter learned by (III-A), with i.i.d. training data and regularization parameters , , it holds with at least probability ,
[TABLE]
Proof.
Given Theorem 2, the generalization bound (3) is a direct result of Theorem 12 in [55] (please refer to the reference for details). ∎
Remark 3. A notable observation from Theorem 3 is that the generalization bound (3) of CorrLog is independent of the number of labels. Therefore, CorrLog is preferable for MLC with a large number of labels, for which the generalization error can still be bounded with high confidence.
Remark 4. While the learning of CorrLog (III-A) utilizes the elastic net regularization , where is the weighting parameter on the $\ell_1$ regularization to encourage sparsity, the generalization bound (3) is independent of the parameter . The reason is that $\ell_1$ regularization does not lead to stable learning algorithms [56], and only the $\ell_2$ regularization in contributes to the stability of CorrLog.
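To make the role of the weighting parameter concrete, the sketch below evaluates a generic elastic net penalty. The parameterization lam * (alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2) and the names `elastic_net_penalty`, `lam` and `alpha` are assumptions for illustration; the exact weights used in (III-A) may differ.

```python
def elastic_net_penalty(theta, lam, alpha):
    """Generic elastic net: lam * (alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2).

    alpha = 1 gives a pure l1 penalty (sparsity-inducing),
    alpha = 0 gives a pure squared l2 penalty (strongly convex).
    The parameterization is an assumption, not the paper's exact form.
    """
    l1 = sum(abs(t) for t in theta)
    l2_sq = sum(t * t for t in theta)
    return lam * (alpha * l1 + (1.0 - alpha) * l2_sq)
```

In this form only the squared $\ell_2$ term is strongly convex, which is the property that stability arguments of this kind rely on.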
V Toy Example
We design a simple toy example to illustrate the capacity of CorrLog on label correlation modelling. In particular, we show that when ILRs fail drastically due to ignoring the label correlations (under-fitting), CorrLog performs well. Consider a two-label classification problem on a 2-D plane, where each instance is sampled uniformly from the unit disc and the corresponding labels are defined by
[TABLE]
where , and the augmented feature is . The function takes value or , and the operation outputs if either of its inputs is . The definition of makes the two labels correlated. We generate 1000 random examples according to the above setting and split them into training and test sets, each of which contains 500 examples. During training, we set the parameter of the elastic net regularization to 0, i.e., we actually used an $\ell_2$ regularization, because in this example the model is not sparse in terms of either feature selection or label correlation. In addition, as the number of training examples is sufficiently large for this problem, we suppose there is no over-fitting and tune the regularization parameters for both ILRs and CorrLog by minimizing the 0-1 loss on the training set.
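The data-generating procedure described above can be sketched as follows. This is an illustration under stated assumptions: labels take values in {-1, +1}, the augmented feature appends a constant 1, and the boundary weights `ETA_1` and `ETA_2` are hypothetical values chosen only so that both boundaries cross the disc. Only the overall recipe (uniform sampling from the unit disc, a linear rule for the first label, an OR rule for the second, and a 500/500 split) follows the text.

```python
import random

# Hypothetical boundary weights; the paper's actual values are not reproduced here.
ETA_1 = (1.0, 1.0, -0.5)
ETA_2 = (-1.0, 1.0, -0.5)

def sign(v):
    """Return +1 for non-negative input and -1 otherwise."""
    return 1 if v >= 0 else -1

def sample_unit_disc(rng):
    """Uniform sample from the unit disc by rejection from the square [-1, 1]^2."""
    while True:
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x1 * x1 + x2 * x2 <= 1.0:
            return x1, x2

def make_toy_example(n=1000, seed=0):
    """Generate n examples and split them into equal training and test halves."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = sample_unit_disc(rng)
        aug = (x1, x2, 1.0)  # augmented feature with a constant term
        y1 = sign(sum(e * a for e, a in zip(ETA_1, aug)))
        s2 = sign(sum(e * a for e, a in zip(ETA_2, aug)))
        # OR rule: the second label is positive whenever the first one is.
        y2 = 1 if (y1 == 1 or s2 == 1) else -1
        data.append(((x1, x2), (y1, y2)))
    return data[: n // 2], data[n // 2:]
```

Under the OR rule the pair (first label positive, second label negative) never occurs in the generated data, which is the kind of joint constraint that independently trained classifiers cannot represent.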
Figure 1 shows the true labels of the test data, the predictions of ILRs and the predictions of CorrLog, where different labels are marked by different colors. In (a), the disc is divided into three regions, , and , where the two black boundaries are specified by and , respectively. In (b), the first boundary is properly learned by ILRs, while the second one is learned wrongly. This is because the second label is highly correlated with the first label, but ILRs ignore such correlation. As a result, ILRs wrongly predict the impossible case of . The misclassification rate measured by 0-1 loss is 0.197. In contrast, CorrLog predicts correct labels for most instances, with a 0-1 loss of 0.068. Besides, it is interesting to note that the correlation between the two labels is “asymmetric”, since the first label is not affected by the second. This asymmetry contributes the most to the misclassification of CorrLog, because the previous definition implies that only symmetric correlations are modelled in CorrLog.
VI Multilabel Image Classification
In this section, we apply the proposed CorrLog to multilabel image classification. Four multilabel image datasets are used: MULAN scene (MULANscene; footnote 3: http://mulan.sourceforge.net/), MIT outdoor scene (MITscene) [38], PASCAL VOC 2007 (PASCAL07) [57] and PASCAL VOC 2012 (PASCAL12) [58]. The MULAN scene dataset contains 2047 images with 6 labels, and each image is represented by 294 features. The MIT outdoor scene dataset contains 2688 images in 8 categories. To make it suitable for multilabel experiments, we replaced each category label with several tags according to the image contents of that category (footnote 4: The 8 categories are coast, forest, highway, insidecity, mountain, opencountry, street, and tallbuildings. The 8 binary tags are building, grass, cement-road, dirt-road, mountain, sea, sky, and tree. Each category is mapped to a subset of these tags; for example, coast is tagged with sea and sky.). The PASCAL VOC 2007 dataset consists of 9963 images with 20 labels. For PASCAL VOC 2012, we use the available train-validation subset, which contains 11540 images. In addition, two kinds of features are adopted to represent the last three datasets, i.e., the PHOW (a variant of dense SIFT descriptors extracted at multiple scales) features [39] and deep CNN (convolutional neural network) features [42, 43]. Table II summarizes the basic information of the datasets. To extract PHOW features, we use the VLFeat implementation [59]. For deep CNN features, we use the ’imagenet-vgg-f’ model pretrained on the ImageNet database [43], which is available in the MatConvNet Matlab toolbox [60].
VI-A A Warming-Up Qualitative Experiment
As an extension to our previous work on CorrLog, this paper utilizes the elastic net to inherit the individual advantages of $\ell_1$ and $\ell_2$ regularization. To build up the intuition, we employ MITscene with PHOW features to visualize the difference between $\ell_2$ and elastic net regularization. Table III presents the learned CorrLog label graphs using these two types of regularization, respectively. In the label graph, the color of each edge represents the correlation strength between the two labels it connects. We have also listed representative example images, one for each category, and their binary tags for completeness.
According to the comparison, one can see that elastic net regularization results in a sparse label graph due to its $\ell_1$ component, while $\ell_2$ regularization can only lead to a fully-connected label graph. In addition, the learned label correlations in the elastic net case are more reasonable than those of $\ell_2$. For example, in the $\ell_2$ label graph, dirt-road and mountain have a weakly positive correlation (according to the link between them), though they seldom co-occur in the images of the dataset, while in the elastic net graph, their correlation is corrected to negative. Admittedly, elastic net regularization also discards some reasonable correlations, such as that between cement-road and building. This phenomenon is a direct result of the compromise between learning stability and model sparsity. We shall mention that those reasonable correlations can be maintained by decreasing , or , though more unreasonable connections will also be maintained. Thus, applying weak sparsity may impair the model performance. As a result, it is important to choose a good level of sparsity to achieve a compromise. In our experiments, CorrLog with elastic net regularization generally outperforms that with $\ell_2$ regularization, which confirms our motivation that an appropriate level of sparsity in feature selection and label correlations helps boost the performance of MLC. In the following presentation, we will use CorrLog with elastic net regularization in all experimental comparisons. To benefit subsequent research, our code is available upon request.
VI-B Quantitative Experimental Setting
In this subsection, we present further comparisons between CorrLog and other MLC methods. First, to demonstrate the effectiveness of utilizing label correlation, we compare CorrLog’s performance with that of ILRs. Moreover, four state-of-the-art MLC methods, namely instance-based learning by logistic regression (IBLR) [6], multilabel k-nearest neighbour (MLkNN) [5], classifier chains (CC) [18] and maximum margin output coding (MMOC) [21], are also employed for comparison. Note that ILRs can be regarded as the basic baseline, while the other methods represent the state of the art. In our experiments, LIBlinear [61] -regularized logistic regression is employed to build binary classifiers for ILRs. As for the other methods, we use publicly available code in MEKA (footnote 5: http://meka.sourceforge.net/) or from the authors’ homepages.
We used six different measures to evaluate the performance. These include two loss functions (Hamming loss and zero-one loss) and other popular measures (accuracy, F1 score, Macro-F1 and Micro-F1). The details of these evaluation measures can be found in [62, 15, 18, 19]. The parameters for CorrLog are fixed across all experiments as , and . On each dataset, all the methods are compared by 5-fold cross validation. The mean and standard deviation are reported for each criterion. In addition, paired t-tests at the 0.05 significance level are applied to evaluate the statistical significance of performance differences.
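The paired t-test over cross-validation folds mentioned above can be sketched with the standard library alone. The function names are hypothetical and the procedure is the textbook two-sided paired t-test, which may differ in detail from the authors' scripts; the critical value 2.776 is the Student-t quantile for 4 degrees of freedom, which is what 5 folds give.

```python
import math
import statistics

# Two-sided critical value at the 0.05 level for 4 degrees of freedom (5 folds).
T_CRIT_DF4 = 2.776

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences between two methods.

    Raises ZeroDivisionError if all per-fold differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean / (sd / math.sqrt(len(diffs)))

def differs_significantly(scores_a, scores_b, t_crit=T_CRIT_DF4):
    """True if the two methods differ at the level implied by t_crit."""
    return abs(paired_t_statistic(scores_a, scores_b)) > t_crit
```

For a different number of folds, `t_crit` would have to be replaced by the quantile for the corresponding degrees of freedom.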
VI-C Quantitative Results and Discussions
Tables IV, V, VI and VII summarize the experimental results on MULANscene, MITscene, PASCAL07 and PASCAL12 of all six algorithms evaluated by the six measures. By comparing the results of CorrLog and ILRs, we can clearly see the improvements obtained by exploiting label correlations for MLC. Except for the Hamming loss, CorrLog greatly outperforms ILRs on all datasets. In particular, the reduction of zero-one loss is significant on all four datasets with different types of features. This confirms the value of correlation modelling for joint prediction. However, it should be noted that the improvement of CorrLog over ILRs is less significant when the performance is measured by Hamming loss. This is because Hamming loss treats the prediction of each label individually.
In addition, CorrLog is more effective in exploiting label correlations than the other four state-of-the-art MLC algorithms. For the MULANscene dataset, CorrLog achieved results comparable with those of IBLR, and both of them outperformed the other methods. For the MITscene dataset, both PHOW and CNN features are very effective representations and boost the classification results. As a consequence, the performance of CorrLog and that of the four MLC algorithms are very close to each other. It is worth noting that the MMOC method is time-consuming in the training stage, though it achieved the best performance on this dataset. As for both the PASCAL07 and PASCAL12 datasets, CNN features perform significantly better than PHOW features. CorrLog obtained much better results than the competing MLC schemes, except for the Hamming loss and zero-one loss. Note that CorrLog also performs competitively with PLEM and CGM, according to the results reported in [45].
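The difference between the two loss functions discussed here is easy to see on a small example. The sketch below uses the usual definitions of Hamming loss and zero-one (subset) loss for label matrices given as lists of 0/1 rows; the function names are ours.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label entries that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

def zero_one_loss(y_true, y_pred):
    """Fraction of examples whose full label vector is not exactly right."""
    wrong = sum(tuple(row_t) != tuple(row_p)
                for row_t, row_p in zip(y_true, y_pred))
    return wrong / len(y_true)
```

A predictor that makes one label error on every example and a predictor that gets one example entirely wrong can have similar Hamming loss but very different zero-one loss, which is why gains from joint prediction show up mainly in the latter.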
VI-D Complexity Analysis and Execution Time
Table VIII summarizes the computational complexity of all MLC methods. The training cost of both CorrLog and ILRs is linear in the number of labels, while CorrLog incurs a higher testing cost than ILRs due to the iterative belief propagation algorithm. In contrast, the training complexity of CC and MMOC is polynomial in the number of labels. The two instance-based methods, MLkNN and IBLR, are relatively computationally expensive in both the training and test stages because they involve instance-based nearest-neighbour search. In particular, training MLkNN requires estimating the prior label distribution from the training data, which needs the consideration of all nearest neighbours of all training samples. Testing a given sample in MLkNN consists of finding its $k$-nearest neighbours and applying maximum a posteriori (MAP) inference. Different from MLkNN, IBLR constructs logistic regression models by adopting the labels of the $k$-nearest neighbours as features.
To evaluate the practical efficiency, Table IX presents the execution time (training and test phases) of all compared algorithms in the Matlab environment. A Linux server equipped with Intel Xeon CPU ( cores GHz) and GB memory is used for conducting all the experiments. CorrLog is implemented in Matlab, while ILRs is implemented based on LIBlinear’s mex functions. MMOC is evaluated using the authors’ Matlab code, which also builds upon LIBlinear. As for IBLR, MLkNN and CC, the MEKA Java library is called via a Matlab wrapper. Based on the comparison results, the following observations can be made:
- the execution time is largely consistent with the complexity analysis, though there may be some unavoidable computational differences between Matlab scripts, mex functions and Java code;
- CorrLog’s train phase is very efficient and its test phase is also comparable with ILRs, CC and MMOC;
- CorrLog is more efficient than IBLR and MLkNN in both train and test stages.
VII Conclusion
We have proposed a new MLC algorithm, CorrLog, and applied it to multilabel image classification. Built upon ILRs, CorrLog explicitly models the pairwise correlation between labels, and thus improves the effectiveness of MLC. Besides, by using the elastic net regularization, CorrLog is able to exploit the sparsity in both feature selection and label correlations, and thus further boost the performance of MLC. Theoretically, we have shown that the generalization error of CorrLog is upper bounded and that the bound is independent of the number of labels. This suggests the generalization bound holds with high confidence even when the number of labels is large. Evaluations on four benchmark multilabel image datasets confirm the effectiveness of CorrLog for multilabel image classification and show its competitiveness with the state of the art.
References
- [1] W. Bian and D. Tao, “Asymptotic generalization bound of Fisher’s linear discriminant analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 12, pp. 2325–2337, 2014.
- [2] T. Liu and D. Tao, “On the performance of Manhattan nonnegative matrix factorization,” IEEE Trans. Neural Netw. Learn. Syst., vol. PP, no. 99, pp. 1–1, 2016.
- [3] C. Xu, D. Tao, and C. Xu, “Multi-view intact space learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 12, pp. 2531–2544, 2015.
- [4] T. Liu, D. Tao, S. Mingli, and M. Stephen, “Algorithm-dependent generalization bounds for multi-task learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PP, no. 99, pp. 1–1, 2016.
- [5] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, 2007.
- [6] W. Cheng and E. Hüllermeier, “Combining instance-based learning and logistic regression for multilabel classification,” Mach. Learn., vol. 76, no. 2-3, pp. 211–225, 2009.
- [7] D. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” in Proc. Adv. Neural Inf. Process. Syst., vol. 22, 2009, pp. 772–780.
- [8] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Boston, MA: Springer US, 2010, pp. 667–685.
