Large-Margin Multiple Kernel Learning for Discriminative Features Selection and Representation Learning

Babak Hosseini; Barbara Hammer

arXiv:1903.03364·cs.LG·March 14, 2019

TL;DR

This paper introduces a multi-class large-margin MKL framework that enhances class separation, performs discriminative feature selection with sparsity, and achieves competitive accuracy and interpretability in real-world datasets.

Contribution

It proposes a novel multi-class MKL method with large-margin optimization and sparsity for improved class separation and feature selection, advancing beyond binary and linear assumptions.

Findings

01

Achieves competitive classification accuracy on real-world datasets.

02

Learns sparse kernel weights for interpretable feature selection.

03

Enhances local class separation in the feature space.

Tables

Table 1: Caltech-101: comparison of classification accuracies (%). Columns give the number of training samples per class ($N_{tr}$).

Method	$N_{tr}=5$	$N_{tr}=10$	$N_{tr}=15$	$N_{tr}=20$	$N_{tr}=25$	$N_{tr}=30$
$k$NN-ave	46.1	57.3	64.7	68.2	73.5	76.8
SVM-ave	49.7	59.2	64.8	69.7	74.4	77.3
DLK(2008) [5]	53.7	62.1	68.2	71.1	74.6	77.9
SimpleMKL(2008) [16]	–	53.6	–	63.4	–	76.4
Lasso-MKL(2010) [15]	–	60.1	–	70.7	–	80.7
RMKL(2012) [21]	54.7	66.4	71.3	74.3	76.8	78.8
KNMF-MKL(2015) [19]	53.5	65.2	71.5	78.6	79.8	81.1
GS-MKL(2012) [18]	–	66.2	75.1	81.5	83.7	84.3
MKL-DR(2011) [7]	58.4	68.8	74.5	77.5	79.8	81.4
DMKL(2016) [20]	59.1	69.3	75.2	81.4	83.5	83.7
MKL-TR(2014) [4]	59.8	69.4	75.8	82.3	84.1	84.6
LMMK(proposed)	57.6	68.2	76.2	84.4	86.2	88.6

Table 2: Caltech-101: normalized kernel weights ($\beta$) that LMMK assigned to each image descriptor, and $k$NN accuracy (Acc) for each base kernel.

Descriptor	Acc	$\beta$	Descriptor	Acc	$\beta$
SIFT-Dist	66.9	0.73	GB-Dist	72.3	1.00
SIFT-SPM	62.3	0	GB	67.4	0
PHOG	47.5	0	SS-Dist	64.7	0.15
C2-SWP	39.5	0	SS-SPM	62.3	0
C2-ML	57.7	0	GIST	59.3	0.31

Table 3: Comparison of classification accuracies (%) on the Pascal VOC 2007 and Oxford Flowers17 datasets.

Method	Pascal VOC	Flowers17
$k$NN-ave	55.2	81.9
SVM-ave	52.6	82.4
Can-MKL(2004) [2]	54.5	–
DLK(2008) [5]	56.3	83.5
RMKL(2012) [21]	59.3	85.9
KNMF-MKL(2015) [19]	61.1	84.6
GS-MKL(2012) [18]	62.5	–
MKL-DR(2011) [7]	62.5	85.7
DMKL(2016) [20]	64.7	88.3
MKL-TR(2014) [4]	64.2	89.5
LMMK(proposed)	69.4	93.8

Table 4: Comparison of accuracies ($Acc$) and $\|\vec{\beta}\|_{0}$ on the MTS datasets.

Method	PEMS $Acc$	PEMS $\|\vec{\beta}\|_{0}$	AUSLAN $Acc$	AUSLAN $\|\vec{\beta}\|_{0}$	UTKinect $Acc$	UTKinect $\|\vec{\beta}\|_{0}$
$k$NN-ave	75.6	963	83.1	128	83.7	60
SVM-ave	83.2	963	87.2	128	85.4	60
DLK [5]	84.1	171	87.9	79	86.3	41
RMKL [21]	84.9	690	88.7	95	88.3	55
KNMF-MKL [19]	85.7	742	88.3	101	87.5	52
MKL-DR [7]	86.4	220	89.6	65	88.7	37
DMKL [20]	88.2	64	91.3	47	90.7	28
MKL-TR [4]	88.5	81	91.1	31	91.4	25
LMMK(proposed)	91.3	75	92.1	39	95.6	20

Equations

\begin{array}[]{l}\hat{\Phi}({\vec{x}})=[\sqrt{{\beta}_{1}}\Phi_{1}^{\top}({\vec{x}}),\dots,\sqrt{{\beta}_{d}}\Phi_{d}^{\top}({\vec{x}})]^{\top},\end{array}

\hat{{\mathcal{K}}}({\vec{x}}_{i},{\vec{x}}_{j})=\sum_{m=1}^{d}{\beta}_{m}{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j}).

\begin{array}[]{ll}\vec{\beta}=\underset{\vec{\beta}\in\mathbf{S}}{\arg\min}&loss(\{\{{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})\}_{i,j=1}^{N}\}_{m=1}^{d},\vec{\beta},{\vec{h}}),\end{array}

\mathcal{D}_{\mathbf{L}}({\vec{x}}_{i},{\vec{x}}_{j})=({\vec{x}}_{i}-{\vec{x}}_{j})^{\top}\mathbf{L}^{\top}\mathbf{L}({\vec{x}}_{i}-{\vec{x}}_{j})

\begin{array}[]{ll}\underset{\mathbf{L}}{\min}&(1-\mu)\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{j})+\mu\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\underset{l\in\mathcal{I}^{k}_{i}}{\sum}\xi_{ijl}\\ \mathrm{s.t.}&\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{l})-\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{j})\geq 1-\xi_{ijl}\\ &\xi_{ijl}\geq 0,\end{array}

\hat{\Phi}({\vec{x}})(i)=\sum_{j}l_{ij}\Phi({\vec{x}})(j)\quad\text{having}\quad\hat{\Phi}({\vec{x}})=\mathbf{L}\Phi({\vec{x}}),

\begin{array}[]{ll}\underset{\vec{\beta}}{\min}&(1-\mu)\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})\\ &+\mu\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\underset{l\in\mathcal{I}^{k}_{i}}{\sum}\xi_{ijl}+\lambda\sum_{m}{\beta}_{m}\\ \mathrm{s.t.}&\mathcal{D}^{\phi}_{\vec{\beta}}(\vec{x}_{i},\vec{x}_{l})-\mathcal{D}^{\phi}_{\vec{\beta}}(\vec{x}_{i},\vec{x}_{j})\geq 1-\xi_{ijl}\\ &\xi_{ijl}\geq 0,~{}~{}{\beta}_{m}\geq 0.\end{array}

\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})=[\Phi({\vec{x}}_{i})-\Phi({\vec{x}}_{j})]^{\top}\mathbf{B}[\Phi({\vec{x}}_{i})-\Phi({\vec{x}}_{j})],

\begin{array}[]{l}\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})=\\ \sum_{m=1}^{d}{\beta}_{m}[{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{i})+{\mathcal{K}}_{m}({\vec{x}}_{j},{\vec{x}}_{j})-2{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})].\end{array}

\begin{array}[]{ll}\underset{\vec{\beta}}{\min}&(1-\mu)(\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}[1-{\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{j})])\vec{\beta}\\ &+\mu\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\underset{l\in\mathcal{I}^{k}_{i}}{\sum}\xi_{ijl}+\lambda\sum_{m=1}^{d}{\beta}_{m}\\ \mathrm{s.t.}&2[1+{\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{j})-{\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{l})]\vec{\beta}\geq 1-\xi_{ijl}\\ &\xi_{ijl}\geq 0,~{}~{}{\beta}_{m}\geq 0,\end{array}

{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})=\exp(-\mathcal{D}({\vec{x}}_{i}^{m},{\vec{x}}_{j}^{m})^{2}/\delta_{m}),

Full text

Large-Margin Multiple Kernel Learning for Discriminative Features Selection and Representation Learning

Babak Hosseini

CITEC cluster of excellence

Bielefeld University, Germany

Barbara Hammer

CITEC cluster of excellence

Bielefeld University, Germany

Preprint of the publication [1], as provided by the authors. The final publication is available at IEEE Xplore via https://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1000500

Abstract

Multiple kernel learning (MKL) algorithms combine different base kernels to obtain a more efficient representation in the feature space. Focusing on discriminative tasks, MKL has been used successfully for feature selection and finding the significant modalities of the data. In such applications, each base kernel represents one dimension of the data or is derived from one specific descriptor. Therefore, MKL finds an optimal weighting scheme for the given kernels to increase the classification accuracy. Nevertheless, the majority of the works in this area focus on only binary classification problems or aim for linear separation of the classes in the kernel space, which are not realistic assumptions for many real-world problems. In this paper, we propose a novel multi-class MKL framework which improves the state-of-the-art by enhancing the local separation of the classes in the feature space. Besides, by using a sparsity term, our large-margin multiple kernel algorithm (LMMK) performs discriminative feature selection by aiming to employ a small subset of the base kernels. Based on our empirical evaluations on different real-world datasets, LMMK provides a competitive classification accuracy compared with the state-of-the-art algorithms in MKL. Additionally, it learns a sparse set of non-zero kernel weights which leads to a more interpretable feature selection and representation learning.

Keywords: Multiple Kernel Learning, Feature Selection, Representation Learning, LMNN.

1 Introduction

Multiple kernel learning (MKL) algorithms utilize different data representations in the feature space (base kernels) to obtain an optimal representation from their combination [2]. We can generally formulate an MKL problem as the minimization of a loss term defined in the Reproducing Kernel Hilbert Space (RKHS). This cost function usually reflects how well separated the data classes are in the RKHS according to a given classification task [3]. Depending on the definition of the problem, MKL can be seen as either finding the best parameter values for a specific type of kernel function [4, 5, 6, 3] or learning a weighting vector associated with the pre-computed base kernels [7, 8, 9, 10, 11].

In image processing problems, it is a common practice to derive specific representations by utilizing different types of image descriptors. Therefore, an MKL algorithm can learn which descriptors provide more discriminative representations of the data classes [7, 8]. Analogously, by computing each base kernel from one specific dimension of the data, MKL can perform discriminative feature selection by assigning larger weights to the most discriminative dimensions of the data [8, 11, 12, 9]. In practice, any MKL algorithm can also be considered as a multiple kernel feature selection method (MK-FS) provided that it can take pre-computed kernel representations as the inputs.

A significant, well-studied group of MKL methods is applicable only to binary classification problems [13, 14, 9, 15, 16, 8]. These algorithms are generally constructed to improve the performance of the Support Vector Machine (SVM) as a binary classifier. It is possible to apply these binary MKL methods to multi-class problems by defining an ensemble of binary classification tasks and training an individual MKL model for each of them [17, 8, 18]. However, such a strategy results in several kernel combination schemes learned from the individual binary classifiers and generally does not lead to a single, unified feature embedding.

On the other hand, some recent works have tried to extend MKL to multi-class problems by defining seamless optimization schemes that consider all the classes together [5, 7, 4, 19, 20, 21]. As a common characteristic, these algorithms try to learn the optimal kernel weights independently of the structure of the classifier that is applied afterward. Inspired by the Fisher Linear Discriminant Analysis (LDA) [22], algorithms similar to DKL [5], MKL-DR [7] and MKL-TR [4] focus on reducing the intra-class covariances by using the scatter matrices of the data in different RKHSs. In particular, the MKL-DR and MKL-TR methods employ low-dimensional projections, while the latter also applies a convex combination of the base kernels. As a different approach, the RMKL method [21] performs singular value decomposition to find the base kernels which lead to maximum variation in the space spanned by them. It is claimed that this decomposition finds a more discriminative kernel combination than the original RKHS. Similarly, KNMF-MKL [19] was proposed by reformulating the RMKL approach using the non-negative matrix factorization (NMF) framework [23].

To emphasize the noteworthy shortcomings of the existing MKL algorithms, we distinguish them into two general categories:

First, algorithms similar to [8, 9, 16, 24, 14] focus on learning a multiple kernel mapping to a target RKHS in which a classifier can linearly separate the different classes from each other. This objective coincides with the basic principle of the kernelized SVM’s structure [25] which is the linear separation of the classes in the feature space. Nevertheless, obtaining such an ideal representation is usually not affordable for real-world data, or it demands considerable domain knowledge for the specific design of such efficient kernels. This category generally includes binary MKL algorithms.

Second, another group of MKL methods includes algorithms such as [13, 5, 7, 20] which follow methodologies analogous to the kernelized LDA’s design scheme [26]. They generally try to obtain a multiple kernel representation in a way that the class distributions in RKHS would be generally condensed. This strategy is effective, especially for multi-class problems. Nevertheless, as a common observation in real data, some classes consist of sub-clusters which are located on different regions of the space, but are yet well separated from other classes (e.g., having an XOR distribution in the feature space). In such cases, it is generally difficult to find a target RKHS in which the classes are globally condensed, especially without doing any feature engineering [27]. This shortcoming is fundamentally problematic for the classifiers which rely on linear separability of the classes (e.g., SVM).

When each base kernel is derived from a different source of information in the data, it is highly likely that substantial redundancy exists between these representations [10]. Therefore, it is desirable to reduce this redundancy in favor of the model's interpretation and its discriminability. Works such as SimpleMKL [16] and class-specific MKL [28] impose sparsity on the weights of the base kernels by using a convex combination in the MKL problem. As an improvement, Group Lasso-MKL fused the MKL problem with the $l_{p}$-norm based on the group Lasso optimization [29] to better enforce sparsity [15]. In comparison, SparseRMKL [17] benefits from an $l_{1}$-norm constraint in its optimization framework, which provides a better classification performance as well as an enhanced interpretation by specifying the most discriminative contributions among the set of the base kernels.

1.1 Motivation and Contributions

Metric learning is the idea of finding an appropriate distance metric which transforms the data into a new space in which the data distribution provides a smoother labeling than in the original space [30, 31, 32]. Based on practical evidence, performing metric learning can notably enhance the accuracy of distance-based classifiers (e.g., $k$NN) on the test data even by applying a linear mapping on the input space [32, 33]. One of the successful distance metric learning algorithms is the Large-Margin Nearest Neighbor (LMNN), which increases the margin between the data instances of different classes [34]. In contrast to the global separation of the classes via a hyperplane in SVM, LMNN learns a distance metric which improves the local separation of the classes in small neighborhoods of the space. According to [34], the metric learned by LMNN can improve the $k$NN classification accuracy even in comparison with the kernelized SVM. Therefore, we expect that employing metric learning in an MKL framework could result in an RKHS in which the discriminative performance of $k$NN can outperform other MKL models.

Contributions: In this work, we introduce the metric learning concept to the MKL problem by optimizing a diagonal Mahalanobis metric in the feature space. Our proposed large-margin multiple kernel algorithm (LMMK) improves the local separation of the classes in the resulting RKHS, in which it imposes a large margin between data vectors from different classes. The specific formulation of LMMK converts the above metric learning problem into finding an optimal combination of the given base kernels in an MKL framework. It is a multi-class MKL method which results in an efficient data representation for the $k$NN classifier in the feature space. Furthermore, by employing a sparsity term in the convex optimization framework of LMMK, it behaves as an effective MK-FS algorithm. More precisely, it selects a small subset of essential features to enhance the described local class-separation objective.

2 Preliminaries

2.1 Multiple Kernel Learning

The training set $\{({\vec{x}}_{i},h_{i})\}_{i=1}^{N}$ includes $N$ data samples ${\vec{x}}_{i}\in\mathbb{R}^{n}$, where $h_{i}\in\{1,2,\dots,c\}$ denotes the corresponding label of ${\vec{x}}_{i}$ in a $c$-class setting. Implicitly, we can assume that $d$ non-linear mapping functions $\{\Phi_{m}:\mathbb{R}\rightarrow\mathbb{R}^{f_{m}}\}_{m=1}^{d}$ exist which map ${\vec{x}}$ into individual RKHSs [2, 35]. Therefore, we can obtain a scaling of the feature space based on the following weighted concatenation:

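$$\hat{\Phi}({\vec{x}})=[\sqrt{{\beta}_{1}}\Phi_{1}^{\top}({\vec{x}}),\dots,\sqrt{{\beta}_{d}}\Phi_{d}^{\top}({\vec{x}})]^{\top}, \tag{1}$$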

where $\hat{\Phi}({\vec{x}})$ is the implicit mapping to the resulting RKHS, and $\vec{\beta}$ is the combination vector. Due to the finiteness of the training samples ${\vec{x}}_{i}$, the target of each implicit mapping $\Phi_{m}$ is assumed to be a finite-dimensional Hilbert space, which validates the concatenation of the embeddings in Eq. (1). By relating each $\Phi_{m}({\vec{x}})$ to a kernel function ${\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})=\Phi_{m}^{\top}({\vec{x}}_{i})\Phi_{m}({\vec{x}}_{j})$, we can compute the weighted kernel function $\hat{{\mathcal{K}}}({\vec{x}}_{i},{\vec{x}}_{j})$ corresponding to $\hat{\Phi}({\vec{x}})$ as the additive combination [8]

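$$\hat{{\mathcal{K}}}({\vec{x}}_{i},{\vec{x}}_{j})=\sum_{m=1}^{d}{\beta}_{m}{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j}). \tag{2}$$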

Generally, one can formulate the MKL frameworks as variants of the following optimization

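$$\begin{array}{ll}\vec{\beta}=\underset{\vec{\beta}\in\mathbf{S}}{\arg\min}&loss(\{\{{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})\}_{i,j=1}^{N}\}_{m=1}^{d},\vec{\beta},{\vec{h}}),\end{array} \tag{3}$$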

in which the $loss$ term is a cost function whose minimization reflects the given classification task and which is also defined by considering the classifier's model. The set $\mathbf{S}$ defines the constraints imposed on $\vec{\beta}$ by the MKL algorithm.

If we apply each kernel function ${\mathcal{K}}_{m}$ only on the $m$th dimension of the training data (resulting in $n$ feature-kernels), we can assume each corresponding $\Phi_{m}$ in Eq. (1) maps the $m$th dimension of the data into one individual RKHS. In that case, each solution for Eq. (3) represents a weighted feature selection obtained by the MKL algorithm based on the defined discriminative function $loss$ and the constraints in $\mathbf{S}$. It is practical to apply a non-negativity constraint on each ${\beta}_{m}$ to make the resulting kernel weights interpretable as the relative importance of each feature representation to the given discriminative task [3]. Furthermore, including sparsity terms in Eq. (3) can decrease the redundancy in the above importance profile [10, 16, 28, 15, 17]. For instance, if the individual RKHSs are correlated, their corresponding entries in $\vec{\beta}$ are preferred to be considerably sparse.
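The additive combination of Eq. (2), together with the non-negativity constraint on the weights, can be illustrated with a minimal NumPy sketch. The function name `combine_kernels` and the toy linear feature-kernels are assumptions made for illustration only; they are not taken from the paper or its code.

```python
import numpy as np

def combine_kernels(base_kernels, beta):
    """Weighted sum of base kernel matrices, in the spirit of Eq. (2).

    base_kernels: array-like of shape (d, N, N), one Gram matrix per base kernel.
    beta:         array-like of shape (d,), non-negative kernel weights.
    """
    base_kernels = np.asarray(base_kernels, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights must be non-negative")
    # Contract the weight vector with the first axis of the kernel stack.
    return np.tensordot(beta, base_kernels, axes=1)

# Toy example (hypothetical data): one linear feature-kernel per input dimension.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
kernels = [np.outer(X[:, m], X[:, m]) for m in range(X.shape[1])]
K_hat = combine_kernels(kernels, beta=[0.5, 0.0])  # second feature is switched off
```

A zero entry in `beta` removes the corresponding feature-kernel from the combined Gram matrix, which is the sense in which a sparse weight vector acts as feature selection.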

2.2 Large-Margin Nearest Neighbor

The LMNN algorithm learns the Mahalanobis distance metric

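$$\mathcal{D}_{\mathbf{L}}({\vec{x}}_{i},{\vec{x}}_{j})=({\vec{x}}_{i}-{\vec{x}}_{j})^{\top}\mathbf{L}^{\top}\mathbf{L}({\vec{x}}_{i}-{\vec{x}}_{j}) \tag{4}$$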

by finding the linear mapping matrix $\mathbf{L}\in\mathbb{R}^{n\times n}$ [34]. For each ${\vec{x}}_{i}$, LMNN tries to map it closer to the data samples belonging to the class $h_{i}$ (targets), while pushing it away from the data points with labels other than $h_{i}$ (impostors) (Figure 1). To that aim, LMNN uses the following convex optimization:

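$$\begin{array}{ll}\underset{\mathbf{L}}{\min}&(1-\mu)\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{j})+\mu\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\underset{l\in\mathcal{I}^{k}_{i}}{\sum}\xi_{ijl}\\ \mathrm{s.t.}&\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{l})-\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{j})\geq 1-\xi_{ijl}\\ &\xi_{ijl}\geq 0,\end{array} \tag{5}$$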

in which $\mathcal{N}^{k}_{i}$ and $\mathcal{I}^{k}_{i}$ contain the indices of the $k$-nearest targets and impostors of $\vec{x}_{i}$ respectively. The scalar $\mu\in[0,1]$ makes a trade-off between the pulling (first) and pushing (second) parts of the objective in Eq. (5). Additionally, each non-negative slack variable $\xi_{ijl}$ is related to a triple $({\vec{x}}_{i},{\vec{x}}_{j},{\vec{x}}_{l})$, in which ${\vec{x}}_{j}$ and ${\vec{x}}_{l}$ are respectively a target for ${\vec{x}}_{i}$ and an impostor which is located between ${\vec{x}}_{i}$ and ${\vec{x}}_{j}$ (similar to Figure 1-left). The scalars $\xi_{ijl}$ model the costs induced by the existing impostors.
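To make the pull and push terms of Eq. (5) concrete, the following sketch evaluates the objective for a fixed mapping $\mathbf{L}$, with each slack set to its smallest feasible value $\xi_{ijl}=\max(0,\,1+\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{j})-\mathcal{D}_{\mathbf{L}}(\vec{x}_{i},\vec{x}_{l}))$. It only evaluates the cost and does not optimize $\mathbf{L}$. The function name `lmnn_loss` and the choice of selecting targets and impostors by Euclidean distance in the input space are assumptions of this illustration.

```python
import numpy as np

def lmnn_loss(X, y, L, k=1, mu=0.5):
    """Value of the Eq. (5) objective for a fixed L (slacks at their hinge values)."""
    n = len(X)
    Z = X @ L.T                                              # mapped samples L x_i
    D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)       # D_L for all pairs
    D_in = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # input-space distances
    idx = np.arange(n)
    pull, push = 0.0, 0.0
    for i in idx:
        same = idx[(y == y[i]) & (idx != i)]
        diff = idx[y != y[i]]
        targets = same[np.argsort(D_in[i, same])[:k]]        # k nearest same-class points
        impostors = diff[np.argsort(D_in[i, diff])[:k]]      # k nearest other-class points
        for j in targets:
            pull += D[i, j]
            for l in impostors:
                # smallest slack satisfying D(i,l) - D(i,j) >= 1 - xi
                push += max(0.0, 1.0 + D[i, j] - D[i, l])
    return (1 - mu) * pull + mu * push

# Toy 1-D example (hypothetical data): two well-separated classes.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
loss_identity = lmnn_loss(X, y, np.eye(1))        # margins satisfied, only pull cost
loss_shrunk = lmnn_loss(X, y, 0.1 * np.eye(1))    # margins violated, push cost appears
```

With the identity mapping every impostor is already farther than the unit margin, so only the pull term contributes; shrinking the space lowers the pull term but activates the slack penalties.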

3 Large-Margin Multiple Kernel Learning

We apply the metric learning concept to the data distribution in the feature space, such that it results in having dense neighborhoods of classes in which the different classes can be locally separated. Assuming that the dimensions of the feature space are related to individual RKHSs as in Eq. (1), we employ metric learning to find the effective $\vec{\beta}$ that serves the above purpose. However, direct application of Eq. (5) in the feature space has the following limitations:

First, by applying the Mahalanobis metric of Eq. (4) to the feature space, the dimensions of the resulting $\hat{\Phi}({\vec{x}})$ lose their interpretability. Denoting $\Phi({\vec{x}})$ as the non-weighted concatenation of the base kernels in Eq. (1) (setting ${\beta}_{m}=1~\forall m$), we have

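$$\hat{\Phi}({\vec{x}})(i)=\sum_{j}l_{ij}\Phi({\vec{x}})(j)\quad\text{having}\quad\hat{\Phi}({\vec{x}})=\mathbf{L}\Phi({\vec{x}}), \tag{6}$$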

in which $\Phi({\vec{x}})(i)$ and $\hat{\Phi}({\vec{x}})(i)$ denote the $i$th dimension of $\Phi({\vec{x}})$ and $\hat{\Phi}({\vec{x}})$ respectively in the feature space, and $l_{ij}$ indicates the entry in the $i$th row and $j$th column of $\mathbf{L}$. Consequently, each dimension of $\hat{\Phi}({\vec{x}})$ in the resulting RKHS loses its physical interpretation, as it is a weighted combination of the dimensions of the original RKHS.

Second, computing Eq. (4) in the feature space (as in Eq. (6)) requires explicit access to the dimensions of each $\Phi_{i}({\vec{x}})$ in the feature space. This requirement cannot be directly fulfilled as it is contrary to our assumption about the implicit definition of $\Phi_{i}({\vec{x}})$.

To overcome the above issues, we propose the following optimization scheme:

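$$\begin{array}{ll}\underset{\vec{\beta}}{\min}&(1-\mu)\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})\\ &+\mu\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\underset{l\in\mathcal{I}^{k}_{i}}{\sum}\xi_{ijl}+\lambda\sum_{m}{\beta}_{m}\\ \mathrm{s.t.}&\mathcal{D}^{\phi}_{\vec{\beta}}(\vec{x}_{i},\vec{x}_{l})-\mathcal{D}^{\phi}_{\vec{\beta}}(\vec{x}_{i},\vec{x}_{j})\geq 1-\xi_{ijl}\\ &\xi_{ijl}\geq 0,~~{\beta}_{m}\geq 0.\end{array} \tag{7}$$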

In Eq. (7), the distance metric $\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})$ is defined in the feature space as:

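$$\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})=[\Phi({\vec{x}}_{i})-\Phi({\vec{x}}_{j})]^{\top}\mathbf{B}[\Phi({\vec{x}}_{i})-\Phi({\vec{x}}_{j})], \tag{8}$$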

where $\mathbf{B}$ is a diagonal matrix formed from the entries of $\vec{\beta}$. Eq. (8) defines a Mahalanobis metric in the feature space with a diagonal covariance matrix $\mathbf{B}$. Therefore, we name $\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})$ a diagonal metric. Consequently, each learned ${\beta}_{m}$ in Eq. (7) acts as a selection weight for the $m$th representation of the data in the original RKHS to locally discriminate the classes in the feature space (similar to Figure 1). Additionally, the last objective term in this optimization problem applies an $l_{1}$-regularization to enforce the selection of the feature-kernels $\Phi_{m}({\vec{x}})$ that are most relevant to the defined discriminative objective. Therefore, our LMMK framework in Eq. (7) is an MKL optimization problem which is designed for discriminative feature selection and representation learning.

3.1 Optimization

Based on Eq. (2), the pairwise distance between each pair $({\vec{x}}_{i},{\vec{x}}_{j})$ in the feature space is computed as

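$$\begin{array}{l}\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})=\\ \sum_{m=1}^{d}{\beta}_{m}[{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{i})+{\mathcal{K}}_{m}({\vec{x}}_{j},{\vec{x}}_{j})-2{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})].\end{array} \tag{9}$$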
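One way to see how the diagonal metric of Eq. (8) reduces to kernel evaluations is the short expansion below. It rests on an assumption that is not spelled out in the text: that $\mathbf{B}$ repeats each ${\beta}_{m}$ along the $f_{m}$ dimensions of $\Phi_{m}$, i.e. $\mathbf{B}=\mathrm{diag}({\beta}_{1}\mathbf{I}_{f_{1}},\dots,{\beta}_{d}\mathbf{I}_{f_{d}})$.

$$\begin{aligned}\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})&=[\Phi({\vec{x}}_{i})-\Phi({\vec{x}}_{j})]^{\top}\mathbf{B}[\Phi({\vec{x}}_{i})-\Phi({\vec{x}}_{j})]\\&=\sum_{m=1}^{d}{\beta}_{m}\|\Phi_{m}({\vec{x}}_{i})-\Phi_{m}({\vec{x}}_{j})\|_{2}^{2}\\&=\sum_{m=1}^{d}{\beta}_{m}[\Phi_{m}^{\top}({\vec{x}}_{i})\Phi_{m}({\vec{x}}_{i})+\Phi_{m}^{\top}({\vec{x}}_{j})\Phi_{m}({\vec{x}}_{j})-2\Phi_{m}^{\top}({\vec{x}}_{i})\Phi_{m}({\vec{x}}_{j})]\\&=\sum_{m=1}^{d}{\beta}_{m}[{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{i})+{\mathcal{K}}_{m}({\vec{x}}_{j},{\vec{x}}_{j})-2{\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})].\end{aligned}$$

The last step uses only the definition ${\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})=\Phi_{m}^{\top}({\vec{x}}_{i})\Phi_{m}({\vec{x}}_{j})$ from Section 2.1.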

Hence, we can compute $\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{x}}_{i},{\vec{x}}_{j})$ without performing any explicit calculation in the feature space in contrast to Eq. (4). In addition, by normalizing the kernel matrices of the training set, we have ${\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{i})=1$ for all the input vectors and base kernels. Therefore, after eliminating the constant terms, the optimization problem of Eq. (7) is simplified to

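$$\begin{array}{ll}\underset{\vec{\beta}}{\min}&(1-\mu)(\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}[1-{\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{j})])\vec{\beta}\\ &+\mu\underset{i,j\in\mathcal{N}^{k}_{i}}{\sum}\underset{l\in\mathcal{I}^{k}_{i}}{\sum}\xi_{ijl}+\lambda\sum_{m=1}^{d}{\beta}_{m}\\ \mathrm{s.t.}&2[1+{\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{j})-{\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{l})]\vec{\beta}\geq 1-\xi_{ijl}\\ &\xi_{ijl}\geq 0,~~{\beta}_{m}\geq 0,\end{array} \tag{10}$$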

where ${\mathcal{K}}_{(:)}({\vec{x}}_{i},{\vec{x}}_{j})=[{\mathcal{K}}_{1}({\vec{x}}_{i},{\vec{x}}_{j}),\dots,{\mathcal{K}}_{d}({\vec{x}}_{i},{\vec{x}}_{j})]\in\mathbb{R}^{d}$. This optimization framework is a convex problem given the advance selection of the targets and impostors, which are indexed by $\mathcal{N}^{k}_{i}$ and $\mathcal{I}^{k}_{i}$ respectively. Hence, it is an instance of non-negative linear programming (LP), and we can efficiently optimize it using solvers such as YALMIP [36] or CVX [37]. Additionally, similar to a practical hint from [34], we repeat the optimization loop for a few iterations while updating $\mathcal{N}^{k}_{i}$ and $\mathcal{I}^{k}_{i}$ at the end of each run. These few extra repetitions can lead to better solutions. For the efficient implementation of Eq. (10), the code of the LMMK algorithm is made accessible via a public online repository (https://github.com/bab-git/LMMK).
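As an illustration of how such an LP can be set up, the sketch below solves the formulation of Eq. (7) with the kernel-based distance of Eq. (9), using SciPy's `linprog` rather than the YALMIP or CVX toolboxes named above. It is not the authors' implementation: the function names, the use of uniform weights to pick the initial targets and impostors, and the toy data are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linprog

def kernel_sq_dists(K):
    """Per-kernel bracket of Eq. (9): K_m(i,i) + K_m(j,j) - 2 K_m(i,j). K has shape (d, N, N)."""
    diag = np.einsum("mii->mi", K)
    return diag[:, :, None] + diag[:, None, :] - 2.0 * K

def neighbour_triples(dist, y, k):
    """Triples (i, j, l): j among the k nearest targets of i, l among its k nearest impostors."""
    idx = np.arange(len(y))
    triples = []
    for i in idx:
        same = idx[(y == y[i]) & (idx != i)]
        diff = idx[y != y[i]]
        targets = same[np.argsort(dist[i, same])[:k]]
        impostors = diff[np.argsort(dist[i, diff])[:k]]
        triples += [(i, j, l) for j in targets for l in impostors]
    return triples

def fit_lmmk_lp(K, y, k=1, mu=0.5, lam=0.1, beta0=None):
    """One LP solve for the kernel weights; the variables are [beta (d), xi (one per triple)]."""
    K = np.asarray(K, dtype=float)
    d = K.shape[0]
    delta = kernel_sq_dists(K)
    # Assumption: neighbours are selected with uniform weights unless beta0 is given.
    beta0 = np.full(d, 1.0 / d) if beta0 is None else np.asarray(beta0, dtype=float)
    triples = neighbour_triples(np.tensordot(beta0, delta, axes=1), y, k)
    pairs = sorted({(i, j) for i, j, _ in triples})
    T = len(triples)
    pull = sum((delta[:, i, j] for i, j in pairs), np.zeros(d))
    cost = np.concatenate([(1 - mu) * pull + lam, np.full(T, mu)])
    # D(i,l) - D(i,j) >= 1 - xi   <=>   -(delta_il - delta_ij) . beta - xi <= -1
    A = np.zeros((T, d + T))
    for t, (i, j, l) in enumerate(triples):
        A[t, :d] = -(delta[:, i, l] - delta[:, i, j])
        A[t, d + t] = -1.0
    res = linprog(cost, A_ub=A, b_ub=-np.ones(T), bounds=(0, None), method="highs")
    return res.x[:d], res

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = np.array([[0.0, 0.0], [0.1, 1.0], [0.2, 2.0],
              [3.0, 0.0], [3.1, 1.0], [3.2, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
K = np.stack([np.exp(-(X[:, [m]] - X[:, [m]].T) ** 2) for m in range(2)])
beta, res = fit_lmmk_lp(K, y, k=1, mu=0.5, lam=0.1)
```

Passing the returned `beta` back in as `beta0` and solving again would mimic the repeated runs with updated $\mathcal{N}^{k}_{i}$ and $\mathcal{I}^{k}_{i}$ described above; on this toy data the uninformative second feature is expected to receive zero weight.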

3.2 Classification of Test Data

We classify each test data sample ${\vec{z}}$ by using the $k$NN algorithm based on the distances in the resulting RKHS. To that aim, we compute $\mathcal{D}^{\phi}_{\vec{\beta}}({\vec{z}},{\vec{x}}_{i})$ as the distance between ${\vec{z}}$ and each training sample using the learned diagonal matrix $\mathbf{B}$ in the feature space, analogous to Eq. (9).
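A minimal sketch of this test-time step, assuming the base-kernel values between test and training samples are already available, could look as follows. The function name `knn_predict`, its argument layout, and the majority vote via `np.bincount` (which requires non-negative integer labels) are choices made for this illustration.

```python
import numpy as np

def knn_predict(K_test_train, K_test_diag, K_train_diag, beta, y_train, k=3):
    """kNN labels from weighted kernel distances, analogous to Eq. (9).

    K_test_train: (d, N_test, N_train) values K_m(z, x_i)
    K_test_diag:  (d, N_test)          values K_m(z, z)
    K_train_diag: (d, N_train)         values K_m(x_i, x_i)
    beta:         (d,)                 learned kernel weights
    y_train:      (N_train,)           non-negative integer labels
    """
    beta = np.asarray(beta, dtype=float)
    per_kernel = (K_test_diag[:, :, None] + K_train_diag[:, None, :]
                  - 2.0 * K_test_train)
    dist = np.tensordot(beta, per_kernel, axes=1)      # (N_test, N_train)
    nearest = np.argsort(dist, axis=1)[:, :k]          # indices of the k closest samples
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy example (hypothetical data) with one linear kernel per feature.
X_train = np.array([[0.0, 5.0], [0.1, -5.0], [3.0, 5.0], [3.1, -5.0]])
y_train = np.array([1, 1, 2, 2])
Z = np.array([[0.05, 5.0], [2.9, -5.0]])
K_tt = np.stack([np.outer(Z[:, m], X_train[:, m]) for m in range(2)])
K_zz = np.stack([Z[:, m] ** 2 for m in range(2)])
K_xx = np.stack([X_train[:, m] ** 2 for m in range(2)])
pred = knn_predict(K_tt, K_zz, K_xx, beta=[1.0, 0.0], y_train=y_train, k=2)
```

Because the second weight is zero, the large differences in the second feature do not influence the neighbour search in this example.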

3.3 Complexity and Convergence of LMMK

The optimization framework of Eq. (10) is an LP problem, and consequently, it converges to an optimal solution in a finite number of steps $t$. On the other hand, an LP solver optimizes $\vec{\beta}$ with the computational complexity of $\mathcal{O}(t(2d+3N_{l})+dN_{j}+2dN_{l})$, in which $N_{l}$ and $N_{j}$ are the total number of targets and the size of $\vec{\xi}$ respectively. Based on the definition of the targets and impostors, we have $N_{l}\approx\frac{N^{2}(c-1)}{c}$ and $N_{j}=kN$. In addition, for common real-world datasets we observe $N\gg t$ in practice; hence, the total time complexity of the algorithm is approximately $\mathcal{O}(N^{2})$. This complexity is comparable to that of computing the base kernel matrices for each dataset before running the algorithm.

4 Experiments

In this section, we apply our proposed LMMK algorithm to different real-world datasets and evaluate its performance by carrying out empirical comparisons to alternative MKL algorithms. To that aim, we consider two different scenarios for our experiments:

1. Representation learning, in which we compute the base kernels upon different types of image descriptors on each dataset. Hence, the results of MKL frameworks are interpreted as the most discriminative descriptors they select for each dataset.

2. Feature selection, where each base kernel is computed using one specific dimension of the data ($d=n$), and MKL methods are expected to assign larger weights to the more discriminative features of the data.

In both scenarios, all the base kernels are computed using the Gaussian kernel function

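$${\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})=\exp(-\mathcal{D}({\vec{x}}_{i}^{m},{\vec{x}}_{j}^{m})^{2}/\delta_{m}), \tag{11}$$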

in which $\mathcal{D}({\vec{x}}^{m}_{i},{\vec{x}}^{m}_{j})$ indicates the pairwise distance between $({\vec{x}}_{i},{\vec{x}}_{j})$ based on the $m$th representation of the input data, and $\delta_{m}$ denotes the average of $\mathcal{D}({\vec{x}}_{i}^{m},{\vec{x}}_{j}^{m})$ for all data samples.
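The construction of the Gaussian base kernels can be sketched as below. The Euclidean choice of $\mathcal{D}$, the averaging of $\delta_{m}$ over all pairs including $i=j$, and the helper names are assumptions of this illustration; the sketch also assumes the samples are not all identical, so that $\delta_{m}>0$.

```python
import numpy as np

def gaussian_base_kernel(rep):
    """Gaussian kernel over one representation of the data; rep has shape (N, p)."""
    diff = rep[:, None, :] - rep[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))     # pairwise Euclidean distances
    delta = D.mean()                     # assumption: average over all pairs, i == j included
    return np.exp(-(D ** 2) / delta)

def feature_kernels(X):
    """Feature-selection scenario (d = n): one base kernel per input dimension."""
    return np.stack([gaussian_base_kernel(X[:, [m]]) for m in range(X.shape[1])])

# Toy data (hypothetical): 4 samples with 3 features each.
X = np.array([[0.0, 1.0, 2.0],
              [1.0, 1.5, 0.0],
              [2.0, 0.5, 1.0],
              [3.0, 2.0, 4.0]])
K = feature_kernels(X)                   # shape (3, 4, 4)
```

Each resulting kernel matrix has a unit diagonal, which matches the normalization ${\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{i})=1$ used in Section 3.1.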

4.1 Datasets

According to the discussed implementation scenarios, we choose two different types of datasets: 1) Image datasets for representation learning, 2) Multidimensional time-series (MTS) for discriminative feature selection.

Regarding image datasets, we make the following selection:

•

Caltech-101 [38] is a collection of 101 object categories which includes 40 to 800 images per class of object. The high inter-class variations within this dataset make it a challenging image classification benchmark. For our experiments, we choose 6 different training subsets with the sizes of 5, 10, 15, 20, 25, and 30 images per class, and a testing subset of 15 images per category.

•

Pascal VOC 2007 [39] is a dataset consisting of 20 different classes of objects and is related to a classification challenge. Out of 9,963 images, we employ $50\%$ of the samples for training and the rest for testing as provided in [39].

•

Oxford Flowers17 [40] is a collection of images related to 17 different species of flowers and is composed of 80 images per category. The large intra-class variations for some flower species cause substantial overlap between instances in this dataset. As a common practice in the literature [40], we select 40 pre-defined images per class for training and preserve the rest for testing.

To evaluate LMMK’s performance for the discriminative feature selection scenario, we select the following real-world MTS datasets:

•

The PEMS dataset [41] consists of the daily traffic information related to San Francisco Bay freeways, and the classification task is to determine the correct day of the week related to each data sequence. It has 963 dimensions and 60 sequences for each of the 7 classes.

•

AUSLAN is an MTS dataset from the UCI repository [42] containing 95 classes of Australian Sign Language signs. It includes 2565 samples of 128-dimensional MTS sequences.

•

UTKinect is a dataset of human action recognition [43] including 60-dimension Kinect-based skeleton sequences related to 10 different actions, where each class contains 20 MTS sequences.

4.2 Baseline Algorithms

To have a proper evaluation of our proposed method, we compare LMMK with the following major MKL algorithms: MKL-TR [4], MKL-DR [7], DMKL [20], KNMF-MKL [19], and RMKL [21]. These algorithms are designed for multi-class MKL problems; hence, we can inspect their results from feature selection and representation learning perspectives. Also, as the baseline classifiers, we implement multi-class SVM [44] and $k$NN using the average of the base kernels, resulting in SVM-ave and $k$NN-ave respectively.

Note: Although there exist various deep learning classifiers or object detection methods specially designed for image datasets, they do not fit the multiple kernel scope of our comparisons. Nevertheless, as a suggested extended experimental setting, one can use those methods as rich feature extraction techniques to obtain more discriminative base kernels for the MKL methods.

4.3 Experimental Setup

We evaluate the performance of the selected MKL algorithms based on classification accuracy $Acc=\frac{\#\text{correct predictions}}{\#\text{all data samples}}$, averaged over 10 random repetitions for each dataset. The hyper-parameters of LMMK ( $k,\mu,\lambda$ ) are tuned by cross-validation (CV) on the training set. Based on practical evidence (Sec. 4.6), choosing $0.5\leq\mu\leq 0.7$ and tuning $1\leq k\leq 5$ usually leads to satisfactory performance. Furthermore, we advise the reader to tune ( $\mu,k$ ) first and to find the optimal sparsity weight ( $\lambda$ ) afterward, which significantly reduces the parameter search space. Likewise, we tune the hyper-parameters of the baselines by CV on the training set.
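
The two-stage tuning strategy can be sketched as follows. The callable `cv_score` is a hypothetical placeholder that would return the mean cross-validated accuracy of LMMK for a given ( $\mu,k,\lambda$ ); fixing $\lambda=0$ during the first stage is an assumption of this sketch, since the text does not state the value used.

```python
from itertools import product

def two_stage_search(cv_score, mus, ks, lambdas, lambda_init=0.0):
    """Tune (mu, k) first, then lambda (illustrative sketch).

    cv_score(mu, k, lam) is a user-supplied function returning a CV accuracy.
    lambda_init is an assumed starting value for the sparsity weight.
    """
    # Stage 1: grid over (mu, k) with the sparsity weight held fixed.
    best_mu, best_k = max(product(mus, ks),
                          key=lambda p: cv_score(p[0], p[1], lambda_init))
    # Stage 2: one-dimensional search over lambda with (mu, k) fixed.
    best_lam = max(lambdas, key=lambda lam: cv_score(best_mu, best_k, lam))
    return best_mu, best_k, best_lam

# Toy stand-in for a CV score, peaked at mu=0.6, k=3, lambda=2.
def toy_score(mu, k, lam):
    return -(mu - 0.6) ** 2 - (k - 3) ** 2 - (lam - 2) ** 2

best = two_stage_search(toy_score, [0.5, 0.6, 0.7], range(1, 6), [0, 1, 2, 4])
```

With $|\mu|\cdot|k|+|\lambda|$ evaluations instead of $|\mu|\cdot|k|\cdot|\lambda|$, the search is cheaper than a full grid, at the cost of possibly missing interactions between $\lambda$ and the other two parameters.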

4.4 Representation Learning

We perform our representation learning experiments on the selected image datasets, for which the base kernels are computed upon a set of image descriptors. To that aim, the distance $\mathcal{D}({\vec{x}}_{i}^{m},{\vec{x}}_{j}^{m})$ in Eq. (11) is computed as the Euclidean distance between $({\vec{x}}_{i},{\vec{x}}_{j})$ after applying the $m$ th descriptor to the data.
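
As an illustration, one base kernel per descriptor could be built as in the sketch below. It assumes a Gaussian kernel of the form $\exp(-\mathcal{D}^{2}/\delta_{m}^{2})$ with $\delta_{m}$ set to the mean pairwise distance; both the functional form and the bandwidth rule are assumptions of this sketch and should be checked against Eq. (11). The function names are hypothetical.

```python
import numpy as np

def gaussian_base_kernel(F, bandwidth=None):
    """Gaussian kernel from Euclidean distances between descriptor vectors.

    F: (N, d) array, one row per image after applying one descriptor.
    bandwidth: kernel width; defaults to the mean pairwise distance
    (an assumed heuristic, not necessarily the paper's choice).
    """
    F = np.asarray(F, dtype=float)
    sq = np.sum(F ** 2, axis=1)
    # Squared Euclidean distances; clip tiny negatives from round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (F @ F.T), 0.0)
    if bandwidth is None:
        bandwidth = max(np.sqrt(d2).mean(), 1e-12)
    return np.exp(-d2 / bandwidth ** 2)

def base_kernels(descriptor_features):
    """Stack one kernel per descriptor: list of (N, d_m) -> (M, N, N)."""
    return np.stack([gaussian_base_kernel(F) for F in descriptor_features])

# Toy usage: 3 hypothetical descriptors of different dimensionality, 6 images.
rng = np.random.default_rng(0)
kernels = base_kernels([rng.normal(size=(6, d)) for d in (4, 8, 16)])
```

A per-descriptor bandwidth keeps kernels from descriptors with very different distance scales comparable before they are weighted.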

4.4.1 Caltech-101

For the Caltech-101 dataset, we adopt the following 10 image descriptors with the specifications explained in [7]: SIFT-Dist [45], SIFT-SPM [46], PHOG [47], C2-SWP [48], C2-ML [49], GB-Dist [50], GB, SS-Dist/SS-SPM [51], and GIST [52]. Table 1 reports the accuracies of the MKL methods for the Caltech-101 dataset. In addition to the multi-class MKL methods, we also include the published accuracy rates of some binary MKL techniques for this dataset, such as SimpleMKL [16], Lasso-MKL [15], and GS-MKL [18]. Based on the results, the LMMK algorithm outperforms the baselines in the majority of the experiments. Its accuracy is $4\%$ higher than that of the best baseline (MKL-TR) when 30 training samples are used per class ( $N_{tr}=30$ ). Table 1 shows that the focus of LMMK on the local separation of the classes is effective against the large intra-class variations in the Caltech-101 dataset. However, LMMK’s performance becomes comparable to or slightly lower than that of the best methods when only a few training samples are available per class. In those cases, the neighborhood distributions no longer coincide with the class labeling, which is not a proper training condition for algorithms that rely on $k$ NN predictions.

Table 2 shows the normalized kernel weights assigned to each descriptor after applying LMMK to the Caltech-101 dataset ( $N_{tr}=30$ ). It also includes the $k$ NN accuracies obtained with each base kernel individually, which approximately reveal the weak and strong descriptors for this dataset. According to this table, LMMK generally assigns larger weights to the more discriminative descriptors (e.g., GB-Dist and SIFT-Dist). Additionally, its sparsity term eliminates the weak kernels (e.g., C2-SWP and PHOG) and reduces possible discriminative redundancies among the strong descriptors (e.g., GB and SIFT-SPM). However, our MKL algorithm still keeps the GIST descriptor despite its mediocre individual quality. We therefore conclude that this descriptor provides an effective complement to the other selected base kernels concerning the local separation of the classes in the RKHS.

4.4.2 Pascal VOC 2007

As the descriptors for the Pascal VOC 2007 dataset, we employ PHOG [47], DCSIFT/DSIFT [46], SS-Dist [51], and a texture feature (Gabor feature [53]). Table 3 compares the classification accuracies on this dataset and also includes the published results of two binary MKL algorithms, Canonical MKL [2] and GS-MKL [18]. For the Pascal dataset, the LMMK algorithm outperforms the MKL baselines. This difference shows that the classes can be better discriminated locally than with the global discrimination strategies used in other MKL methods. More precisely, LMMK shows a $2.7\%$ and a $14.6\%$ increase in accuracy compared to the best baseline (DMKL) and the $k$ NN-ave classifier, respectively.

4.4.3 Oxford Flowers17

We apply the following 6 descriptors to the Oxford Flowers17 dataset: DCSIFT [46], texture feature [53], SS-Dist [51], HOG [54], SIFT-Dist [45], and HSV color histogram. Based on the results reported in Table 3, the SVM-ave and $k$ NN-ave classifiers achieve similar performances using the original RKHS, while LMMK boosts the $k$ NN performance to $93.8\%$, a margin of $4.3\%$ over the best baseline (MKL-TR). This observation implies that the intra-class variations are much smaller in the RKHS resulting from LMMK than in the original RKHS.

4.5 Feature Selection

In our second experimental scenario, we perform discriminative feature selection for MTS datasets using the selected MKL algorithms. To that end, each Gaussian feature-kernel ${\mathcal{K}}_{m}({\vec{x}}_{i},{\vec{x}}_{j})$ is computed by applying the global alignment kernel [55] to the $m$ th dimension of the input.
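
To make the per-dimension construction concrete, the following sketch computes one kernel matrix for each dimension of a set of equal-length MTS samples. It uses a basic global alignment recursion (the sum, over all monotone alignments, of products of local Gaussian similarities) followed by cosine-style normalization. The variant used in [55] may differ in its local kernel and normalization, so this is a simplified stand-in under those assumptions; the function names are hypothetical.

```python
import numpy as np

def global_alignment_kernel(x, y, sigma=1.0):
    """Basic global alignment kernel between two 1-D sequences (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Local Gaussian similarity between every pair of time steps.
    local = np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * sigma ** 2))
    n, m = local.shape
    dp = np.zeros((n + 1, m + 1))
    dp[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Accumulate all alignments ending at (i, j) via the three moves.
            dp[i, j] = local[i - 1, j - 1] * (
                dp[i - 1, j] + dp[i, j - 1] + dp[i - 1, j - 1])
    return dp[n, m]

def per_dimension_kernels(X, sigma=1.0):
    """X: (N, T, M) array of N sequences -> (M, N, N) normalized kernels."""
    X = np.asarray(X, dtype=float)
    n_samples, _, n_dims = X.shape
    kernels = np.empty((n_dims, n_samples, n_samples))
    for d in range(n_dims):
        K = np.array([[global_alignment_kernel(X[i, :, d], X[j, :, d], sigma)
                       for j in range(n_samples)] for i in range(n_samples)])
        scale = np.sqrt(np.diag(K))
        kernels[d] = K / np.outer(scale, scale)   # unit self-similarity
    return kernels
```

The plain recursion costs $O(T^{2})$ per pair and can underflow or overflow for long sequences; practical implementations usually work in the log domain.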

Note: There exist state-of-the-art algorithms specifically designed for the classification of MTS. They generally perform temporal segmentations or frame-based analysis of the data samples. Therefore, these algorithms do not belong to the intended multiple kernel scope of our experiments.

In order to evaluate the feature selection performance of the selected baselines, besides the classification accuracy ( $Acc$ ), we also measure the number of selected features of the data (base kernels) via $\|\vec{\beta}\|_{0}$ . Consequently, a large $Acc$ along with a small $\|\vec{\beta}\|_{0}$ describes an ideal discriminative feature selection, in which the classes can be distinguished with high accuracy while using only a few selected features.
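
Both evaluation measures are simple to compute, as in the sketch below. The tolerance `tol` used to decide whether a weight counts as non-zero is an assumption of this sketch (solvers rarely return exact zeros); the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Acc = (# correct predictions) / (# all data samples)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def num_selected(beta, tol=1e-8):
    """||beta||_0: number of kernel weights whose magnitude exceeds tol."""
    return sum(abs(b) > tol for b in beta)

# Toy usage with made-up labels and weights.
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
n_features = num_selected([0.5, 0.0, 1e-12, 0.5])
```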

Table 4 contains the results of the MKL algorithms on the selected MTS benchmarks. The LMMK algorithm outperforms the other MKL baselines regarding classification accuracy. It leads to a $4.2\%$ increase in $Acc$ for the UTKinect dataset, while this margin is $1\%$ for the AUSLAN dataset. This observation shows that the local class-separation strategy is more effective against the data distribution in the first dataset. LMMK also significantly increases the performance of the $k$ NN method, especially for the PEMS dataset, on which $k$ NN-ave has a relatively low accuracy due to the large number of features (963); there, the LMMK optimization leads to a $15.7\%$ increase in the performance of $k$ NN. Among the other baselines, DMKL and MKL-TR alternately take the second position in classification accuracy, which shows that the discriminative effect of the low-rank model in MKL-TR may vary depending on the given dataset.

Regarding the feature selection performance, the value of $\|\vec{\beta}\|_{0}$ ranks LMMK among the low-feature group of methods (DMKL, MKL-TR, LMMK), which is due to the direct application of an $l_{1}$ -norm sparsity term in the optimization scheme of Eq. (7). In comparison, DMKL and MKL-TR obtained smaller values of $\|\vec{\beta}\|_{0}$ on the PEMS and AUSLAN datasets, respectively, but they showed lower $Acc$ in return. Therefore, we can claim that LMMK achieves more discriminative feature selections even in these cases. To explain the feature selection results of the other baselines, DMKL and MKL-TR use a convex combination constraint on $\vec{\beta}$ , which directly enforces sparsity, while MKL-DR and DKL have quadratic constraints on the kernel weights, which apply a weaker restriction on the number of non-zero kernel weights. On the other hand, KNMF-MKL and RMKL do not have any constraint in their optimization frameworks related to the sparseness of the selected features.

4.6 Effect of Hyper-parameters

In this section, we study the effect of the parameters ( $\lambda,k,\mu$ ) on the performance of LMMK. As shown in Figure 2, we perform three experiments on the Flowers17 and Pascal datasets; in each of them, we study the algorithm’s performance by changing one of the above parameters while fixing the other two.

First, we vary $\lambda$ in the range $[0, 14]$ as in Figure 2-a. Based on the observations, we conclude that increasing the value of $\lambda$ leads to a stronger sparsity force in Eq. (7) and consequently results in a smaller set of selected features for both datasets. Figure 2-b shows that moderate increases in $\lambda$ can improve the classification accuracies, but large values of $\lambda$ damage the discriminative property of the resulting RKHS. Note that the points $\lambda=0$ in Figure 2-a and Figure 2-b correspond to the performance of LMMK$_{\lambda=0}$, which is the LMMK algorithm without the sparsity term in Eq. (7). Based on the figures, LMMK$_{\lambda=0}$ has accuracies of $88.7\%$ and $63.9\%$ for the Flowers17 and Pascal datasets, respectively, which are comparable to the performances of DMKL and MKL-TR (the best baselines in Table 3). This evidence supports our claim regarding the effectiveness of focusing on the local discrimination of the classes in the feature space, even without the sparsity objective. Additionally, a comparison between LMMK$_{\lambda=0}$ and sparse LMMK reveals the notable benefit of the $l_{1}$ -norm sparsity term for both feature selection and classification accuracy.

Figure 2-c demonstrates the effect of the trade-off between the first two objective terms in Eq. (7). For the Pascal dataset, a balance between the pulling and pushing terms (with $0.5\leq\mu\leq 0.6$ ) leads to the highest accuracy. For Flowers17, however, pushing the impostors away plays a more significant role in the local discrimination of the classes (see $0.5\leq\mu\leq 1$ ). Based on experimental observations such as these, tuning $\mu$ around $0.5$ generally results in good performance.

Based on the classification accuracy curves of Figure 2-d, the best choice for the value of $k$ depends on the distribution of the classes; nevertheless, selecting large values for this parameter (e.g., $k\geq 10$ ) is expected to reduce $Acc$ . The explanation is that as the neighborhood size ( $k$ ) increases, the neighborhoods can no longer preserve their local property.

5 Conclusion

In this work, we proposed a new multiple kernel algorithm to perform discriminative MKL for multi-class problems. Our LMMK algorithm focuses on improving the local separation of the classes in the feature space. To that end, we applied metric learning to the feature space by defining a diagonal multiple kernel metric in the RKHS. LMMK finds an efficient weighted combination of the base kernels using an LP optimization framework. Furthermore, we employed an $l_{1}$ -norm sparsity term in the formulation of LMMK to enforce compactness in choosing the discriminative base kernels. We applied our algorithm to real-world multi-class benchmarks of images and multidimensional time-series. The evaluation results show that LMMK outperforms other MKL algorithms regarding representation learning and discriminative feature selection.

Acknowledgement

This research was supported by the Cluster of Excellence Cognitive Interaction Technology ’CITEC’ (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG).

Bibliography


[1] B. Hosseini and B. Hammer. Large-margin multiple kernel learning for discriminative features selection and representation learning. In 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019.
[2] Francis R Bach, Gert RG Lanckriet, and Michael I Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML’04, 2004.
[3] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
[4] Wenhao Jiang and Fu-lai Chung. A trace ratio maximization approach to multiple kernel-based dimensionality reduction. Neural Networks, 49:96–106, 2014.
[5] Jieping Ye, Shuiwang Ji, and Jianhui Chen. Multi-class discriminant kernel learning via convex programming. Journal of Machine Learning Research, 9(Apr):719–758, 2008.
[6] Saeid Niazmardi, Abdolreza Safari, and Saeid Homayouni. A novel multiple kernel learning framework for multiple feature classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens, 10:3734–3743, 2017.
[7] Yen-Yu Lin, Tyng-Luh Liu, and Chiou-Shann Fuh. Multiple kernel learning for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6):1147–1160, 2011.
[8] Aroor Dinesh Dileep and C Chandra Sekhar. Representation and feature selection using multiple kernel learning. In IJCNN 2009, pages 717–722. IEEE, 2009.