Reduced-Rank Local Distance Metric Learning for k-NN Classification

Yinjie Huang; Cong Li; Michael Georgiopoulos; Georgios C. Anagnostopoulos

arXiv:1902.08313·cs.LG·February 25, 2019



Open Access

TL;DR

This paper introduces a novel local distance metric learning method that uses sample similarity and conical combinations of metric matrices, improving k-NN classification performance.

Contribution

It presents a reduced-rank local metric learning approach with both transductive and inductive algorithms, enhancing efficiency and effectiveness over existing methods.

Findings

01. Notable performance improvements over recent metric learning methods.

02. Effective in small and large-scale classification tasks.

03. Demonstrates the advantage of local metrics in k-NN classification.


Tables (4)

Table 1. Classification accuracy versus $\lambda$ for a highly overlapping dataset.

$λ$	$0$	$0.1$	$1$	$10$	$100$	$1000$
Accuracy	$0.544$	$0.528$	$0.553$	$0.563$	$0.563$	$0.534$

Table 2. Classification accuracy versus $\lambda$ for a highly overlapping dataset with sparse features. The number of columns with all $0$ entries for each metric is also reported.

$λ$	$0$	$0.1$	$1$	$10$	$10^{2}$	$10^{3}$	$10^{4}$	$10^{5}$
Accuracy	$0.725$	$0.813$	$0.747$	$0.713$	$0.725$	$0.975$	$0.916$	$0.488$
# of zero columns in Metric $1$	$0$	$0$	$0$	$0$	$0$	$13$	$13$	$0$
# of zero columns in Metric $2$	$0$	$0$	$0$	$0$	$0$	$13$	$13$	$60$
# of zero columns in Metric $3$	$0$	$0$	$0$	$0$	$0$	$13$	$14$	$60$

Table 3. Details of benchmark datasets. The columns indicate the number of features (#D), the number of classes (#classes), and the number of training (#train), validation (#validation) and test (#test) samples.

	#D	#classes	#train	#validation	#test
A. Robot	$4$	$4$	$240$	$240$	$4976$
B. Letter	$16$	$26$	$520$	$2600$	$2600$
C. Pendigits	$16$	$10$	$400$	$2000$	$2000$
D. Wine Quality	$12$	$2$	$150$	$150$	$2898$
E. Telescope	$10$	$2$	$300$	$300$	$5400$
F. Image Segmentation	$18$	$7$	$210$	$210$	$1890$
G. Two Norm	$20$	$2$	$250$	$250$	$3900$
H. Ring Norm	$20$	$2$	$250$	$250$	$3900$
I. Ionosphere	$34$	$2$	$80$	$50$	$221$
J. Breast Tissue	$9$	$6$	$18$	$18$	$70$
K. COIL20	$30$	$20$	$400$	$400$	$640$
L. Glass	$9$	$6$	$18$	$18$	$178$
M. Heart	$13$	$2$	$40$	$40$	$190$
N. Isolet	$30$	$26$	$520$	$3640$	$3637$
O. Optdigits	$32$	$10$	$400$	$2400$	$2820$
P. Sonar	$60$	$2$	$40$	$40$	$180$
Q. USPS	$30$	$10$	$400$	$4500$	$4398$
R. WPBC	$33$	$2$	$20$	$20$	$158$

Table 4. Percent accuracy results of $8$ algorithms on $18$ benchmark datasets. For each dataset, the statistically best and comparable results for a family-wise significance level of $0.05$ are highlighted in boldface. All algorithms are ranked from best to worst; algorithms share the same rank if their performance is statistically comparable.

	Euclidean	ITML	LMNN	LMNN-MM	GLML	PLML	T-R²LML	E-R²LML
A	${65.31}^{2 n d}$	${65.86}^{2 n d}$	${66.10}^{2 n d}$	${66.10}^{2 n d}$	${62.28}^{3 r d}$	${61.03}^{3 r d}$	${58.72}^{4 t h}$	${74.16}^{1 s t}$
B	${51.42}^{3 r d}$	${63.92}^{1 s t}$	${64.73}^{1 s t}$	${64.73}^{1 s t}$	${57.15}^{2 n d}$	${64.62}^{1 s t}$	${57.19}^{2 n d}$	${66.96}^{1 s t}$
C	${93.15}^{2 n d}$	${92.80}^{2 n d}$	${93.55}^{2 n d}$	${93.70}^{2 n d}$	${93.10}^{2 n d}$	${95.55}^{1 s t}$	${93.10}^{2 n d}$	${94.75}^{1 s t}$
D	${87.65}^{4 t h}$	${91.44}^{3 r d}$	${90.13}^{3 r d}$	${90.44}^{3 r d}$	${91.30}^{3 r d}$	${97.48}^{1 s t}$	${95.03}^{2 n d}$	${96.86}^{1 s t}$
E	${70.02}^{3 r d}$	${71.04}^{2 n d}$	${70.04}^{2 n d}$	${66.80}^{2 n d}$	${70.00}^{3 r d}$	${77.44}^{1 s t}$	${76.89}^{1 s t}$	${77.61}^{1 s t}$
F	${80.05}^{4 t h}$	${90.21}^{2 n d}$	${90.74}^{2 n d}$	${89.42}^{2 n d}$	${87.30}^{3 r d}$	${90.48}^{2 n d}$	${90.16}^{2 n d}$	${92.59}^{1 s t}$
G	${96.51}^{2 n d}$	${96.82}^{1 s t}$	${96.31}^{2 n d}$	${96.28}^{2 n d}$	${96.49}^{2 n d}$	${97.49}^{1 s t}$	${97.51}^{1 s t}$	${97.15}^{1 s t}$
H	${55.95}^{5 t h}$	${73.72}^{3 r d}$	${59.28}^{4 t h}$	${59.28}^{4 t h}$	${97.28}^{1 s t}$	${75.44}^{3 r d}$	${80.39}^{2 n d}$	${73.51}^{3 r d}$
I	${75.57}^{3 r d}$	${86.43}^{1 s t}$	${82.35}^{2 n d}$	${82.35}^{2 n d}$	${71.95}^{3 r d}$	${78.73}^{3 r d}$	${91.86}^{1 s t}$	${90.50}^{1 s t}$
J	${37.14}^{4 t h}$	${44.29}^{3 r d}$	${55.71}^{1 s t}$	${47.14}^{3 r d}$	${40.00}^{4 t h}$	${50.00}^{3 r d}$	${54.29}^{1 s t}$	${58.57}^{1 s t}$
K	${85.94}^{4 t h}$	${89.70}^{2 n d}$	${88.13}^{3 r d}$	${89.53}^{2 n d}$	${87.34}^{3 r d}$	${82.81}^{5 t h}$	${88.91}^{2 n d}$	${91.56}^{1 s t}$
L	${10.67}^{4 t h}$	${26.40}^{2 n d}$	${15.73}^{3 r d}$	${15.73}^{3 r d}$	${11.80}^{4 t h}$	${26.97}^{2 n d}$	${32.58}^{1 s t}$	${33.34}^{1 s t}$
M	${56.84}^{5 t h}$	${79.47}^{2 n d}$	${77.89}^{2 n d}$	${74.21}^{3 r d}$	${62.11}^{4 t h}$	${78.95}^{2 n d}$	${81.05}^{1 s t}$	${81.05}^{1 s t}$
N	${71.19}^{2 n d}$	${74.10}^{1 s t}$	${76.08}^{1 s t}$	${75.78}^{1 s t}$	${70.91}^{2 n d}$	${70.25}^{2 n d}$	${70.66}^{2 n d}$	${72.12}^{2 n d}$
O	${89.79}^{2 n d}$	${89.33}^{2 n d}$	${93.40}^{1 s t}$	${93.40}^{1 s t}$	${89.61}^{2 n d}$	${88.30}^{2 n d}$	${91.52}^{1 s t}$	${92.16}^{1 s t}$
P	${44.53}^{4 t h}$	${44.53}^{4 t h}$	${51.36}^{2 n d}$	${51.36}^{2 n d}$	${39.06}^{6 t h}$	${42.97}^{5 t h}$	${55.47}^{1 s t}$	${48.44}^{3 r d}$
Q	${88.09}^{3 r d}$	${90.79}^{1 s t}$	${89.22}^{2 n d}$	${89.43}^{3 r d}$	${88.45}^{3 r d}$	${90.95}^{1 s t}$	${89.90}^{2 n d}$	${90.79}^{1 s t}$
R	${36.08}^{4 t h}$	${44.94}^{3 r d}$	${39.24}^{3 r d}$	${32.91}^{4 t h}$	${53.17}^{2 n d}$	${41.77}^{3 r d}$	${67.72}^{1 s t}$	${55.06}^{2 n d}$

Equations (140)

$\underset{\boldsymbol{A}\succeq 0}{\min}$

$\text{s.t.}$

$\underset{\boldsymbol{L}^{k},\,\boldsymbol{S},\,\boldsymbol{g}^{k}\in\Omega^{\prime}_{g},\,\xi^{k}_{mn}\geq 0}{\min}$

$+\,C\sum_{k}\sum_{m,n}(1-s_{mn})\,\xi^{k}_{mn}+\lambda\sum_{k}\mathrm{rank}(\boldsymbol{L}^{k})$

$\text{s.t.}$

$m,n\in\mathbb{N}_{N+M},\ k\in\mathbb{N}_{K}$

$s_{mn}\in\{0,1\},\ m,n\in\mathbb{N}_{M}$

$s_{mm}=1,\ s_{mn}=s_{nm},\ m,n\in\mathbb{N}_{M}$

$\sum_{n\in\mathbb{N}_{N+M}}s_{mn}\geq 2,\ m\in\mathbb{N}_{M},$

$\underset{\boldsymbol{L}^{k},\,\boldsymbol{S},\,\boldsymbol{g}^{k}\in\Omega^{\prime}_{g}}{\min}$

$+\,C(1-s_{mn})\left[1-\left\|\boldsymbol{L}^{k}\Delta\boldsymbol{x}_{mn}\right\|_{2}^{2}\right]_{+}$

$+\,\lambda\sum_{k}\left\|\boldsymbol{L}^{k}\right\|_{*}$

$\text{s.t.}$

$s_{mm}=1,\ s_{mn}=s_{nm},\ m,n\in\mathbb{N}_{M}$

$\sum_{n\in\mathbb{N}_{N+M}}s_{mn}\geq 2,\ m\in\mathbb{N}_{M},$

$\underset{\boldsymbol{L}^{k},\,\boldsymbol{g}^{k}\in\Omega_{g}}{\min}$

$+\,C(1-s_{mn})\left[1-\left\|\boldsymbol{L}^{k}\Delta\boldsymbol{x}_{mn}\right\|_{2}^{2}\right]_{+}+\lambda\sum_{k}\left\|\boldsymbol{L}^{k}\right\|_{*}.$

$\bar{s}^{k}_{mn}\triangleq s_{mn}\left\|\boldsymbol{L}^{k}\Delta\boldsymbol{x}_{mn}\right\|_{2}^{2},\ m,n\in\mathbb{N}_{N}.$

$\boldsymbol{\tilde{S}}\triangleq\begin{bmatrix}\boldsymbol{\bar{S}}^{1}&\boldsymbol{0}&\cdots&\boldsymbol{0}\\ \boldsymbol{0}&\boldsymbol{\bar{S}}^{2}&\cdots&\boldsymbol{0}\\ \vdots&\vdots&\ddots&\vdots\\ \boldsymbol{0}&\boldsymbol{0}&\cdots&\boldsymbol{\bar{S}}^{K}\end{bmatrix}\in\mathbb{R}^{KN\times KN}.$

$\underset{\boldsymbol{g}\in\Omega_{g}}{\min}$

$\underset{\boldsymbol{g}\in\Omega_{g}}{\min}$

$\underset{\boldsymbol{g}}{\min}$

$\text{s.t.}$

$g_{i}^{*}=\frac{1}{c}\left[(\boldsymbol{B}^{T}\boldsymbol{\alpha})_{i}-d_{i}\right]_{+},\ i\in\mathbb{N}_{KN},$

$L(\boldsymbol{g},\boldsymbol{\alpha},\boldsymbol{\beta})=\frac{c}{2}\boldsymbol{g}^{T}\boldsymbol{g}+\boldsymbol{d}^{T}\boldsymbol{g}+\boldsymbol{\alpha}^{T}(\boldsymbol{1}-\boldsymbol{B}\boldsymbol{g})-\boldsymbol{\beta}^{T}\boldsymbol{g},$

$g_{i}=\frac{1}{c}\left((\boldsymbol{B}^{T}\boldsymbol{\alpha})_{i}+\beta_{i}-d_{i}\right),\ i\in\mathbb{N}_{KN}.$

$\psi_{mn}$

$\underset{\boldsymbol{S}}{\min}$

$\text{s.t.}$

$s_{mm}=1,\ s_{mn}=s_{nm},\ m,n\in\mathbb{N}_{M}$

$\sum_{n\in\mathbb{N}_{N+M}}s_{mn}\geq 2,\ m\in\mathbb{N}_{M}.$

$\left\|\partial f(\boldsymbol{w})\right\|^{2}\leq Af(\boldsymbol{w})+G^{2},\quad\left\|\partial r(\boldsymbol{w})\right\|^{2}\leq Ar(\boldsymbol{w})+G^{2},$

$\underset{t\in\mathbb{N}_{T}}{\min}$

$\leq\frac{2\sqrt{2}DG}{\sqrt{T}\left(1-\frac{cAD}{G\sqrt{8T}}\right)}+\frac{f(\boldsymbol{w}^{*})+r(\boldsymbol{w}^{*})}{1-\frac{cAD}{G\sqrt{8T}}}.$

$\boldsymbol{w}_{t+\frac{1}{2}}$

$\boldsymbol{w}_{t+1}$

$\boldsymbol{0}\in\partial\left\{\frac{1}{2}\left\|\boldsymbol{w}-\boldsymbol{w}_{t+\frac{1}{2}}\right\|^{2}+\eta r(\boldsymbol{w})\right\}\Bigg|_{\boldsymbol{w}=\boldsymbol{w}_{t+1}}.$

$\boldsymbol{0}\in\boldsymbol{w}_{t+1}-\boldsymbol{w}_{t}+\eta\boldsymbol{g}_{t}^{f}+\eta\,\partial r(\boldsymbol{w}_{t+1}).$

$\boldsymbol{0}=\boldsymbol{w}_{t+1}-\boldsymbol{w}_{t}+\eta\boldsymbol{g}_{t}^{f}+\eta\boldsymbol{g}_{t+1}^{r}.$

$\boldsymbol{w}_{t+1}=\boldsymbol{w}_{t}-\eta\boldsymbol{g}_{t}^{f}-\eta\boldsymbol{g}_{t+1}^{r}.$


Taxonomy

Topics: Face and Expression Recognition · Human Mobility and Location-Based Analysis · Text and Document Classification Technologies

Full text

Y. Huang, C. Li and M. Georgiopoulos were with the Department of Electrical Engineering & Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, Florida, 32816, USA. G. C. Anagnostopoulos was with the Department of Electrical and Computer Engineering, Florida Institute of Technology, 150 W University Blvd, Melbourne, Florida, 32901, USA.

Reduced-Rank Local Distance Metric Learning for k-NN Classification

Yinjie Huang

Cong Li

Michael Georgiopoulos

Georgios C. Anagnostopoulos


Abstract

We propose a new method for local distance metric learning based on sample similarity as side information. These local metrics, which utilize conical combinations of metric weight matrices, are learned from the pooled spatial characteristics of the data, as well as the similarity profiles between the pairs of samples, whose distances are measured. The main objective of our framework is to yield metrics, such that the resulting distances between similar samples are small and distances between dissimilar samples are above a certain threshold. For learning and inference purposes, we describe a transductive, as well as an inductive algorithm; the former approach naturally befits our framework, while the latter one is provided in the interest of faster learning. Experimental results on a collection of classification problems imply that the new methods may exhibit notable performance advantages over alternative metric learning approaches that have recently appeared in the literature (a preliminary version of the work presented here has appeared in Huang et al (2013)).

I Introduction

Distance computations underlie many machine learning approaches, with the $k$-nearest neighbor (KNN) decision rule for classification and the $k$-Means algorithm for clustering being the two most prominent examples. Such computations are often, if not mainly, performed using the ordinary Euclidean metric or a weighted variation of it, namely the Mahalanobis distance. However, employing fixed, global metrics, such as the ones just mentioned, may not yield good results in all settings. This fact has motivated many researchers to pursue data-driven approaches in order to infer the best metric for a given problem (e.g. Xing et al (2002) and Shalev-Shwartz et al (2004)). To address this task successfully, one needs to take into account the data’s distributional characteristics and to take advantage of any side information that may be available for the data. In general, such approaches are referred to as metric learning. A typical instance of such an approach is to learn the weight matrix of the Mahalanobis metric, which we will occasionally refer to simply as the metric. Equivalently, this task can be viewed as follows: a de-correlating linear transformation of the data is learned in the native space, and Euclidean distances are computed in the range space of the learned linear transform (feature space). When dealing with a classification problem, a KNN algorithm based on the learned metric is eventually employed to label samples.

Our work falls under the metric learning approaches for classification tasks, where the Mahalanobis metric is learned with the help of pair-wise sample similarities. By assumption, two samples are similar if they share the same class label. The goal of similarity-based metric learning is to map similar samples close together and dissimilar samples far apart in the feature space. After learning this metric, an eventual application of a KNN decision rule is expected to exhibit improved performance over a direct application of the same rule using the Euclidean metric.

Many metric learning algorithms have been proposed and show significant improvements over the Euclidean KNN rule. For example, in Xing et al (2002), the authors posed similarity-based metric learning as a convex optimization problem, which is employed in a clustering problem. A projected gradient ascent algorithm is utilized to solve the problem. Shalev-Shwartz et al (2004) described an online algorithm for supervised learning of metrics. Their algorithm is based on successive projections onto the positive semi-definite cone. They also offered a dual version of the algorithm, which is able to incorporate kernel operators. Moreover, Neighborhood Components Analysis (NCA) (Goldberger et al, 2004) maximizes the leave-one-out performance on the training data based on stochastic nearest neighbors. Its classification model is non-parametric, making no assumptions about the shape of the class distributions. Chopra et al (2005) built a system that maps images to points in a lower-dimensional space, so that these points lie closer if the original images are similar. This model consists of two convolutional neural networks to address geometric distortions. Furthermore, Large Margin Nearest Neighbor (LMNN) (Weinberger et al, 2006) learns the metric so that the $k$-nearest neighbors of each sample belong to the same class, while others are separated by a large margin. The optimization is cast as an instance of semi-definite programming. Finally, Davis et al (2007) formulated the problem using information entropy and introduced Information Theoretic Metric Learning (ITML). ITML minimizes the differential relative entropy between two multivariate Gaussian distributions subject to distance metric constraints.

The previous metric learning approaches share one common feature: they employ a single, global metric, i.e., a metric that is used for all distance computations. However, this global metric learning approach may not be well-suited to some multi-modal or non-linear scenarios. Figure 1 illustrates this point via a toy dataset containing $4$ samples from two classes. Note that this toy problem is merely a conceptual device for comparing the effect of a global metric with that of local metrics. Figure 1(a) shows the samples in their native space. Figure 1(b) shows the feature space resulting from learning a global metric, while Figure 1(c) shows the transformed data after learning two metrics that take into account the location and similarity characteristics of the data involved. We refer to such metrics as local metrics. In contrast to the result obtained using a global metric, local metrics can map similar samples closer to each other, as shown in Figure 1(c). This may potentially improve $1$-NN classification performance, when compared to the sample distributions in the other two cases.

Many local metric learning algorithms have been proposed. In Hastie and Tibshirani (1996), local metrics are determined from centroid information. The neighborhoods are shrunk in directions that are orthogonal to the local decision boundaries and elongated in directions parallel to the boundaries. In Bilenko et al (2004), the authors introduced a clustering framework, in which a local metric is defined for each cluster. Yang et al (2006) proposed a local metric learning model that generates distance metrics to accommodate multiple modes for each class. Moreover, an Expectation-Maximization-like algorithm is employed to solve their probabilistic framework. In Weinberger and Saul (2008), the authors of LMNN developed the LMNN-Multiple Metric (LMNN-MM) approach. When applied in a classification context, the number of metrics equals the number of classes. Additionally, Noh et al (2010) proposed Generative Local Metric Learning (GLML), which learns local metrics through NN classification error minimization. GLML assumes that the data has been drawn from a Gaussian mixture, which is a rather strong assumption. Subsequently, Wang et al (2012) proposed Parametric Local Metric Learning (PLML), in which each local metric is defined in relation to an anchor point of the instance space. In order to solve their local metric problem, they employ a projected gradient method to optimize their large-margin objective. Finally, the model of Zhu et al (2014) learns multiple distance metrics under different scales of the data and combines the decisions from these learned metrics. They formulated the local metric learning problem as a Support Vector Machine (SVM) model.

In this paper, we propose a new local metric learning approach, which we will refer to as Reduced-Rank Local Metric Learning (R2LML). As elaborated in Section II, in our approach, the local Mahalanobis metric (specifically, its weight matrix) is modeled as a conical combination of positive semi-definite weight matrices. With the assistance of pair-wise similarities, both the weight matrices and their coefficients are learned from the data. The weight matrices themselves correspond to local linear transformations of the original data from their native space into a locality-dependent feature space. These transformations are learned such that similar (dissimilar) samples map close to (far from) each other, so that they exhibit small (large) pair-wise Euclidean distances in these locally-defined feature spaces. Note that, in our case, we will consider samples to be similar if they share the same label. Moreover, we will consider two variants of R2LML. The first one, namely Transductive Reduced-Rank Local Metric Learning (T-R2LML), uses transductive learning (Vapnik, 1998) to infer the test sample coefficients necessary for defining the local metrics. The second one, which is referred to as Efficient Reduced-Rank Local Metric Learning (E-R2LML), aims to address the computationally intensive nature of the first variant. As discussed in Section II, it employs a technique first used in Wang et al (2012), according to which the coefficients of a test sample are set equal to the ones of its nearest (in terms of Euclidean distance) training sample. Finally, it is worth mentioning that both variants employ a sum-of-nuclear-norms regularizer to avoid over-fitting, when warranted.

In order to optimize the aforementioned formulations, two efficient Block Coordinate Descent (BCD) algorithms are presented in Section III. Specifically, as delineated in Section III-A, a two-block minimization algorithm is able to solve the E-R2LML learning problem. The first block minimization, with respect to the weight matrices, constitutes a Proximal Subgradient Descent (PSD) step, which is able to cope with the non-smooth nature of the formulation’s regularizer. The second block minimization, which attempts to optimize the metric coefficients, constitutes a straightforward Majorization Minimization (MM) step. On the other hand, the algorithm intended for solving the T-R2LML formulation differs from the first one in that it includes an additional block minimization with respect to the test samples’ similarities. As shown in Section III-B, the relevant optimization, while addressing a binary integer programming problem, can be efficiently performed. The convergence analysis for both methods is presented in Section III-C.

Finally, Section IV reports three experiments. The first one studies the importance of regularization in the proposed frameworks based on synthetic datasets. The second one highlights the relationship between the number of local metrics and the resulting accuracy. The third one demonstrates the capabilities of T-R2LML and E-R2LML with respect to classification tasks. When compared to other recent global or local metric learning approaches, T-R2LML and E-R2LML achieve the highest classification accuracy in $9$ and $14$ out of $18$ datasets, respectively.

II Problem Formulation

Define $\mathbb{N}_{M}\triangleq\{1,2,\ldots,M\}$ for any positive integer $M$. Suppose we have a training set of $N$ inputs $\{\boldsymbol{x}_{n}\in\mathbb{R}^{D}\}_{n\in\mathbb{N}_{N}}$ and an accompanying similarity matrix $\boldsymbol{S}\in\left\{0,1\right\}^{N\times N}$ as side information, in which each entry represents a corresponding pair-wise sample similarity. If $\boldsymbol{x}_{m}$ and $\boldsymbol{x}_{n}$ are similar, then $s_{mn}=1$; otherwise, $s_{mn}=0$. In a classification context, two samples from the same (or different) class can be naturally deemed similar (or dissimilar).

The Mahalanobis distance between two samples $\boldsymbol{x}_{n}$ and $\boldsymbol{x}_{m}$ is $d_{\boldsymbol{A}}(\boldsymbol{x}_{m},\boldsymbol{x}_{n})\triangleq\sqrt{(\boldsymbol{x}_{m}-\boldsymbol{x}_{n})^{T}\boldsymbol{A}(\boldsymbol{x}_{m}-\boldsymbol{x}_{n})}$. We will refer to $\boldsymbol{A}\in\mathbb{R}^{D\times D}$ (a positive semi-definite matrix, denoted as $\boldsymbol{A}\succeq 0$) as the weight matrix of the metric. When $\boldsymbol{A}=\boldsymbol{I}$, the previous metric becomes the Euclidean distance metric. Since any positive semi-definite weight matrix can be expressed as $\boldsymbol{A}=\boldsymbol{L}^{T}\boldsymbol{L}$, where $\boldsymbol{L}\in\mathbb{R}^{P\times D}$ with $P\leq D$, the previously defined Mahalanobis distance can be expressed as $d_{\boldsymbol{A}}(\boldsymbol{x}_{m},\boldsymbol{x}_{n})=\left\|\boldsymbol{L}(\boldsymbol{x}_{m}-\boldsymbol{x}_{n})\right\|_{2}$. This last expression implies that the Mahalanobis distance based on $\boldsymbol{A}$ between two points in the native space can be viewed as the Euclidean distance between the corresponding points in the feature space obtained through the linear transformation $\boldsymbol{L}$.
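The equivalence between the Mahalanobis distance in the native space and the Euclidean distance in the feature space can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data; the function name `mahalanobis` and the sizes `D` and `P` are illustrative choices and do not come from the paper.

```python
import numpy as np

def mahalanobis(x_m, x_n, A):
    """d_A(x_m, x_n) = sqrt((x_m - x_n)^T A (x_m - x_n)) for a PSD weight matrix A."""
    diff = x_m - x_n
    # Clip at zero to guard against tiny negative values caused by round-off.
    return float(np.sqrt(max(diff @ A @ diff, 0.0)))

rng = np.random.default_rng(0)
D, P = 5, 3                       # native dimension D, feature dimension P <= D
L = rng.standard_normal((P, D))   # hypothetical linear transformation
A = L.T @ L                       # PSD weight matrix of rank at most P

x_m, x_n = rng.standard_normal(D), rng.standard_normal(D)

d_native = mahalanobis(x_m, x_n, A)                  # Mahalanobis distance, native space
d_feature = float(np.linalg.norm(L @ (x_m - x_n)))   # Euclidean distance, feature space
d_euclid = mahalanobis(x_m, x_n, np.eye(D))          # A = I recovers the Euclidean metric
```

Here `d_native` and `d_feature` agree up to floating-point error, and `d_euclid` matches the ordinary Euclidean distance between the two points.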

Metric learning approaches try to learn $\boldsymbol{A}$ so as to minimize the distances between pairs of similar points, while maximizing, or maintaining above a certain threshold, the distances between dissimilar points in the feature space. The problem can be formulated as follows:

[TABLE]

Problem (1) is a semi-definite programming problem involving a global metric based on $\boldsymbol{A}$. Several approaches, such as LMNN, ITML and NCA, learn a single global metric. However, as argued earlier via Figure 1, a global metric may not be advantageous under all circumstances.

In this paper, we propose R2LML, a new local metric approach. We assume that the metric involved is expressed as a conical combination of $K\geq 1$ Mahalanobis metrics. The weight matrix of the metric between $\boldsymbol{x}_{n}$ and $\boldsymbol{x}_{m}$ is defined as $\sum_{k}\boldsymbol{A}^{k}g^{k}_{m}g^{k}_{n}$. Here, $\boldsymbol{g}^{k}\in\mathbb{R}^{N}$ is a vector for each local metric $k$, of which the $n^{th}$ element $g^{k}_{n}$ may be considered as a measure of how pertinent the $k^{th}$ metric is when computing distances involving the $n^{th}$ sample. Not only do these metrics change throughout the input space along the data’s underlying manifold, but they are also affected by the similarity of nearby samples. Note that these coefficient vectors will also be unknown for test samples and, hence, need to be inferred as well. A natural avenue to achieve this is via a transductive learning scheme.

The metric based on $\sum_{k}\boldsymbol{A}^{k}g^{k}_{m}g^{k}_{n}$ is actually a semi-metric (Sefer and Kingsford, 2011), i.e., it may violate the triangle inequality. For suitable choices of $\boldsymbol{g}$, there exist triplets of samples that do not satisfy the triangle inequality in the feature space. However, in our experiments, it seems that a proper metric is almost always learned. For example, when considering the Pendigits dataset (containing about $200$ samples), the triangle inequalities that we examined (over one million) were all satisfied. In the rest of our work, we still refer to this semi-metric as a metric for simplicity.

Transductive learning trains on both labeled and unlabeled data to yield improved performance. According to Vapnik (1998), when solving a problem, one should avoid inferring a function as an intermediate step. Many transductive learning approaches have been proposed for various algorithms. In Bennett (1999), Chen et al (2002), Gammerman et al (2013) and Joachims (1999), the authors developed transductive learning frameworks for Support Vector Machines. Joachims (2003) and Kukar et al (2002) designed transductive algorithms for KNN classifiers and general classifiers, respectively. There are also transductive learning approaches for graph-based models in Talukdar and Crammer (2009), Liu and Chang (2009) and Zhou and Burges (2007).

In T-R2LML, the input training set $\{\boldsymbol{x}_{n}\in\mathbb{R}^{D}\}_{n\in\mathbb{N}_{N}}$ and test set $\{\boldsymbol{x}_{n}\in\mathbb{R}^{D}\}_{n\in\mathbb{N}_{M}}$ are combined. Since the labels of test samples are unknown, the entries of the similarity matrix $\boldsymbol{S}\in\left\{0,1\right\}^{(N+M)\times(N+M)}$ that involve test data are randomly initialized. The vectors $\boldsymbol{g}^{k}$ belong to $\Omega^{\prime}_{g}\triangleq\left\{\left\{\boldsymbol{g}^{k}\right\}_{k\in\mathbb{N}_{K}}\in\left[0,1\right]^{N+M}:\boldsymbol{g}^{k}\succeq\boldsymbol{0},\ \sum_{k}\boldsymbol{g}^{k}=\boldsymbol{1}\right\}$, where ’$\succeq$’ denotes component-wise ordering. The $\boldsymbol{g}^{k}$’s need to sum up to the all-ones vector $\boldsymbol{1}$, so that at least one metric is relevant when computing distances from each sample. Obviously, if $K=1$, then $\boldsymbol{g}^{1}=\boldsymbol{1}$, which amounts to learning a single global metric.
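To make the conical combination concrete, the sketch below builds the pair-specific weight matrix $\sum_{k}\boldsymbol{A}^{k}g^{k}_{m}g^{k}_{n}$ for toy data and evaluates the resulting squared distance. It assumes NumPy; all values are random, and the names `pair_weight_matrix` and `local_sq_distance`, as well as the choice $P=D$, are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 6, 4, 3   # samples, native dimension, number of local metrics (toy sizes)

# Hypothetical local transformations L^k and their PSD weight matrices A^k = (L^k)^T L^k.
Ls = rng.standard_normal((K, D, D))
As = np.einsum("kpd,kpe->kde", Ls, Ls)

# Coefficients g^k_n >= 0; normalising over k makes sum_k g^k the all-ones vector.
G = rng.random((K, N))
G /= G.sum(axis=0, keepdims=True)

def pair_weight_matrix(m, n):
    """Weight matrix sum_k A^k g^k_m g^k_n used for the pair (m, n)."""
    return np.einsum("k,kde->de", G[:, m] * G[:, n], As)

def local_sq_distance(X, m, n):
    """Squared distance between samples m and n under their pair-specific weight matrix."""
    diff = X[m] - X[n]
    return float(diff @ pair_weight_matrix(m, n) @ diff)

X = rng.standard_normal((N, D))
d01 = local_sq_distance(X, 0, 1)
```

Because the coefficients are non-negative and each $\boldsymbol{A}^{k}$ is positive semi-definite, every pair-specific weight matrix is itself positive semi-definite, so `d01` is non-negative.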

Based on the previous description, the weight matrix for each pair $(m,n)$ is defined as $\sum_{k}\boldsymbol{A}^{k}g^{k}_{m}g^{k}_{n}$. Note that the distance between every pair of points features a different weight matrix. We now consider the following formulation motivated by Problem (1), which varies over $k\in\mathbb{N}_{K}$:

[TABLE]

where $\Delta\boldsymbol{x}_{mn}\triangleq\boldsymbol{x}_{m}-\boldsymbol{x}_{n}$ and $\mbox{rank}(\boldsymbol{L}^{k})$ denotes the rank of matrix $\boldsymbol{L}^{k}$ . In the objective function, the first term attempts to minimize the distance between similar samples, while the second term along with the first set of soft constraints (due to the slack variables $\xi^{k}_{mn}$ ) encourage distances between pairs of dissimilar samples to be larger than $1$ . Evidently, $C>0$ controls the penalty of violating the previous prerequisite. Finally, the last term penalizes large ranks of the linear transformations $\boldsymbol{L}^{k}$ . Therefore, the regularization parameter $\lambda\geq 0$ essentially controls the dimensionality of the feature space. As is typical for identifying good values for regularization parameters, both $C$ and $\lambda$ are chosen via a validation procedure. Note that the diagonal elements are all set to $1$ in the similarity matrix. Finally, the last constraint guarantees that the testing samples include all the labels of the training set.

Via the use of the hinge function, $[u]_{+}\triangleq\max\{u,0\}$ for all $u\in\mathbb{R}$ , Problem (2) can be reformulated by eliminating the slack variables. Notice that $\mbox{rank}(\boldsymbol{L}^{k})$ is a non-convex function w.r.t. $\boldsymbol{L}^{k}$ and, hence, is hard to optimize. Following the approaches of Candès and Tao (2009) and Candès and Recht (2008), $\mbox{rank}(\boldsymbol{L}^{k})$ can be replaced with its convex envelope, i.e., $\boldsymbol{L}^{k}$ ’s nuclear norm. The new problem is now formulated as:

[TABLE]

where $\left\|\cdot\right\|_{*}$ denotes the nuclear norm; specifically, $\left\|\boldsymbol{L}^{k}\right\|_{*}\triangleq\sum_{s=1}^{P}\sigma_{s}(\boldsymbol{L}^{k})$, where $\sigma_{s}(\boldsymbol{L}^{k})$ is the $s^{th}$ singular value of $\boldsymbol{L}^{k}$.
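The two building blocks of the relaxed problem, the hinge function and the nuclear norm, are simple to compute. The following is a minimal sketch, assuming NumPy; the matrix `L_low` is a randomly generated rank-2 example and the helper names are illustrative, not taken from the paper.

```python
import numpy as np

def hinge(u):
    """[u]_+ = max(u, 0), applied element-wise."""
    return np.maximum(u, 0.0)

def nuclear_norm(L):
    """Sum of the singular values of L."""
    return float(np.linalg.svd(L, compute_uv=False).sum())

rng = np.random.default_rng(2)
D, true_rank = 6, 2
# A hypothetical rank-2 transformation built as a product of thin factors.
L_low = rng.standard_normal((D, true_rank)) @ rng.standard_normal((true_rank, D))

rank = int(np.linalg.matrix_rank(L_low))   # the non-convex quantity being penalized
nuc = nuclear_norm(L_low)                  # its convex surrogate

# Penalty paid by a dissimilar pair: positive only while its squared distance is below 1.
delta_x = rng.standard_normal(D)
penalty = float(hinge(1.0 - np.linalg.norm(L_low @ delta_x) ** 2))
```

The rank counts the non-zero singular values, while the nuclear norm sums them, which is why shrinking the nuclear norm tends to drive small singular values to zero.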

A shortcoming of T-R2LML is that it is computationally intensive, since the computation of the gradient in each step requires $O(K(M+N)^{2})$ operations and, typically, $M\gg N$. Hence, we are also inclined to consider a faster, albeit approximate, approach to address our local metric learning problem. Specifically, as done in Wang et al (2012), for each test sample $\boldsymbol{x}$, its $\boldsymbol{g}$ vector will be assigned the value of the corresponding vector associated with $\boldsymbol{x}$’s nearest (in terms of Euclidean distance) training sample. We refer to this model as E-R2LML; its training only requires $O(KN^{2})$ operations per step.
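The nearest-neighbor assignment of coefficients used by E-R2LML can be sketched in a few lines. This is an illustration only, assuming NumPy and a brute-force distance computation; the function name `assign_test_coefficients` and the toy arrays are hypothetical.

```python
import numpy as np

def assign_test_coefficients(X_train, G_train, X_test):
    """Copy to each test sample the g vector of its Euclidean-nearest training sample.

    X_train: (N, D) training samples; G_train: (K, N) learned coefficients;
    X_test: (M, D) test samples. Returns a (K, M) coefficient array.
    """
    # Squared Euclidean distances between all test/training pairs, shape (M, N).
    sq_dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = sq_dists.argmin(axis=1)
    return G_train[:, nearest]

X_train = np.array([[0.0, 0.0], [10.0, 10.0]])
G_train = np.array([[0.9, 0.2],
                    [0.1, 0.8]])      # columns sum to one over the K = 2 metrics
X_test = np.array([[1.0, -1.0], [9.0, 12.0], [0.5, 0.5]])
G_test = assign_test_coefficients(X_train, G_train, X_test)
```

In this toy case the first and third test samples inherit the coefficients of the first training sample, and the second test sample inherits those of the second training sample. Since each copied column already sums to one, the test coefficients remain feasible.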

For E-R2LML, $\boldsymbol{g}^{k}$ belongs to $\Omega_{g}\triangleq\left\{\left\{\boldsymbol{g}^{k}\right\}_{k\in\mathbb{N}_{K}}\in\left[0,1\right]^{N}:\boldsymbol{g}^{k}\succeq\boldsymbol{0},\ \sum_{k}\boldsymbol{g}^{k}=\boldsymbol{1}\right\}$ when considering only the training set. Finally, the problem becomes:

[TABLE]

III Algorithm

Problem (4) and Problem (3) reflect minimizations over two and three sets of variables, respectively. In E-R2LML, for fixed $\boldsymbol{g}^{k}$, the problem is non-convex w.r.t. $\boldsymbol{L}^{k}$, since the second term in Eq. (4) is the composition of a convex function (the hinge function) with a non-monotone function of $\boldsymbol{L}^{k}$, namely $1-\left\|\boldsymbol{L}^{k}\Delta\boldsymbol{x}_{mn}\right\|^{2}_{2}$. On the other hand, the problem is also non-convex w.r.t. $\boldsymbol{g}^{k}$ for fixed $\boldsymbol{L}^{k}$, since the similarity matrix $\boldsymbol{S}$ is almost always indefinite, as will be argued in the sequel. Thus, the objective function may have multiple minima, and an iterative procedure to minimize it may have to be initialized multiple times with different values for the unknown parameters in order to find a good solution. The same observations apply to T-R2LML as well. Finally, notice that, for T-R2LML, when optimizing Problem (3) w.r.t. $\boldsymbol{S}$ while holding $\boldsymbol{g}^{k}$ and $\boldsymbol{L}^{k}$ fixed, the problem under consideration is convex. In what follows, we discuss two training algorithms that can perform the optimizations in question: a two-block BCD algorithm for E-R2LML and a very similar BCD algorithm for T-R2LML.

III-A Two-Block Algorithm for E-R2LML

We start with a discussion of the BCD that trains the E-R2LML framework. For the first block, we solve for every $\boldsymbol{L}^{k}$ while holding the $\boldsymbol{g}^{k}$'s fixed. In this case, Problem (4) becomes an unconstrained minimization problem, which can be expressed in the form $f(\boldsymbol{w})+r(\boldsymbol{w})$, where $\boldsymbol{w}$ is the parameter we minimize over (in our case, all $\boldsymbol{L}^{k}$'s). $f(\boldsymbol{w})$ is the non-differentiable hinge-based loss function, while $r(\boldsymbol{w})$ is a non-smooth, convex regularization term. Hence, we resort to a PSD method in a similar fashion as in Rakotomamonjy et al (2011) and Chen et al (2009). It is worth noting that this particular approach is a special case of the one presented in Duchi and Singer (2009). It is this relationship that we leverage to develop the convergence analysis of our PSD steps in Section III-C.
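One PSD step for a single $\boldsymbol{L}^{k}$ can be sketched as below: a subgradient step on $f$ followed by a proximal step on the nuclear norm, whose proximal operator is soft-thresholding of the singular values. This is only a sketch under stated assumptions: the names `subgrad_f`, `prox_nuclear` and `psd_step` are hypothetical, the regularizer is assumed to enter as $\lambda\left\|\boldsymbol{L}^{k}\right\|_{*}$, and each pair is assumed to carry a single weight `w` (standing in for $g^{k}_{m}g^{k}_{n}$) that multiplies both loss terms.

```python
import numpy as np

def subgrad_f(L, dX, s, w, C):
    """A subgradient of
    f(L) = sum_p w_p * [ s_p*||L dx_p||^2 + C*(1-s_p)*max(0, 1-||L dx_p||^2) ]."""
    proj = dX @ L.T                              # row p is L @ dx_p
    sq = np.sum(proj ** 2, axis=1)               # ||L dx_p||^2
    active = (s == 0) & (sq < 1.0)               # dissimilar pairs inside the margin
    coef = w * (s - C * active)                  # per-pair multiplier of grad ||L dx_p||^2
    return 2.0 * (coef[:, None] * proj).T @ dX   # sum_p coef_p * 2 * (L dx_p) dx_p^T

def prox_nuclear(W, tau):
    """Proximal operator of tau*||W||_*: soft-threshold the singular values."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(sigma - tau, 0.0)) @ Vt

def psd_step(L, dX, s, w, C, lam, eta):
    half = L - eta * subgrad_f(L, dX, s, w, C)   # subgradient step on f
    return prox_nuclear(half, eta * lam)         # proximal step on lam*||.||_*

# Toy usage with hypothetical sizes: 20 pairs of 3-dimensional difference vectors.
rng = np.random.default_rng(0)
dX = rng.normal(size=(20, 3))
s = rng.integers(0, 2, size=20)                  # 1 = similar pair, 0 = dissimilar pair
w = rng.random(20)
C = 1.0
L0 = 0.3 * np.eye(3)
L1 = psd_step(L0, dX, s, w, C, lam=0.1, eta=1e-2)
```

Because the proximal step shrinks singular values towards zero, a sufficiently large threshold `eta * lam` produces low-rank (eventually zero) iterates, which is the mechanism behind the reduced-rank solutions.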

Next, for the second block we minimize w.r.t. each $\boldsymbol{g}^{k}$ vector, while the $\boldsymbol{L}^{k}$ ’s are assumed to be fixed. Consider a matrix $\boldsymbol{\bar{S}}^{k}$ associated to the $k^{th}$ metric, whose $(m,n)$ element is defined as:

[TABLE]

Then, by concatenating all individual $\boldsymbol{g}^{k}$ vectors into a single vector $\boldsymbol{g}\in\mathbb{R}^{KN}$ and by defining the block-diagonal matrix $\boldsymbol{\tilde{S}}$ as:

[TABLE]

Problem (4) can be expressed as:

[TABLE]

where $\Omega_{g}=\left\{\boldsymbol{g}\in\left[0,1\right]^{KN}:\boldsymbol{g}\succeq\boldsymbol{0},\ \boldsymbol{B}\boldsymbol{g}=\boldsymbol{1}\right\}$, $\boldsymbol{B}\triangleq\boldsymbol{1}^{T}\otimes\boldsymbol{I}_{N}$ and $\otimes$ denotes the Kronecker product. Problem (7) is non-convex, since $\boldsymbol{\tilde{S}}$ is almost always indefinite. This stems from the fact that $\boldsymbol{\tilde{S}}$ is a block diagonal matrix, whose blocks are Euclidean Distance Matrices (EDMs). EDMs feature exactly one positive eigenvalue (unless all of them equal $0$). Since each EDM is a hollow matrix, its trace equals $0$, which implies that its remaining eigenvalues must be negative Balaji and Bapat (2007). Therefore, $\boldsymbol{\tilde{S}}$ will feature negative eigenvalues.
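The spectral property of EDMs invoked above can be checked numerically. The following sketch builds the matrix of squared Euclidean distances for a few random points (the points and sizes are hypothetical) and counts the signs of its eigenvalues; it illustrates the general EDM property and does not reproduce the exact blocks of $\boldsymbol{\tilde{S}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))          # 6 hypothetical points in R^3

# EDM of squared Euclidean distances between all pairs of points.
diff = points[:, None, :] - points[None, :, :]
edm = np.sum(diff ** 2, axis=2)

eigvals = np.linalg.eigvalsh(edm)         # symmetric matrix, so eigenvalues are real
tol = 1e-9                                # threshold separating numerical zeros
n_positive = int(np.sum(eigvals > tol))
n_negative = int(np.sum(eigvals < -tol))
print("trace:", np.trace(edm), "positive:", n_positive, "negative:", n_negative)
```

The zero diagonal forces the eigenvalues to sum to zero, so the single positive eigenvalue must be balanced by negative ones, which is why a quadratic form built on such blocks is indefinite.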

In order to minimize Problem (7), we employ a MM approach Hunter and Lange (2004), which requires first identifying a function of $\boldsymbol{g}$ that majorizes the objective function at hand. Let $\mu\triangleq-\lambda_{max}(\boldsymbol{\tilde{S}})$, where $\lambda_{max}(\boldsymbol{\tilde{S}})$ is the largest eigenvalue of $\boldsymbol{\tilde{S}}$. Since $\boldsymbol{\tilde{S}}$ is indefinite, $\lambda_{max}(\boldsymbol{\tilde{S}})>0$. Then, $\boldsymbol{H}\triangleq\boldsymbol{\tilde{S}}+\mu\boldsymbol{I}$ is negative semi-definite. Let $q(\boldsymbol{g})\triangleq\boldsymbol{g}^{T}\boldsymbol{\tilde{S}}\boldsymbol{g}$ be the cost function of Eq. (7). Note that $(\boldsymbol{g}-\boldsymbol{g}^{\prime})^{T}\boldsymbol{H}(\boldsymbol{g}-\boldsymbol{g}^{\prime})\leq 0$ for any $\boldsymbol{g}$ and $\boldsymbol{g}^{\prime}$. Since $\boldsymbol{g}^{T}\boldsymbol{H}\boldsymbol{g}=q(\boldsymbol{g})+\mu\left\|\boldsymbol{g}\right\|^{2}_{2}$, expanding this quadratic form yields $q(\boldsymbol{g})\leq-\boldsymbol{g}^{\prime T}\boldsymbol{H}\boldsymbol{g}^{\prime}+2\boldsymbol{g}^{\prime T}\boldsymbol{H}\boldsymbol{g}-\mu\left\|\boldsymbol{g}\right\|^{2}_{2}$ for all $\boldsymbol{g}$, with equality when $\boldsymbol{g}=\boldsymbol{g}^{\prime}$. The right hand side of the aforementioned inequality constitutes $q$'s majorizing function, denoted as $q(\boldsymbol{g}|\boldsymbol{g}^{\prime})$. The majorizing function is used to iteratively optimize $\boldsymbol{g}$ based on the current estimate $\boldsymbol{g}^{\prime}$. So we have the following minimization problem, which is convex w.r.t. $\boldsymbol{g}$:

[TABLE]
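The majorization property behind this convex surrogate can be verified numerically. The sketch below uses a small random symmetric hollow matrix as a stand-in for $\boldsymbol{\tilde{S}}$ (an assumption made purely for illustration) and checks that $q(\boldsymbol{g}|\boldsymbol{g}^{\prime})-q(\boldsymbol{g})\geq 0$ on random points, with a zero gap at $\boldsymbol{g}=\boldsymbol{g}^{\prime}$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
S = (A + A.T) / 2                      # hypothetical symmetric stand-in for S-tilde
np.fill_diagonal(S, 0.0)               # zero trace, hence indefinite when non-zero

mu = -np.linalg.eigvalsh(S)[-1]        # mu = -lambda_max(S); eigvalsh sorts ascending
H = S + mu * np.eye(5)                 # negative semi-definite by construction

def q(g):
    return g @ S @ g

def q_major(g, g_prev):
    # Right hand side of the majorization inequality, q(g | g_prev).
    return -g_prev @ H @ g_prev + 2 * g_prev @ H @ g - mu * (g @ g)

g_prev = rng.random(5)
gaps = [q_major(g, g_prev) - q(g) for g in rng.random((1000, 5))]
print("smallest gap:", min(gaps))
```

The gap equals $-(\boldsymbol{g}-\boldsymbol{g}^{\prime})^{T}\boldsymbol{H}(\boldsymbol{g}-\boldsymbol{g}^{\prime})$, which is non-negative because $\boldsymbol{H}$ is negative semi-definite.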

This problem is readily solvable, as the next theorem implies.

Theorem 1.

Let $\boldsymbol{g},\boldsymbol{d}\in\mathbb{R}^{KN}$ , $\boldsymbol{B}\triangleq\boldsymbol{1}^{T}\otimes\boldsymbol{I}_{N}\in\mathbb{R}^{N\times KN}$ and $c>0$ . The unique minimizer $\boldsymbol{g}^{*}$ of

[TABLE]

has the form

[TABLE]

where $g_{i}$ is the $i^{th}$ element of $\boldsymbol{g}$ and $\boldsymbol{\alpha}\in\mathbb{R}^{N}$ is the Lagrange multiplier vector associated to the equality constraint.

Proof.

The Lagrangian of Problem (9) is formulated as:

[TABLE]

where $\boldsymbol{\alpha}\in\mathbb{R}^{N}$ and $\boldsymbol{\beta}\in\mathbb{R}^{KN}$ with $\boldsymbol{\beta}\succeq\boldsymbol{0}$ are Lagrange multiplier vectors. If we set the partial derivative of $L(\boldsymbol{g},\boldsymbol{\alpha},\boldsymbol{\beta})$ with respect to $\boldsymbol{g}$ to $\boldsymbol{0}$ , we readily have

[TABLE]

Let $\gamma_{i}\triangleq(\boldsymbol{B}^{T}\boldsymbol{\alpha})_{i}-d_{i}$ . Combining Eq. (12) with the complementary slackness condition $\beta_{i}g_{i}=0$ , one obtains that, if $\gamma_{i}\leq 0$ , then $\beta_{i}=-\gamma_{i}$ and $g_{i}=0$ , while, when $\gamma_{i}>0$ , then $\beta_{i}=0$ and, evidently, $g_{i}=\frac{1}{c}\gamma_{i}$ . These two observations can be summarized as $g_{i}=\frac{1}{c}\left[\gamma_{i}\right]_{+}$ , which completes the proof. ∎

In order to exploit the result of Theorem 1 for obtaining a concrete solution to Problem (8), a binary search is employed to find the (unknown) optimal values of the Lagrange multipliers $\alpha_{i}$ , so they satisfy the equality constraint $\boldsymbol{B}\boldsymbol{g}=\boldsymbol{1}$ .
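A possible realization of this binary search is sketched below. It assumes that the objective of Problem (9) has the form $\frac{c}{2}\left\|\boldsymbol{g}\right\|^{2}_{2}+\boldsymbol{d}^{T}\boldsymbol{g}$, which is consistent with the closed form of Theorem 1 but is not shown explicitly here, and it stores $\boldsymbol{g}$ and $\boldsymbol{d}$ as $K\times N$ arrays so that $(\boldsymbol{B}^{T}\boldsymbol{\alpha})_{i}$ reduces to the multiplier $\alpha_{n}$ of the sample that entry $i$ belongs to. The function name `solve_theorem1` is hypothetical.

```python
import numpy as np

def solve_theorem1(d, c, n_iter=100):
    """Minimize (c/2)*||g||^2 + d^T g  s.t.  g >= 0 and sum_k g[k, n] = 1 for every n.

    d: (K, N) array; row k multiplies g^k (assumed layout of the stacked vector).
    Returns g of shape (K, N) and the multipliers alpha of shape (N,).
    """
    # phi_n(alpha) = sum_k max(alpha - d[k, n], 0) / c is non-decreasing in alpha,
    # equals 0 at alpha = min_k d[k, n] and is >= 1 at alpha = min_k d[k, n] + c.
    lo = d.min(axis=0)
    hi = lo + c
    for _ in range(n_iter):
        mid = (lo + hi) / 2
        total = np.maximum(mid - d, 0.0).sum(axis=0) / c
        too_small = total < 1.0
        lo = np.where(too_small, mid, lo)   # raise alpha_n where the sum is below 1
        hi = np.where(too_small, hi, mid)   # lower alpha_n otherwise
    alpha = (lo + hi) / 2
    g = np.maximum(alpha - d, 0.0) / c      # closed form of Theorem 1
    return g, alpha

rng = np.random.default_rng(0)
d = rng.normal(size=(3, 5))                 # K = 3 metrics, N = 5 samples (toy sizes)
g, alpha = solve_theorem1(d, c=2.0)
```

Since the equality constraint decouples across samples, the $N$ one-dimensional searches can run in parallel, as the vectorized loop does.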

In conclusion, the entire algorithm for solving Problem (4) is depicted in Algorithm 1 and can be recapitulated as follows: for the first block, the $\boldsymbol{g}^{k}$ vectors are assumed fixed and a PSD step is employed to minimize the cost function of Eq. (4) w.r.t. each weight matrix $\boldsymbol{L}^{k}$. In the second block, all $\boldsymbol{L}^{k}$'s are held fixed to the values obtained from the first block, and the solution offered by Theorem 1, along with the binary search solutions for the $\alpha_{i}$'s, is used to compute the optimal $\boldsymbol{g}^{k}$'s by iteratively solving Problem (8) via a MM scheme. These two main blocks are repeated until convergence.

III-B The Three-Block Algorithm Variant for T-R2LML

The first two BCD steps of T-R2LML are identical to the ones of E-R2LML. However, since T-R2LML embodies a transductive learning approach, a third BCD step is required in order to predict the similarities between all samples, including the ones used for testing. Specifically, for the third block optimization, Problem (3) is minimized over $\boldsymbol{S}$ for fixed $\boldsymbol{L}^{k}$'s and $\boldsymbol{g}^{k}$'s. By defining

[TABLE]

Problem (3) becomes:

[TABLE]

where $\boldsymbol{\Psi}\in\mathbb{R}^{M\times N}$ is the matrix with elements $\psi_{mn}$ . This is a $0-1$ integer programming problem. By scanning the matrix $\boldsymbol{\Psi}$ row by row, Problem (14) will be optimally solved using the following rules:

• For rows of $\boldsymbol{\Psi}$ containing at least one negative element, set the corresponding $s_{mn}$ element(s) to $1$; the remaining elements are set to $0$.

• For rows of $\boldsymbol{\Psi}$ with no negative element, the $s_{mn}$ element, which corresponds to the smallest $\psi_{mn}$, is set to $1$; the remaining elements are set to $0$.

• Note that $s_{nm}$ must equal $s_{mn}$, since the matrix $\boldsymbol{S}$ is symmetric.
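The row-scanning rules above can be sketched as follows. The function name `update_similarity_block` is hypothetical, and the sketch only fills the $M\times N$ block of similarities that $\boldsymbol{\Psi}$ indexes; mirroring the result into the symmetric matrix $\boldsymbol{S}$ is assumed to happen when the block is written back.

```python
import numpy as np

def update_similarity_block(Psi):
    """Apply the row-wise rules to an (M, N) coefficient matrix Psi; returns a 0/1 matrix."""
    S = (Psi < 0).astype(int)                    # negative coefficient: s_mn = 1 lowers the cost
    no_negative = ~(Psi < 0).any(axis=1)         # rows without any negative entry
    rows = np.flatnonzero(no_negative)
    S[rows, np.argmin(Psi[rows], axis=1)] = 1    # such rows keep only their smallest entry
    return S

Psi = np.array([[ 0.5, -0.2, -0.1],
                [ 0.3,  0.1,  0.7],
                [-0.4,  0.2,  0.0]])
S = update_similarity_block(Psi)
```

In this toy example the first and third rows contain negative coefficients, so exactly those entries are switched on, while the second row keeps only its smallest entry.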

For the sake of completeness, the relevant algorithm is summarized in Algorithm 1. Note that these three main blocks are repeated until a preset maximum number of steps is reached.

III-C Analysis

In this subsection, we investigate the convergence of our proposed Algorithm 1. This is a local analysis, since our framework is non-convex. As mentioned in previous sections, a PSD approach is used to minimize the function $f(\boldsymbol{w})+r(\boldsymbol{w})$, where both $f\triangleq\sum_{k}\sum_{m,n}s_{mn}\left\|\boldsymbol{L}^{k}\Delta\boldsymbol{x}_{mn}\right\|^{2}_{2}g_{n}^{k}g_{m}^{k}+C(1-s_{mn})\left[1-\left\|\boldsymbol{L}^{k}\Delta\boldsymbol{x}_{mn}\right\|^{2}_{2}\right]_{+}$ and $r\triangleq\sum_{k}\left\|\boldsymbol{L}^{k}\right\|_{*}$ are non-differentiable. Denote by $\partial f$ the subdifferential of $f$ and define $\left\|\partial f(\boldsymbol{w})\right\|\triangleq\sup_{\boldsymbol{g}\in\partial f(\boldsymbol{w})}\left\|\boldsymbol{g}\right\|_{2}$; the corresponding quantities for $r$ are similarly defined. As in Langford et al (2009) and Shalev-Shwartz and Tewari (2011), the subgradients are assumed to be bounded, i.e.:

[TABLE]

where $A$ and $G$ are positive scalars. Let $\boldsymbol{w}^{*}$ be the minimizer of $f(\boldsymbol{w})+r(\boldsymbol{w})$ . Then we have the following theorem for the problem under consideration.

Theorem 2.

Suppose that a PSD method is employed to solve $\min_{\boldsymbol{w}}\{f(\boldsymbol{w})+r(\boldsymbol{w})\}$. Assume that 1) $f$ and $r$ are lower-bounded; 2) the norms of any subgradients $\partial f$ and $\partial r$ are bounded as in Eq. (15); 3) $\left\|\boldsymbol{w}^{*}\right\|\leq D$ for some $D>0$; 4) $r(\boldsymbol{0})=0$. Let $\eta_{t}\triangleq\frac{D}{\sqrt{8T}G}$, where $T$ is the number of iterations of the PSD algorithm. Then, for a constant $c\leq 4$, such that $(1-cA\frac{D}{\sqrt{8T}G})>0$, and initial estimate of the solution $\boldsymbol{w}_{1}=\boldsymbol{0}$, we have:

[TABLE]

The detailed proof of Theorem 2 is given in Appendix A. Theorem 2 implies that, as $T$ grows, the PSD iterates approach $\boldsymbol{w}^{*}$.
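As a small worked example of the quantities appearing in Theorem 2, the step length $\eta=\frac{D}{\sqrt{8T}G}$ and the positivity condition $1-cA\eta>0$ can be evaluated directly. All constants below are hypothetical values chosen only for illustration.

```python
import math

def psd_step_length(D, G, T):
    """Step length eta = D / (sqrt(8T) * G) used in Theorem 2."""
    return D / (math.sqrt(8 * T) * G)

# Hypothetical constants: bound D on ||w*||, subgradient bounds A and G, c <= 4.
D, G, A, c, T = 10.0, 5.0, 2.0, 4.0, 500
eta = psd_step_length(D, G, T)
condition_ok = (1 - c * A * eta) > 0
print(f"eta = {eta:.6f}, 1 - c*A*eta = {1 - c * A * eta:.4f}")
```

Because $\eta$ shrinks like $1/\sqrt{T}$, the condition is eventually satisfied for any fixed $A$, $G$ and $D$ once $T$ is large enough.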

Theorem 3.

Algorithm 1 yields a convergent, non-increasing sequence of cost function values relevant to Problem (4) and Problem (3). Furthermore, the set of fixed points of the iterative map embodied by Algorithm 1 includes the Karush-Kuhn-Tucker (KKT) points of Problem (4) and Problem (3).

The proof is showcased in Appendix B. Theorem 3 implies the convergence of the two proposed algorithms.

IV Experiments

IV-A Effects of our nuclear norm-based regularization

Since T-R2LML involves $(M-1)M/2+MN+(D^{2}+M+N)K$ parameters, while E-R2LML employs $(D^{2}+N)K$ , we can see that both R2LML frameworks may benefit from regularization, when confronted with scarce, high-dimensional, noisy data. Two synthetic datasets were created to study the effects of nuclear norm regularization.
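To make the two parameter counts concrete, the sketch below evaluates them for illustrative sizes ($D=30$ features, $N=80$ training samples, $M=320$ test samples and $K=3$ metrics). The function names are hypothetical.

```python
def n_params_e(D, N, K):
    """Parameter count of E-R2LML: (D^2 + N) * K."""
    return (D ** 2 + N) * K

def n_params_t(D, N, M, K):
    """Parameter count of T-R2LML: (M-1)M/2 + MN + (D^2 + M + N) * K."""
    return (M - 1) * M // 2 + M * N + (D ** 2 + M + N) * K

e_count = n_params_e(D=30, N=80, K=3)
t_count = n_params_t(D=30, N=80, M=320, K=3)
print(e_count, t_count)
```

With only $80$ training samples, both counts exceed the amount of available data by a wide margin, which motivates the regularization study that follows.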

The first set consisted of $30$-dimensional samples, while the second one consisted of $60$-dimensional samples. In both cases, samples were drawn from a mixture of two highly overlapping Gaussian distributions, whose covariance matrices had a spectral radius of $0.3$. Moreover, in the case of the second dataset, features randomly selected with probability $0.5$ were set to $0$ to emulate sparsity. For both datasets, $80$ samples were used for training via E-R2LML and $320$ samples for testing. Also, $3$ local metrics were employed, while the remaining parameters were set as follows: the PSD step length was set to $10^{-6}$ and the algorithm was allowed to run for $5$ epochs of $500$ iterations each. The classification accuracy using a $5$-nearest neighbor search is reported in Table I and Table II for various values of the regularization parameter.

These two tables reflect, as expected, that regularization proves to be very important for E-R2LML and, by extension, for T-R2LML as well, since the latter deals with additional parameters to be learned. More specifically, it is shown that cross-validation over $\lambda$ is essential in improving classification accuracy for noisy, potentially sparse, highly overlapping data. This is especially pronounced for the second dataset, where not employing regularization is clearly inferior to the performance attained by fine-tuning $\lambda$. Also, for the same dataset, Table II illustrates the sparsity-inducing properties of the nuclear norm regularizer. It is worth noting that, although not specifically shown here, the metrics' all-zero columns obtained for $\lambda=10^{3}$ and $\lambda=10^{4}$ followed exactly the sparsity pattern of the relevant features.

IV-B Real datasets

In order to assess the utility of the proposed models, we performed experiments on $18$ datasets, namely, Robot Navigation, Letter Recognition, Pendigits, Wine Quality, Gamma Telescope, Ionosphere, Breast Tissue, Glass, Heart, Sonar, WPBC, Optdigits and Isolet datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.html), and Image Segmentation, Two Norm, Ring Norm datasets from the Delve Dataset Collection (http://www.cs.toronto.edu/~delve/data/datasets.html). We also considered the Columbia University Image Library (COIL20) (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php) and USPS (http://www.gaussianprocess.org/gpml/data/) datasets. Major characteristics of these datasets are summarized in Table III. Following experimental settings similar to the ones used in Wang et al (2012) and Zhu et al (2014), PCA was used on the data of COIL20, Isolet, Optdigits and USPS to reduce their number of features to $30$, as shown in Table III.

We first explored how the performance of T-R2LML (https://github.com/yinjiehuang/R2LMTL/archive/master.zip) and E-R2LML (https://github.com/yinjiehuang/R2LML/archive/master.zip) varies with respect to the number of local metrics. Then, we compared T-R2LML and E-R2LML to other state-of-the-art global and local metric learning algorithms, namely, ITML, LMNN, LMNN-MM, GLML and PLML.

IV-B1 Number of local metrics for T-R2LML and E-R2LML

One aspect that was investigated is how the performances of T-R2LML and E-R2LML vary with respect to the number of local metrics $K$. In Weinberger and Saul (2008), the authors set $K$ equal to the number of classes for each dataset, which might not necessarily be the optimal choice. An abundance of data may imply that more local metrics are necessary for improved performance; this is an aspect we examined for T-R2LML and E-R2LML. For all datasets, the range of $K$ we considered was $1$ to $7$, which, aside from COIL20, USPS and Isolet, included the number of classes represented in the data. As we will argue in the sequel, the optimal $K$ does not necessarily coincide with the number of classes of the corresponding classification problem. As a matter of fact, it coincides only in roughly one quarter of the cases.

For T-R2LML, we set the penalty parameter $C$ to $1$ and the regularization parameter $\lambda$ to $10$. In the case of E-R2LML, $\lambda$ was chosen smaller, since it employs fewer parameters compared to T-R2LML and, therefore, is less prone to over-fitting. Note that all aforementioned parameter values were selected via cross-validation and subsequently held fixed. Moreover, we terminated our algorithm if it reached $5$ epochs or when the difference in cost function values between two consecutive iterations was less than $10^{-4}$. In each epoch, the PSD was run for $500$ iterations with step length $10^{-5}$ for the Sonar dataset, $10^{-6}$ for the Ionosphere and Glass datasets, $10^{-8}$ for the Ring Norm dataset, $10^{-9}$ for the Robot, Letter, Two Norm and Heart datasets, $10^{-10}$ for the COIL20, Isolet, Optdigits and USPS datasets, $10^{-11}$ for the Pendigits, Image Segmentation, Telescope, Wine Quality and Wpbc datasets and $10^{-13}$ for the Breast Tissue dataset. The MM loop was terminated if the number of iterations reached $3000$ or when the difference in cost function values between two consecutive iterations was less than $10^{-3}$.

For E-R2LML, parameters such as $C$, the number of epochs and the number of iterations were set to the same values as for T-R2LML. The PSD step length was fixed to $10^{-3}$ for the Glass and Sonar datasets, to $10^{-5}$ for the Robot and Ionosphere datasets, to $10^{-6}$ for the Letter, Two Norm, Ring Norm and Optdigits datasets, to $10^{-7}$ for the Isolet and USPS datasets, to $10^{-8}$ for the Wine Quality, Image Segmentation, COIL20 and Heart datasets, to $10^{-9}$ for the Pendigits, Gamma Telescope and Wpbc datasets and to $10^{-11}$ for the Breast Tissue dataset.

The relation between the number of local metrics and classification accuracy for each dataset is reported in Figure 2 and Figure 3. Several observations can be made based on these results. First, the results indicate that training with more data does not necessarily imply that an increased value of $K$ is needed for improved performance. For example, in the case of T-R2LML, for the Pendigits, Wine Quality, Two Norm, Ring Norm, Glass, Isolet and Optdigits datasets, $2$ local metrics are enough to yield the best results among all choices of $K$. When E-R2LML is trained with the Telescope and USPS datasets, superior results are obtained using only $2$ metrics. Secondly, one cannot discern a deterministic relationship between the classification accuracy and the number of local metrics utilized that is suitable for all datasets. For the Ring Norm dataset, the classification accuracy is monotonically decreasing with respect to $K$, while for the remaining datasets, the optimal $K$ varies in a non-apparent fashion with respect to their number of classes. All these observations suggest that validation over $K$ is needed to select the best performing model. Also, one discerns that, although T-R2LML is trained with more data, E-R2LML outperforms it on all datasets except the Telescope, Ionosphere, Breast Tissue, Heart and Wpbc datasets. Finally, from the obtained results, it becomes apparent that using both R2LML variants as local metric learning methods (when $K>1$) is, more often than not, advantageous compared to using them with a single global metric (when $K=1$); this is most prominently exhibited in the case of the Heart, Wpbc, Ionosphere and Telescope datasets.

IV-B2 Performance Comparisons

We compared T-R2LML and E-R2LML to several other metric learning algorithms, including Euclidean metric KNN, ITML Davis et al (2007), LMNN Weinberger et al (2006), LMNN-MM Weinberger and Saul (2008), GLML Noh et al (2010) and PLML Wang et al (2012). Both ITML and LMNN learn a global metric, while LMNN-MM, GLML and PLML are local metric learning algorithms. After the metrics are learned for each method, a $5$ -nearest neighbor decision rule was employed to classify unlabeled samples.
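For the R2LML variants, the nearest neighbor decision rule operates on conically combined local distances. The sketch below assumes the distance $\sum_{k}g^{k}_{m}g^{k}_{n}\left\|\boldsymbol{L}^{k}(\boldsymbol{x}_{m}-\boldsymbol{x}_{n})\right\|^{2}_{2}$, which mirrors the first term of $f$ in Section III-C; the function names and the toy data are hypothetical, and the example uses $3$ neighbors only because the toy set is tiny.

```python
import numpy as np
from collections import Counter

def local_distance(x_m, x_n, g_m, g_n, Ls):
    """sum_k g_m[k]*g_n[k]*||L^k (x_m - x_n)||^2 for metrics Ls of shape (K, D, D)."""
    proj = Ls @ (x_m - x_n)                       # (K, D): L^k (x_m - x_n) for every k
    return float(np.sum(g_m * g_n * np.sum(proj ** 2, axis=1)))

def knn_predict(x, g_x, X_train, G_train, y_train, Ls, k=5):
    """Majority vote among the k training samples closest to x under the local distance."""
    dists = [local_distance(x, xn, g_x, gn, Ls) for xn, gn in zip(X_train, G_train)]
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Toy data: two well-separated classes, one metric (K = 1) equal to the identity.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
G_train = np.ones((6, 1))
Ls = np.eye(2)[None, :, :]
label_a = knn_predict(np.array([0.2, 0.2]), np.ones(1), X_train, G_train, y_train, Ls, k=3)
label_b = knn_predict(np.array([5.5, 5.5]), np.ones(1), X_train, G_train, y_train, Ls, k=3)
```

With $K=1$, unit weights and an identity metric, the rule reduces to ordinary Euclidean k-NN, which makes the toy output easy to verify by inspection.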

For our experiments we used LMNN, LMNN-MM (http://www.cse.wustl.edu/~kilian/code/code.html), ITML (http://www.cs.utexas.edu/~pjain/itml/) and PLML (http://cui.unige.ch/~wangjun/papers/PLML.zip) implementations that were available online. For ITML, a good value of $\gamma$ was found via cross-validation. Also, for LMNN and LMNN-MM, the number of attracting neighbors during training was set to $1$ as suggested in the paper. Additionally, for LMNN, at most $500$ iterations were performed and $30\%$ of training data were used as a validation set. The maximum number of iterations for LMNN-MM was set to $50$ and a step size of $10^{-7}$ was used. For GLML, we chose the optimal $\gamma$ setting via cross-validation. Finally, the PLML hyper-parameter values were chosen as in Wang et al (2012), while $\alpha_{1}$ was chosen via cross-validation. For T-R2LML, the value of the regularization parameter $\lambda$ was cross-validated over $\{10^{-1},1,10^{1},...,10^{6},10^{7}\}$. The other parameter values were set as described in Section IV-B1. With respect to E-R2LML, the regularization parameter $\lambda$ was chosen via a validation procedure over the set $\{10^{-2},10^{-1},1,10^{1},10^{2}\}$. The remaining parameter settings of our methods were the same as the ones used in the previous experiments. Finally, for both methods, $K$, the number of metrics, was cross-validated over $\{1,2,...,7\}$.

For pair-wise model comparisons, we employed McNemar’s test. Also, since there were $8$ algorithms to be compared, we used Holm’s step-down procedure as a multiple hypothesis testing method to control the Family-Wise Error Rate (FWER) Hochberg and Tamhane (1987) of the resulting pair-wise McNemar’s tests. The experimental results for a family-wise significance level of $0.05$ are reported in Table IV.
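The testing protocol can be illustrated with the following sketch. It assumes the chi-square form of McNemar's test with continuity correction (the text does not state which variant was used) and implements Holm's step-down procedure on a list of p-values; the function names and the example p-values are hypothetical.

```python
import math

def mcnemar_p(b, c):
    """McNemar's chi-square test with continuity correction.

    b, c: discordant counts (samples one classifier gets right and the other wrong).
    """
    if b + c == 0:
        return 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return math.erfc(math.sqrt(stat / 2))   # survival function of chi-square, 1 dof

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure; returns one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):  # first failure stops the procedure
            break
        reject[i] = True
    return reject

decisions = holm([0.01, 0.04, 0.03, 0.005])
p_example = mcnemar_p(30, 5)
```

Holm's procedure compares the smallest p-value against $\alpha/m$, the next against $\alpha/(m-1)$, and so on, stopping at the first non-rejection, which controls the FWER at level $\alpha$.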

Despite employing a simplistic strategy to infer the weight vector of testing data, E-R2LML achieves the best performance for $14$ out of the $18$ datasets and outperforms its transductive version, while the other methods outperform E-R2LML on the Ring Norm, Isolet, Sonar and Wpbc datasets. GLML’s surprisingly good result for the Ring Norm dataset is probably because GLML assumes a Gaussian mixture underlying the data generation process and the Ring Norm dataset is a $2$ -class recognition problem drawn from a mixture of two multivariate normal distributions. T-R2LML produced best results for $9$ out of the $18$ datasets. We also notice that T-R2LML achieves almost second best results for the remaining datasets except for Robot. For the Ring Norm, Sonar and Wpbc datasets, T-R2LML even outperforms E-R2LML.

Next, PLML exhibits competitive results, more specifically, best in $6$ out of the $18$ cases, but performs poorly on some datasets like COIL20 and Sonar, even worse than KNN. For Glass, Heart, Isolet and Optdigits, PLML’s performance is also quite impressive; it is ranked $2^{nd}$ among the other methods. Regarding ITML, by using a global metric, it is ranked first for $5$ datasets. Often, ITML ranks at least $2^{nd}$ and seems to be suitable for low-dimensional datasets. Finally, GLML rarely performs well; according to Table IV, GLML only achieves $3^{rd}$ or $4^{th}$ ranks for $9$ out of the $18$ datasets.

Another general observation that can be made is the following: employing metric learning is almost always a good choice, since the classification accuracy of utilizing a Euclidean metric is almost always ranked last among all $8$ methods considered. Interestingly, LMNN-MM, even though being a local metric learning algorithm, does not show any significant performance advantages over LMNN (a global metric method); for some datasets, it even obtained lower classification accuracy than LMNN. It is possible that fixing the number of local metrics to the number of classes present in the dataset curtails LMNN-MM’s performance. According to the obtained results, T-R2LML and E-R2LML yield much better performance for all datasets compared to LMNN-MM.

V Conclusions

In this paper, we proposed a new local metric learning framework, namely R2LML. R2LML learns $K$ Mahalanobis-based local metrics that are conically combined, so that pairs of similar points are measured as being located close to each other, in contrast to pairs of dissimilar points, for which the opposite is desired. Two variants of the framework were considered: T-R2LML employs transductive learning to infer the conic combination of metrics to be used for assessing distances between test and training data, while E-R2LML employs a simpler technique to accelerate the learning process. If $T$ is the number of iterations, a local analysis shows that the block-minimization training procedure of both variants converges at a rate of $\mathcal{O}(1/\sqrt{T})$, which is typical for sub-gradient methods.

In order to show the merits of T-R2LML and E-R2LML, we performed a series of experiments involving $18$ benchmark classification problems. First, we studied the effect of regularization in R2LML and showed the importance of the nuclear norm-based regularizer in providing low-rank solutions that avoid over-fitting. Second, we varied the number of local metrics $K$ and discussed its influence on classification accuracy. We concluded that the obtained optimal $K$ does not necessarily equal the number of classes of the dataset under consideration. Also, our results indicate that larger datasets do not necessarily require employing a large number of local metrics. Finally, in a second set of experiments, we compared T-R2LML and E-R2LML to several other global or local metric learning algorithms and demonstrated that our proposed framework is highly competitive.

Acknowledgments

Y. Huang acknowledges partial support from a UCF Graduate College Trustees Doctoral Fellowship and National Science Foundation (NSF) grant No. 1200566. C. Li acknowledges partial support from NSF grants No. 0806931 and No. 0963146. Furthermore, M. Georgiopoulos acknowledges partial support from NSF grants No. 1161228 and No. 0525429, while G. C. Anagnostopoulos acknowledges partial support from NSF grant No. 1263011. Note that any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

Appendix A

In order to solve Problem (3) and Problem (4), the following PSD update scheme is used:

[TABLE]

Above, $\boldsymbol{g}^{f}_{t}\in\partial f(\boldsymbol{w}_{t})$ and $\eta$ is a fixed step length. PSD first computes the unconstrained subgradient with respect to $f$ .

In the second step, we find a new $\boldsymbol{w}_{t+1}$ from the intermediate result $\boldsymbol{w}_{t+\frac{1}{2}}$. By the first order optimality condition, with the minimizer $\boldsymbol{w}$, it holds that:

[TABLE]

In light of Eq. (17), the above property amounts to:

[TABLE]

Since $\boldsymbol{w}_{t+1}$ is the minimizer of Eq. (18), there is a vector $\boldsymbol{g}^{r}_{t+1}\in\partial r(\boldsymbol{w}_{t+1})$ such that Eq. (19) holds, i.e.

[TABLE]

Finally, we have the following PSD update rule:

[TABLE]

With the definitions of $\left\|\partial f(\boldsymbol{w})\right\|$ and $\left\|\partial r(\boldsymbol{w})\right\|$ in Section III-C, we provide Lemma 4 as follows. Note that, unless specified otherwise, $\left\|\cdot\right\|$ will stand for the $L_{2}$ norm.

Lemma 4.

Assume that the subgradients of $f$ and $r$ are bounded as in Eq. (15) for some positive scalars $A$ and $G$ . Let $\eta\geq 0$ be a fixed step length and $\boldsymbol{w}^{*}$ be the minimizer of $f(\boldsymbol{w})+r(\boldsymbol{w})$ . Then, for a constant $c\leq 4$ we have:

[TABLE]

Proof.

By the definition of the subgradient, $\boldsymbol{g}^{r}_{t+1}\in\partial r(\boldsymbol{w}_{t+1})$ and $r$'s convexity:

[TABLE]

Additionally, the following relations hold:

[TABLE]

where the second step is due to Cauchy-Schwarz inequality.

Now we relate $\left\|\boldsymbol{w}_{t+1}-\boldsymbol{w}^{*}\right\|$ to $\left\|\boldsymbol{w}_{t}-\boldsymbol{w}^{*}\right\|$ as follows:

[TABLE]

In Eq. (24), $\eta^{2}\left\|\boldsymbol{g}^{f}_{t}+\boldsymbol{g}^{r}_{t+1}\right\|^{2}$ can be bounded as follows:

[TABLE]

Substituting Eq. (Proof.) and Eq. (Proof.) into Eq. (24), we obtain:

[TABLE]

The convexities of both of $f(\boldsymbol{w})$ and $r(\boldsymbol{w})$ imply that:

[TABLE]

The following also holds:

[TABLE]

By substituting Eq. (27), Eq. (28) and Eq. (29) into Eq. (Proof.), we obtain

[TABLE]

By choosing $c\leq 4$ , the second inequality holds. After some algebra, one can derive Eq. (4) from Eq. (Proof.).

∎

The following is the detailed proof of Theorem 2:

Proof.

By Lemma 4, we have:

[TABLE]

Summing Eq. (Proof.) over $t=1,\ldots,T$ we get

[TABLE]

The last inequality holds because $\left\|\boldsymbol{w}^{*}\right\|\leq D$ and $\boldsymbol{w}_{1}=\boldsymbol{0}$, as described in Theorem 2. For part of Eq. (Proof.), it holds:

[TABLE]

The second equality holds due to the assumptions that $\boldsymbol{w}_{1}=\boldsymbol{0}$ and $r(\boldsymbol{0})=0$. Besides, given the step length $\eta$, the term $\eta(1-cA\eta)r(\boldsymbol{w}_{T+1})$ is non-negative, which establishes the last inequality. Now, when substituting Eq. (Proof.) back into Eq. (Proof.), we get

[TABLE]

Additionally, the following holds:

[TABLE]

Based on Eq. (Proof.), Eq. (35) and choosing $\eta=\frac{D}{\sqrt{8T}G}$, we obtain the main result of Theorem 2.

∎

Appendix B

In this section, we provide the detailed proof of Theorem 3 in Section III-C.

Proof.

We first prove that each of the two or three block minimizations in our algorithms does not increase the objective function value under consideration. This is true for the first block minimization, according to Theorem 2. For the second block, since a MM algorithm is used, we have the following relationships:

[TABLE]

This implies that the second block minimization does not increase the objective function value. The optimal algorithm for the third block also guarantees the non-increasing nature of the cost function. Since the objective function is lower-bounded, Algorithm 1 converges.

Next, we prove that the set of fixed points of the proposed Algorithm 1 includes the KKT points of Problem (4). Towards this purpose, suppose the algorithm has converged to a KKT point $\left\{\boldsymbol{L}^{k*},\boldsymbol{g}^{k*}\right\}_{k\in\mathbb{N}_{K}}$ ; then, it suffices to show that this point is also a fixed point of the algorithm’s iterative map. For notational brevity, let $f_{0}(\boldsymbol{L}^{k},\boldsymbol{g}^{k})$ , $f_{1}(\boldsymbol{g}^{k})$ and $h_{1}(\boldsymbol{g}^{k})$ be the cost function, inequality constraint and equality constraint of Problem (4) respectively. By definition, a KKT point will satisfy

[TABLE]

In relation to Problem (7), which the second block tries to solve, by setting the gradient of the problem’s Lagrangian to $\boldsymbol{0}$ , the KKT point will satisfy the following equality:

[TABLE]

Problem (8) can be solved based on Eq. (12) of Theorem 1; specifically, we obtain that

[TABLE]

Substituting Eq. (38) and $\boldsymbol{H}=\boldsymbol{\tilde{S}}+\mu\boldsymbol{I}$ into Eq. (39), one immediately obtains that

[TABLE]

In other words, step $2$ of Algorithm 1 will not update the solution. Now, if we substitute Eq. (38) back into Eq. (37), we obtain $\boldsymbol{0}\in\partial_{\boldsymbol{L}^{k}}f_{0}(\boldsymbol{L}^{k*},\boldsymbol{g}^{k*})$ for all $k$ , which is the optimality condition for the subgradient method; the PSD step (the first block minimization of Algorithm 1) will also not update the solution. Thus, a KKT point of Problem (4) is a fixed point of our algorithm.

Finally, we prove that the set of fixed points of the proposed Algorithm 1 includes the KKT points of Problem (3). We assume the algorithm has converged to a KKT point $\left\{\boldsymbol{L}^{k*},\boldsymbol{g}^{k*}\right\}_{k\in\mathbb{N}_{K}}$ and $S^{*}$ is the true similarity matrix. Similar to the previous proof, we start from the second block. Following the same procedure, we find the second block will not update the solution of vector $\boldsymbol{g}^{*}$ . Now, during the third block minimization, the $\psi_{mn}$ quantities remain unchanged, since $\boldsymbol{g}^{*}$ does not change. The minimization procedure we proposed for the third block will leave the similarity matrix unchanged, since the coefficient matrix with elements $\psi_{mn}$ is fixed. Now, if Eq. (38) is substituted back into Eq. (37), we obtain the optimality condition for the first block minimization. Thus, the first block will also not update the solution. Therefore, a KKT point of Problem (3) is a fixed point of Algorithm 1.

∎

Bibliography

Balaji R, Bapat R (2007) On Euclidean distance matrices. Linear Algebra and its Applications 424(1):108–117
Bennett KP (1999) Advances in kernel methods. MIT Press, Cambridge, MA, USA
Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the International Conference on Machine Learning (ICML), ACM, pp 81–88
Candès EJ, Recht B (2008) Exact matrix completion via convex optimization. CoRR abs/0805.4471
Candès EJ, Tao T (2009) The power of convex relaxation: near-optimal matrix completion. CoRR abs/0903.1476
Chen X, Pan W, Kwok JT, Carbonell JG (2009) Accelerated gradient method for multi-task sparse learning problem. In: Proceedings of the International Conference on Data Mining (ICDM), IEEE Computer Society, pp 746–751
Chen Y, Wang G, Dong S (2002) Learning with progressive transductive support vector machine. In: Proceedings of the International Conference on Data Mining (ICDM), IEEE Computer Society, pp 67–74
Chopra S, Hadsell R, Lecun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR), IEEE Press, pp 539–546
