Lifelong Metric Learning

Gan Sun; Yang Cong; Ji Liu; Xiaowei Xu

arXiv:1705.01209·cs.LG·June 13, 2017


TL;DR

This paper introduces Lifelong Metric Learning (LML), a framework enabling models to learn new metrics incrementally while retaining previous knowledge, inspired by human learning, and optimized via online algorithms.

Contribution

The paper proposes a novel lifelong metric learning framework that maintains a shared subspace, transfers knowledge to new tasks, and updates over time, advancing online multi-task metric learning.

Findings

01

Effective in multi-task metric learning datasets

02

Maintains performance across tasks over time

03

Demonstrates efficiency and effectiveness


Tables

Table I: Statistics of the Base Metric Learning Models

Name	Metric Type	Metric Function	$\triangle_{t}$ in Eq. (14)
OASIS [9]	Similarity	$\ell_{t}(M_{t})=\max(0,1-s_{M_{t}}(x_{i},x_{j})+s_{M_{t}}(x_{i},x_{k}))$	$\sum_{(i,j,k)\in\mathcal{T}}(x_{i}(x_{k}-x_{j})^{T}+(x_{k}-x_{j})x_{i}^{T})$
SCML [37]	Distance	$\ell_{t}(M_{t})=\max(0,1-d_{M_{t}}(x_{i},x_{j})+d_{M_{t}}(x_{i},x_{k}))$	$\sum_{(i,j,k)\in\mathcal{T}}((x_{i}-x_{j})(x_{i}-x_{j})^{T}-(x_{i}-x_{k})(x_{i}-x_{k})^{T})$

Table II: Sentiment and Isolet datasets: classification error and training time of the competing metric learning models. The reported performance is averaged over five random repetitions, and methods with the best performance are marked in bold.

Dataset	Task	stEuc	stLMNN	stSCML	stOASIS	uLMNN	uSCML	mtLMNN	mtSCML	ELLA	Ours+OASIS	Ours+SCML
Sentiment	Books	33.5 $\pm$ 0.5	29.7 $\pm$ 0.4	27.0 $\pm$ 0.5	28.3 $\pm$ 0.4	29.6 $\pm$ 0.4	28.0 $\pm$ 0.4	29.1 $\pm$ 0.4	25.8 $\pm$ 0.4	32.8 $\pm$ 0.5	27.8 $\pm$ 0.4	25.3 $\pm$ 0.5
	DVD	33.9 $\pm$ 0.5	29.4 $\pm$ 0.5	26.8 $\pm$ 0.4	23.5 $\pm$ 0.4	29.4 $\pm$ 0.5	27.9 $\pm$ 0.5	29.5 $\pm$ 0.5	26.5 $\pm$ 0.5	31.0 $\pm$ 0.7	23.5 $\pm$ 0.5	25.0 $\pm$ 0.4
	Electronics	26.2 $\pm$ 0.4	23.3 $\pm$ 0.4	21.1 $\pm$ 0.5	20.3 $\pm$ 0.4	25.1 $\pm$ 0.4	22.9 $\pm$ 0.4	22.5 $\pm$ 0.4	20.2 $\pm$ 0.5	19.0 $\pm$ 0.7	18.0 $\pm$ 0.4	18.5 $\pm$ 0.4
	Kitchen	26.2 $\pm$ 0.6	21.2 $\pm$ 0.5	19.0 $\pm$ 0.4	17.3 $\pm$ 0.4	23.5 $\pm$ 0.3	21.9 $\pm$ 0.5	22.1 $\pm$ 0.5	19.0 $\pm$ 0.4	16.1 $\pm$ 0.5	12.0 $\pm$ 0.4	15.8 $\pm$ 0.4
	Avg. Error	30.0 $\pm$ 0.2	25.9 $\pm$ 0.2	23.5 $\pm$ 0.2	22.4 $\pm$ 0.3	26.9 $\pm$ 0.2	25.2 $\pm$ 0.2	25.8 $\pm$ 0.2	22.9 $\pm$ 0.2	24.8 $\pm$ 0.4	20.3 $\pm$ 0.2	21.2 $\pm$ 0.4
	Avg. Runtime	N/A	11min	12s	0.5min	9min	10s	8min	1min	0.5min	0.5min	18s
Isolet	Isolet1	28.9 $\pm$ 0.0	23.2 $\pm$ 0.1	19.6 $\pm$ 0.2	24.5 $\pm$ 0.2	23.2 $\pm$ 0.1	55.3 $\pm$ 0.1	21.3 $\pm$ 0.7	19.1 $\pm$ 0.2	N/A	21.7 $\pm$ 0	16.5 $\pm$ 0.1
	Isolet2	30.5 $\pm$ 0.7	24.4 $\pm$ 0.9	20.9 $\pm$ 1.6	19.9 $\pm$ 0.0	24.4 $\pm$ 0.9	52.9 $\pm$ 1.0	22.9 $\pm$ 0.6	20.1 $\pm$ 2.1	N/A	23.3 $\pm$ 1.3	22.4 $\pm$ 2.1
	Isolet3	35.3 $\pm$ 1.2	28.4 $\pm$ 1.2	24.5 $\pm$ 0.3	25.8 $\pm$ 0.8	28.4 $\pm$ 1.2	53.1 $\pm$ 2.8	26.0 $\pm$ 1.2	22.9 $\pm$ 0.2	N/A	21.0 $\pm$ 1.4	23.3 $\pm$ 0.0
	Isolet4	35.7 $\pm$ 0.4	27.2 $\pm$ 1.9	25.3 $\pm$ 2.4	29.4 $\pm$ 2.7	27.2 $\pm$ 1.9	53.5 $\pm$ 1.8	25.3 $\pm$ 2.4	23.8 $\pm$ 2.8	N/A	23.1 $\pm$ 1.0	25.1 $\pm$ 0.2
	Isolet5	37.4 $\pm$ 0.5	30.2 $\pm$ 0.8	26.7 $\pm$ 1.0	29.7 $\pm$ 0.8	30.2 $\pm$ 0.8	54.6 $\pm$ 3.1	28.3 $\pm$ 1.4	25.7 $\pm$ 2.8	N/A	21.4 $\pm$ 1.6	27.5 $\pm$ 0.4
	Avg. Error	33.5 $\pm$ 0.3	26.7 $\pm$ 0.7	23.4 $\pm$ 0.9	25.9 $\pm$ 1.1	26.7 $\pm$ 0.7	53.9 $\pm$ 1.7	24.7 $\pm$ 0.7	22.3 $\pm$ 1.5	N/A	22.1 $\pm$ 1.2	23.0 $\pm$ 0.5
	Avg. Runtime	N/A	4min	0.1min	1min	4min	1s	10min	0.5min	N/A	2min	0.3min

Table III: USPS dataset: classification error of the competing metric learning models. The reported performance is averaged over five random repetitions, and methods with the best performance are marked in bold.

Task	# Classes	stEuc	stLMNN	stSCML	stOASIS	mtLMNN	mtSCML	ELLA	Ours+OASIS	Ours+SCML
1	2	0.13 $\pm$ 0.1	0.09 $\pm$ 0.0	0.06 $\pm$ 0.0	0.37 $\pm$ 0.2	0.09 $\pm$ 0.1	0.09 $\pm$ 0.1	N/A	0.22 $\pm$ 0.2	0.03 $\pm$ 0.1
2	2	2.20 $\pm$ 0.8	2.05 $\pm$ 0.6	2.20 $\pm$ 0.4	1.86 $\pm$ 0.0	2.10 $\pm$ 0.6	1.70 $\pm$ 0.1	N/A	1.52 $\pm$ 0.6	2.12 $\pm$ 0.0
3	3	4.19 $\pm$ 1.0	3.59 $\pm$ 0.8	3.18 $\pm$ 0.3	5.23 $\pm$ 0.4	3.92 $\pm$ 1.2	3.41 $\pm$ 1.3	N/A	4.58 $\pm$ 0.1	5.74 $\pm$ 1.4
4	3	7.00 $\pm$ 2.0	6.67 $\pm$ 1.6	6.82 $\pm$ 1.4	4.40 $\pm$ 0.2	6.60 $\pm$ 2.0	5.04 $\pm$ 0.6	N/A	4.43 $\pm$ 1.1	4.51 $\pm$ 0.3
Avg. Error	N/A	3.37 $\pm$ 0.0	3.10 $\pm$ 0.0	3.07 $\pm$ 0.1	2.97 $\pm$ 0.1	3.18 $\pm$ 0.1	2.66 $\pm$ 0.2	N/A	2.68 $\pm$ 0.1	3.09 $\pm$ 0.3

Table IV: Statistics of the label-consistent datasets

Dataset	#Classes	#Samples	#Dimension	#Tasks	Problem Type
Sentiment	2	6400	200	4	Different Task
Isolet	26	7797	617	5	Same Task

Equations

f_{L_{t}}(x_{ti},x_{tj})=x_{ti}^{T}L_{t}^{T}L_{t}x_{tj}=:f_{t,ij}(L_{t}^{T}L_{t}).

f_{L_{t}}(x_{ti},x_{tj})=\triangle x_{t,ij}^{T}L_{t}^{T}L_{t}\triangle x_{t,ij}=:f_{t,ij}(L_{t}^{T}L_{t}),

\ell_{t}(L_{t}^{T}L_{t})=\ell_{t}\Big(f_{t,ij}(L_{t}^{T}L_{t})\Big),\quad\forall(i,j)\in\mathcal{S}_{t}\cup\mathcal{D}_{t},

f_{L_{t}}(x_{ti},x_{tj})=f_{R_{t}}(\hat{x}_{ti},\hat{x}_{tj}),

M_{t}=L_{t}^{T}L_{t}=L_{0}^{T}R_{t}^{T}R_{t}L_{0}=L_{0}^{T}W_{t}L_{0}=\sum_{i=1}^{d}\sum_{j=1}^{d}w_{ij}l_{i}l_{j},

\min_{L_{0},\{W_{t}\}}\ \cdots\ +\gamma\left\|L_{0}\right\|_{F}^{2},

\min_{L_{0},\{W_{t}\}}\ \cdots\ +\gamma\left\|L_{0}\right\|_{F}^{2},

\min_{L_{0},\{W_{t}\}}\ \cdots\ +\lambda_{t}\left\|W_{t}\right\|_{1,\mathrm{off}}\Big\}+\gamma\left\|L_{0}\right\|_{F}^{2},

\min_{L_{0},\{W_{t}\}}\frac{1}{m}\Big\{\sum_{t=1}^{m}\ell_{t}\Big(f_{t,ij}(M_{t})\Big)+\langle L_{0}^{T}W_{t}L_{0}-M_{t},G_{t}\rangle+\frac{1}{2\eta}\left\|L_{0}^{T}W_{t}L_{0}-M_{t}\right\|_{F}^{2}+\lambda_{t}\left\|W_{t}\right\|_{1,\mathrm{off}}\Big\}+\gamma\left\|L_{0}\right\|_{F}^{2},

\min_{L_{0},\{W_{t}\}}\ \cdots\ +\gamma\left\|L_{0}\right\|_{F}^{2}.

\min_{L_{0},\{W_{t}\}}\ \cdots\ +\gamma\left\|L_{0}\right\|_{F}^{2},

\min_{W_{t}}\ f(W_{t})+g(W_{t}),

W_{t}^{i+1}=\arg\min_{W}\frac{1}{2}\left\|W-W_{t}^{i}+\eta_{i}\nabla f(W_{t}^{i})\right\|_{F}^{2}+\lambda_{t}\left\|W\right\|_{1,\mathrm{off}},

\nabla f(W_{t}^{i})=L_{0}L_{0}^{T}W_{t}^{i}L_{0}L_{0}^{T}-L_{0}M_{t}L_{0}^{T}.

\mathrm{prox}(W,\lambda_{t}\eta_{i})=\mathrm{sgn}(W)\odot\Big(\lvert W\rvert-\lambda_{t}\eta_{i}(1-I)\Big)_{+},

\frac{1}{m}\sum_{t=1}^{m}\ \cdots\ =\frac{1}{m}\sum_{t=1}^{m}\ \cdots

f(W_{t})+g(W_{t})-f(W^{*})-g(W^{*})\ \cdots\ \leq\frac{2\eta_{t}\left\|W_{0}-W^{*}\right\|_{F}^{2}}{\epsilon}-1.

\frac{1}{m}\sum_{t=1}^{m}\min_{s^{(t)}}\ \cdots\ +\lambda\left\|L\right\|_{F}^{2},


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Full text

Lifelong Metric Learning

Gan Sun, Yang Cong, Ji Liu, Xiaowei Xu

G. Sun is with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China, 110016. Y. Cong is with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, China, 110016. J. Liu is with the Department of Computer Science, University of Rochester, USA. X. Xu is with the Department of Information Science, University of Arkansas at Little Rock, USA.

Abstract

State-of-the-art online learning approaches are only capable of learning the metric for predefined tasks. In this paper, we consider the lifelong learning problem to mimic “human learning”, i.e., endowing the learned metric with a new capability for a new task from new online samples while incorporating previous experience and knowledge. Therefore, we propose a new metric learning framework: lifelong metric learning (LML), which utilizes only the data of the new task to train the metric model while preserving the original capabilities. More specifically, the proposed LML maintains a common subspace for all learned metrics, named the lifelong dictionary, transfers knowledge from the common subspace to each new metric task with task-specific idiosyncrasy, and redefines the common subspace over time to maximize performance across all metric tasks. For model optimization, we apply an online passive aggressive optimization algorithm to solve the proposed LML framework, where the lifelong dictionary and the task-specific partition are optimized alternately and consecutively. Finally, we evaluate our approach on several multi-task metric learning datasets. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed framework.

Index Terms:

Lifelong Learning, Metric Learning, Multi-task Learning, Low-rank Subspace.

I Introduction

Online metric / similarity learning has achieved remarkable success in a variety of applications [1, 2, 3], such as data mining [4], information retrieval [5] and computer vision [6, 7], mainly due to its high efficiency and scalability to large-scale datasets. Unlike conventional batch learning methods, which learn the metric model offline with all training samples, online learning exploits one sample or a group of samples at a time to update the metric model iteratively, and is well suited to tasks in which data arrive sequentially.

However, most state-of-the-art online metric learning models [8, 9, 10] can only learn online from a fixed set of $t$ ($t>0$) predefined metric tasks and cannot add a new task. In this paper, we consider the lifelong learning problem to mimic “human learning”, i.e., how to extend the current metric to new tasks while retaining its current functionality. For example, in speech recognition, the pronunciation of the same word differs greatly between people depending on their gender, accent, nationality, or other individual characteristics, and it is highly beneficial to leverage the similarities among datasets from different types of speakers while adapting to the specifics of each particular user. Therefore, a speech recognition library should be delivered to each incoming speaker with a set of default speech recognition capabilities, and new speaker-specific metric models need to be added. Another motivating example is an image classification system: a metric learning system can identify whether an image contains an apple or a banana, but the user wishes to extend this ability to a new task, e.g., detecting an orange. To achieve this goal, most state-of-the-art methods [11, 12] must store the training data of all tasks and retrain their models, which is time consuming. Therefore, the key challenge lies in how to learn and accumulate knowledge continuously when early samples are no longer accessible in the online scenario.

As depicted in Fig. 1, in this paper we propose a new framework, called lifelong metric learning (LML), which learns shared metric parameters from previous tasks without degrading performance or accessing the old training data of the $t$ tasks. Based on the assumption that all tasks lie in a low-dimensional common subspace, LML learns a library called the “lifelong dictionary” as a set of shared bases for all metric models, while the learned models of the $t$ tasks can be considered as sparse combinations of this discriminative lifelong dictionary. Specifically, the lifelong dictionary can be initialized efficiently by extracting bases from different regions of the first training task via clustering. As the new $(t+1)$-th task arrives, LML transfers knowledge through the shared bases of the lifelong dictionary to learn the new metric model with sparsity regularization, and refines the lifelong dictionary with first-order information from both the new task and previous tasks. By updating the lifelong dictionary continuously, fresh knowledge is incorporated into the existing lifelong dictionary, thereby improving the performance of the previously learned $t$ models. Therefore, the model of the new $(t+1)$-th task can be obtained without accessing previous training data. We evaluate the LML framework against state-of-the-art multi-task metric learning methods on several datasets, and the experimental results show encouraging performance.

The contributions of this paper include:

• To the best of our knowledge, this is the first work on online metric learning from the perspective of lifelong learning. It uses the previous experience and knowledge of $t$ tasks to learn the new $(t+1)$-th task, improving classification accuracy and reducing training time.

• With the support of the discriminative “lifelong dictionary”, our proposed lifelong metric learning framework can model a new task via sparse combination, which reduces the storage burden because only first-order information, rather than the training data of the previous $t$ tasks, needs to be saved.

• We conduct comparisons and experiments on several real-world datasets, which verify the lower computational cost and the improved performance of our LML framework.

The rest of the paper is structured as follows. Section II gives a brief review of related work. Section III introduces our proposed lifelong metric learning formulation. Section IV then describes how to solve the proposed model efficiently via an online passive aggressive optimization algorithm. In Section V, we report the experimental results, and we conclude the paper in Section VI.

II Related Works

Metric learning and its related methods have a long history. Depending on whether metric learning incorporates multi-task learning, metric learning can be roughly categorized as: Single Metric Learning and Multi-task Metric Learning.

II-A Single Metric Learning

To the best of our knowledge, seeking a better distance metric by learning from a training dataset is the key issue in most state-of-the-art single metric learning models [13, 14, 15]. Distance-metric-based approaches can be categorized into two groups: batch metric learning and online metric learning.

The batch metric learning models [16, 17, 18, 19, 20, 21] can be further divided into two categories. The first comprises models based on nearest neighbors: for example, [22] optimizes the expected leave-one-out error of a stochastic nearest neighbor classifier in the projection space, and [1] proposes the most widely used Mahalanobis distance learning method, Large Margin Nearest Neighbors (LMNN), which learns a Mahalanobis distance metric for $k$NN classification from labeled training examples. The second comprises models based on pairs/triplets: for instance, [23] searches for a clustering that puts similar pairs into the same clusters and dissimilar pairs into different clusters; [24] promotes input sparsity by imposing a group sparsity penalty on the learned metric, together with a trace constraint to encourage output sparsity; and [20] proposes a low-rank metric learning algorithm that yields bilinear similarity functions applicable to high-dimensional data domains. However, batch metric learning models, which assume that all training samples are available prior to the learning phase, cannot be applied in many practical applications, because only a small number of training samples are available at the beginning and the others arrive sequentially. Therefore, researchers have turned to online metric learning, which trains the classifier with newly arriving data.

For online metric learning, [9] designs an Online Algorithm for Scalable Image Similarity learning (OASIS), which learns pairwise similarity quickly and scales linearly with the number of objects and the number of non-zero features. However, OASIS may suffer from over-fitting and is difficult to apply in high-dimensional settings. Furthermore, the computational complexity of learning a full-rank metric can range from $O(d^{2})$ to $O(d^{6.5})$ when the metric learner operates in a high-dimensional sample space $\mathbb{R}^{d}$, where $d$ is the dimension of the training data. To overcome the over-fitting problem, OMLLR [10] proposes an online metric learning model with a low-rank constraint, where the low-rank metric reduces the storage of metric matrices. [25] applies sparse online metric learning to large-scale high-dimensional datasets and explores its application to image retrieval. In addition, LORETA [26] describes an iterative online learning procedure consisting of a gradient step followed by a second-order retraction back to the manifold. To combine the benefits of online learning and the Mahalanobis distance, LEGO [8] uses a Log-Det regularization per instance loss and is guaranteed to yield a positive semidefinite matrix. More details can be found in the two surveys [27] and [28].
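To make the online setting concrete, the following is a minimal sketch of one passive-aggressive step on a single triplet, using the OASIS hinge loss listed in Table I. The function name `oasis_pa_step` and the aggressiveness cap `C` are illustrative choices made here, and the closed-form step size is the standard passive-aggressive rule rather than code taken from this paper.

```python
import numpy as np

def oasis_pa_step(W, x_i, x_j, x_k, C=0.1):
    """One passive-aggressive update for the triplet (x_i, x_j, x_k).

    x_j is the sample similar to x_i, x_k the dissimilar one.
    Returns the updated matrix and the hinge loss before the update.
    """
    # Hinge loss max(0, 1 - s(x_i, x_j) + s(x_i, x_k)) with s(a, b) = a^T W b.
    loss = max(0.0, 1.0 - x_i @ W @ x_j + x_i @ W @ x_k)
    if loss == 0.0:
        return W, loss                      # passive: margin already satisfied
    V = np.outer(x_i, x_j - x_k)            # negative gradient of the active hinge
    norm_sq = float(np.sum(V * V))
    if norm_sq == 0.0:
        return W, loss                      # degenerate triplet, nothing to update
    tau = min(C, loss / norm_sq)            # aggressive step, capped by C
    return W + tau * V, loss

# Toy triplet in R^3 (made-up data, for illustration only).
W0 = np.eye(3)
x_i = np.array([1.0, 0.0, 0.0])
x_j = np.array([0.0, 1.0, 0.0])
x_k = np.array([1.0, 0.0, 0.0])
W1, loss_before = oasis_pa_step(W0, x_i, x_j, x_k, C=10.0)
_, loss_after = oasis_pa_step(W1, x_i, x_j, x_k, C=10.0)
```

With a large cap `C` the step is exactly large enough to remove the loss on the current triplet; a small `C` trades that for robustness to noisy triplets.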

II-B Multi-task Metric Learning

Based on the assumption that the relationships and information shared among different tasks can be exploited, multi-task learning [29, 30, 31, 32, 33, 34, 35, 36] aims to improve generalization performance by learning multiple related tasks simultaneously. However, only a few multi-task metric learning methods have been designed to let metric learning benefit from training all tasks simultaneously. Under the assumption that multiple tasks share a common Mahalanobis metric and each task has a task-specific metric, mtLMNN [11] adapts the LMNN formulation to multi-task learning. However, mtLMNN is computationally more expensive, especially in high dimensions: there are $(t+1)d^{2}$ parameters to be optimized, where $t$ and $d$ denote the number of tasks and the data dimension, respectively. Based on a low-rank assumption, [12] introduces a transformation matrix into multi-task metric learning by learning a common subspace for all tasks and an individual metric for each task, where each individual metric is restricted to the common subspace. In addition, mtSCML [37] constructs a common basis set, and the multiple metrics are regularized to be relevant across tasks (as favored by group sparsity). However, storage and computation become cumbersome when the number of tasks is large. Therefore, to address the situation in which the total number of tasks is large or tasks arrive consecutively, we employ the common subspace as the lifelong dictionary and build a more robust lifelong metric learning framework.

Notations: For a matrix $W\in\mathbb{R}^{m\times n}$, let $w_{ij}$ be the entry in the $i$-th row and $j$-th column of $W$, and let $w_{i}$ denote its $i$-th row. We define the following norms: $\left\|W\right\|_{0}$ is the number of nonzero entries in $W$; $\left\|W\right\|_{1}=\sum_{i=1}^{m}\sum_{j=1}^{n}|w_{ij}|$ and $\left\|W\right\|_{\infty}=\max_{i,j}|w_{ij}|$ are the $\ell_{1}$-norm and $\ell_{\infty}$-norm of $W$, respectively; and $\left\|W\right\|_{2,1}=\sum_{i=1}^{m}\left\|w_{i}\right\|_{2}$. Denote by $\mathrm{sgn}(W)$, $(W)_{+}$ and $\lvert W\rvert$ the elementwise sign, positive part and absolute value of $W$, respectively. Let $\odot$ be the elementwise multiplication.
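The norms above are entrywise, which differs from the induced matrix norms that numerical libraries often return by default. A small NumPy sketch (the helper name `entrywise_norms` is hypothetical) makes the distinction explicit:

```python
import numpy as np

def entrywise_norms(W):
    """Entrywise norms as defined in the notation paragraph."""
    W = np.asarray(W, dtype=float)
    return {
        "l0": int(np.count_nonzero(W)),                  # number of nonzero entries
        "l1": float(np.abs(W).sum()),                    # sum of absolute values
        "linf": float(np.abs(W).max()),                  # largest absolute entry
        "l21": float(np.linalg.norm(W, axis=1).sum()),   # sum of row-wise 2-norms
    }

W = np.array([[3.0, -4.0],
              [0.0, 0.5]])
norms = entrywise_norms(W)

# Elementwise sign, positive part and absolute value.
sign_W, pos_W, abs_W = np.sign(W), np.maximum(W, 0.0), np.abs(W)

# For comparison: the induced 1-norm (maximum absolute column sum),
# which is NOT the entrywise l1-norm used in this paper.
induced_one_norm = float(np.linalg.norm(W, 1))
```

For this matrix the entrywise $\ell_{1}$-norm is 7.5, whereas the induced 1-norm is 4.5.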

III Lifelong Metric Learning

III-A Preliminaries

Assume that there are $m$ related tasks. $(X_{t},Y_{t})$ denotes the training set of the $t$-th task with $\{x_{ti}\in\mathbb{R}^{\hat{d}},i=1,\ldots,n_{t}\}$, where $\hat{d}$ and $n_{t}$ are the dimension and the number of training samples of the $t$-th task, respectively. Let $n=\sum_{t=1}^{m}n_{t}$ be the total number of samples, and let $f_{t}:\mathbb{R}^{\hat{d}}\times\mathbb{R}^{\hat{d}}\rightarrow\mathbb{R}$ be the similarity / distance metric of the $t$-th learning task. $f_{t}$ is assumed to be defined through a linear transformation $L_{t}:\mathbb{R}^{\hat{d}}\rightarrow\mathbb{R}^{d}$ (with $d\ll\hat{d}$ to obtain a low-dimensional representation) as:

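• Similarity Function:

$$f_{L_{t}}(x_{ti},x_{tj})=x_{ti}^{T}L_{t}^{T}L_{t}x_{tj}=:f_{t,ij}(L_{t}^{T}L_{t}). \tag{1}$$

• Distance Function:

$$f_{L_{t}}(x_{ti},x_{tj})=\triangle x_{t,ij}^{T}L_{t}^{T}L_{t}\triangle x_{t,ij}=:f_{t,ij}(L_{t}^{T}L_{t}), \tag{2}$$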

where $x_{ti}$ and $x_{tj}$ are feature vectors, and $\triangle x_{t,ij}=x_{ti}-x_{tj}$. $L_{t}^{T}L_{t}\in\mathbb{R}^{\hat{d}\times\hat{d}}$ must be positive semi-definite to satisfy the properties of a similarity / distance metric. The set of triplets $\mathcal{T}_{t}=\{(i,j,k)|(i,j)\in\mathcal{S}_{t},(i,k)\in\mathcal{D}_{t}\}$ is used to define the side-information in $X_{t}$, where $\mathcal{S}_{t}$ and $\mathcal{D}_{t}$ denote all the similar and dissimilar pairs, respectively. For example, $f_{L_{t}}(x_{ti},x_{tj})\leq f_{L_{t}}(x_{ti},x_{tk})$ requires the similar data pairs $\{(x_{ti},x_{tj})|(i,j)\in\mathcal{S}_{t}\}$ to stay closer than the dissimilar pairs $\{(x_{ti},x_{tk})|(i,k)\in\mathcal{D}_{t}\}$ under the similarity / distance metric $f_{t}$. Unless otherwise specified, both the similarity and the distance function are denoted by $f_{L_{t}}(x_{ti},x_{tj})$ in the following.
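As an illustration of the preliminaries, the sketch below evaluates the bilinear similarity $x_{ti}^{T}L_{t}^{T}L_{t}x_{tj}$ and the squared distance $\triangle x_{t,ij}^{T}L_{t}^{T}L_{t}\triangle x_{t,ij}$ for a toy transformation, and checks one distance triplet. The function names and the toy data are assumptions made for this example only.

```python
import numpy as np

def similarity(L, x_i, x_j):
    """Bilinear similarity x_i^T L^T L x_j."""
    return float((L @ x_i) @ (L @ x_j))

def distance(L, x_i, x_j):
    """Squared distance (x_i - x_j)^T L^T L (x_i - x_j)."""
    diff = L @ (x_i - x_j)
    return float(diff @ diff)

def triplet_satisfied(L, x_i, x_j, x_k):
    """True if the similar pair (i, j) is no farther apart than the dissimilar pair (i, k)."""
    return distance(L, x_i, x_j) <= distance(L, x_i, x_k)

# Toy transformation from R^4 (d_hat = 4) to R^2 (d = 2): keep the first two coordinates.
L = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
x_i = np.array([1.0, 0.0, 5.0, 0.0])
x_j = np.array([1.0, 0.5, -3.0, 2.0])   # close to x_i in the projected space
x_k = np.array([3.0, 0.0, 5.0, 0.0])    # far from x_i in the projected space

# M = L^T L is positive semi-definite by construction.
M = L.T @ L
min_eigenvalue = float(np.linalg.eigvalsh(M).min())
```

Here `x_j` differs strongly from `x_i` in the discarded coordinates, yet the learned-style projection treats the pair as close, which is exactly the kind of behavior a task-specific $L_{t}$ is meant to encode.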

III-B The Lifelong Metric Learning Problem

The original intention of multi-task metric learning is to learn an appropriate distance metric $f_{t}$ for the $t$-th task using all the side-information from the joint training set $\{(X_{1},Y_{1}),(X_{2},Y_{2}),\ldots,(X_{m},Y_{m})\}$. Suppose that the loss of the $t$-th task is determined by the distance function $f_{t}$ (with metric $L_{t}$) and the pairs appearing in $\mathcal{S}_{t}$ and $\mathcal{D}_{t}$:

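$$\ell_{t}(L_{t}^{T}L_{t})=\ell_{t}\Big(f_{t,ij}(L_{t}^{T}L_{t})\Big),\quad\forall(i,j)\in\mathcal{S}_{t}\cup\mathcal{D}_{t}, \tag{3}$$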

where $\ell_{t}$ is an arbitrary loss function of the $t$-th task. However, traditional multi-task metric learning does not consider learning a new metric task without access to the previously used training data. In the context of multi-task metric learning, a lifelong metric learning system encounters a series of metric learning tasks $\ell_{1},\ell_{2},\ldots,\ell_{m}$, where each task $\ell_{t}$ is defined by Eq. (3). We do not assume that the learner knows anything about the tasks in advance, e.g., the total number of tasks $m$ or the distribution of these tasks. At each time step, the lifelong system receives a batch of training data for some metric learning task $t$, either a new metric task or a previously learned task, and may be asked to make predictions on samples of any previous task. Its goal is to establish task models $L_{1},\ldots,L_{m}$ such that:

• Classification Accuracy: each learned metric $L_{t}$ should classify new samples more accurately.

• Computation Efficiency: in the training period, each $L_{t}$ should be updated faster than in traditional multi-task metric learning (i.e., joint learning models).

• Lifelong Learning: new $L_{t}$’s can be added arbitrarily and efficiently when the lifelong system encounters new metric tasks.

III-C Lifelong Metric Learning Framework (LML)

In order to model the correlation among different metric tasks, we assume that the metric $f_{t}$ of the $t$-th task can be represented using a combination of the shared common subspace from a knowledge repository. Motivated by [12], Theorem 1 gives the detailed mathematical description.

Theorem 1

Let $f_{L_{t}}(x_{ti},x_{tj})$ denote the similarity / distance of $x_{ti},x_{tj}\in\mathbb{R}^{\hat{d}}$ defined by the transformation matrix $L_{t}$ as in Eq. (1) or Eq. (2). For any $L_{t}\in\mathbb{R}^{d\times\hat{d}}$ ($d\ll\hat{d}$), there exists a low-dimensional subspace $\mathbb{S}_{t}$ spanned by an orthonormal basis $\{p_{t1},\ldots,p_{td}\}$, with a metric matrix defined by $R_{t}\in\mathbb{R}^{d\times d}$, such that

$$f_{L_{t}}(x_{ti},x_{tj})=f_{R_{t}}(\hat{x}_{ti},\hat{x}_{tj}),$$

where $\hat{x}_{ti}=P_{t}^{T}x_{ti}=[p_{t1},\ldots,p_{td}]^{T}x_{ti}\in\mathbb{R}^{d}$ is the coordinate vector of the projection of $x_{ti}$ onto $\mathbb{S}_{t}$ with respect to the basis matrix $P_{t}$.
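One way to see why such a subspace exists is to factor $L_{t}$ explicitly. The sketch below is an illustrative construction, not necessarily the one used in the proof motivated by [12]: it takes a thin QR decomposition of $L_{t}^{T}$ to obtain an orthonormal basis $P_{t}$ and a small matrix $R_{t}$ with $L_{t}=R_{t}P_{t}^{T}$, and then checks numerically that the metric computed in the full space agrees with the metric computed on the projected coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hat, d = 6, 2
L_t = rng.standard_normal((d, d_hat))

# Thin QR of L_t^T: L_t^T = P_t @ U, with orthonormal columns in P_t.
P_t, U = np.linalg.qr(L_t.T)          # P_t: (d_hat, d), U: (d, d)
R_t = U.T                             # hence L_t = R_t @ P_t.T

x_i = rng.standard_normal(d_hat)
x_j = rng.standard_normal(d_hat)
xh_i, xh_j = P_t.T @ x_i, P_t.T @ x_j  # coordinates in the d-dimensional subspace

# Similarity in the original space versus in the subspace.
sim_full = float(x_i @ L_t.T @ L_t @ x_j)
sim_low = float(xh_i @ R_t.T @ R_t @ xh_j)

# Squared distance in the original space versus in the subspace.
dist_full = float(np.sum((L_t @ (x_i - x_j)) ** 2))
dist_low = float(np.sum((R_t @ (xh_i - xh_j)) ** 2))
```

Both pairs of values agree up to floating-point error, so all task-specific information lives in the $d\times d$ matrix $R_{t}$ once the basis is fixed.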

Therefore, the metric matrix $L_{t}$ of the $t$-th task in Theorem 1 can be explicitly decomposed into a low-dimensional metric part $R_{t}$ and a subspace part $P_{t}$. Our Lifelong Metric Learning (LML) framework can then be described simply as learning an individual metric $R_{t}$ for each task in a common subspace $P_{t}^{T}=L_{0}$. Furthermore, as shown in Fig. 2, the parameter matrix $M_{t}\in\mathbb{R}^{\hat{d}\times\hat{d}}$ for metric task $f_{t}$ can be expressed as:

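$$M_{t}=L_{t}^{T}L_{t}=L_{0}^{T}R_{t}^{T}R_{t}L_{0}=L_{0}^{T}W_{t}L_{0}=\sum_{i=1}^{d}\sum_{j=1}^{d}w_{ij}l_{i}l_{j},$$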

where $W_{t}\in\mathbb{R}^{d\times d}$ denotes the weight matrix. Therefore, each metric task $M_{t}$ can be represented as a linear combination of the “lifelong dictionary” composed of $l_{i}l_{j},\forall i,j=1,\ldots,d$. Generally, since the diagonal elements of $W_{t}$ represent the self-correlation of a transformed feature while the off-diagonal elements represent the correlation among different transformed features, the diagonal elements should be denser than the off-diagonal elements. We encourage the off-diagonal elements of the $W_{t}$’s to be sparse (i.e., to use few components of the lifelong dictionary) in order to ensure that each learned metric model captures a maximal reusable chunk of knowledge.
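The dictionary view can be checked numerically. In the sketch below, $l_{i}$ is read as the $i$-th row of $L_{0}$, so that each dictionary atom is the outer product of two rows; this reading, and all variable names, are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hat = 3, 5
L0 = rng.standard_normal((d, d_hat))   # lifelong dictionary, one basis vector per row
R_t = rng.standard_normal((d, d))      # task-specific low-dimensional metric
W_t = R_t.T @ R_t                      # task-specific weight matrix

# Matrix form of the task metric.
M_t = L0.T @ W_t @ L0

# Same metric as a weighted sum of d*d rank-one dictionary atoms.
M_sum = np.zeros((d_hat, d_hat))
for i in range(d):
    for j in range(d):
        M_sum += W_t[i, j] * np.outer(L0[i], L0[j])

# Storage per task: d*d weights instead of d_hat*d_hat metric entries.
weights_per_task = W_t.size
entries_full_metric = M_t.size
min_eigenvalue = float(np.linalg.eigvalsh(M_t).min())
```

Since the dictionary is shared, adding a task only requires storing its $d\times d$ weight matrix, and sparsity in the off-diagonal weights reduces the number of active atoms further.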

Given the training data for each task, we optimize the metrics to minimize the loss function over all tasks while encouraging the metrics to share common knowledge in lifelong dictionary. Therefore, LML framework can be formulated as:

[TABLE]

where the $\left\|\cdot\right\|_{\mathrm{1,off}}$ -norm of $W_{t}$ defined as $\sum_{i\neq j}|W_{t,ij}|$ is used as a convex approximation to the true matrix sparsity, and $\left\|L_{0}\right\|_{F}=(\mathrm{tr}(L_{0}L_{0}^{T}))^{1/2}$ is the Frobenius norm of matrix $L_{0}$ to avoid overfitting. The trade-off parameter $\lambda_{t}\geq 0$ controls the regularization of $\left\|W_{t}\right\|_{\mathrm{1,off}}$ for all $t=1,\ldots,m$ . If $\lambda_{t}\rightarrow\infty$ , the task-specific matrices $W_{t}$ ’s become self-correlation diagonal matrices. With the definition of $\ell_{t}$ in Eq. (3), the final optimization problem of lifelong metric learning can be formulated as:

[TABLE]

where $(i,j)\in\mathcal{S}_{t}\cup\mathcal{D}_{t}$ are the side-information in the $t$ -th metric task.

IV Model Optimization

This section describes in detail how to optimize the proposed LML framework. Since the problem in Eq. (6) is not jointly convex with respect to $L_{0}$ and the $W_{t}$’s, the optimization can only be expected to reach a local optimum. A common approach for computing such a local optimum of the objective function in Eq. (6) is to alternate between two convex optimization steps: one in which $L_{0}$ is optimized with the $W_{t}$’s fixed, and another in which the $W_{t}$’s are optimized with $L_{0}$ fixed. However, as shown in [38], this approach is inefficient and inapplicable to lifelong learning with many tasks and data samples, because optimizing $L_{0}$ in Eq. (6) requires recomputing every $W_{t}$, which becomes increasingly time consuming as the number of learned tasks $m$ grows. To address this problem, we approximate Eq. (6) by applying the online passive aggressive (PA) [39] optimization strategy, i.e.,

[TABLE]

where $\eta$ is the learning rate. After linearizing the loss function $\ell_{t}$ around $L_{0}^{T}W_{t}L_{0}=M_{t}$ , we obtain the following new online function:

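$$\min_{L_{0},\{W_{t}\}}\frac{1}{m}\Big\{\sum_{t=1}^{m}\ell_{t}\Big(f_{t,ij}(M_{t})\Big)+\langle L_{0}^{T}W_{t}L_{0}-M_{t},G_{t}\rangle+\frac{1}{2\eta}\left\|L_{0}^{T}W_{t}L_{0}-M_{t}\right\|_{F}^{2}+\lambda_{t}\left\|W_{t}\right\|_{1,\mathrm{off}}\Big\}+\gamma\left\|L_{0}\right\|_{F}^{2}, \tag{8}$$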

where $G_{t}$ is the gradient of $\ell_{t}$ . We then rewrite the optimization problem in Eq. (8) as:

[TABLE]

In Eq. (9), we have suppressed the constant term of the linearized form (since it does not affect the minimum). Crucially, we have removed the dependence of the optimization problem in Eq. (6) on the numbers of data samples $n_{1},\ldots,n_{t}$ in each task. Additionally, Eq. (9) can be reformulated as:

[TABLE]

where $M_{t}^{*}=M_{t}-\eta G_{t}$ can be approximated from a large number of samples by online learning or from a small number of samples by offline mini-batch learning in the $t$-th task. Moreover, the optimization problem in Eq. (10) can be divided into two subproblems using an alternating direction optimization strategy. After initializing the lifelong dictionary $L_{0}$, the first subproblem is to compute the optimal $W_{t}$ for the newly arriving task $M_{t}^{*}$, and the second subproblem is to update the lifelong dictionary $L_{0}$ with the $W_{t}$’s fixed.
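The appearance of $M_{t}^{*}=M_{t}-\eta G_{t}$ can be understood as completing the square in the linearized objective. As a short worked step, write $X=L_{0}^{T}W_{t}L_{0}$ (a shorthand introduced here only for brevity) and expand the squared Frobenius norm:

```latex
\frac{1}{2\eta}\left\|X-(M_{t}-\eta G_{t})\right\|_{F}^{2}
  = \frac{1}{2\eta}\left\|X-M_{t}\right\|_{F}^{2}
  + \langle X-M_{t},\,G_{t}\rangle
  + \frac{\eta}{2}\left\|G_{t}\right\|_{F}^{2},
\qquad\text{so}\qquad
\langle X-M_{t},\,G_{t}\rangle+\frac{1}{2\eta}\left\|X-M_{t}\right\|_{F}^{2}
  = \frac{1}{2\eta}\left\|X-M_{t}^{*}\right\|_{F}^{2}
  - \frac{\eta}{2}\left\|G_{t}\right\|_{F}^{2}.
```

The last term does not depend on $L_{0}$ or $W_{t}$, so the linear and proximity terms together act as a single quadratic that pulls $L_{0}^{T}W_{t}L_{0}$ toward $M_{t}^{*}$, i.e., toward one gradient step from the current single-task metric.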

IV-A Lifelong Dictionary $L_{0}$ Initialization

A high-quality lifelong dictionary plays an important role in our model. To generate a set of discriminative basis vectors in $L_{0}$, we first divide the data into different clusters. For each cluster, we select the $J$ nearest neighbors from each class (with $J\in\{10,20,50\}$ to account for different scales), and apply Fisher discriminant analysis followed by eigenvalue decomposition to obtain the basis elements.

IV-B Solve $W_{t}$ with Given $L_{0}$

With the initialized lifelong dictionary $L_{0}$ , $W_{t}$ is the variable in this subproblem. The optimization function can be rewritten as:

$$\min_{W_{t}}\ f(W_{t})+g(W_{t}), \tag{11}$$

where $f(W_{t})=\frac{1}{2}\left\|L_{0}^{T}W_{t}L_{0}-M_{t}^{*}\right\|_{F}^{2}$ and $g(W_{t})=\lambda_{t}\left\|W_{t}\right\|_{1,\mathrm{off}}$. Because $g(W_{t})$ is non-smooth, we adopt the accelerated proximal gradient method (FISTA) [40], which has a fast global convergence rate, to solve this optimization problem. Specifically, the proximal operator of the $\ell_{1,\mathrm{off}}$-norm can be applied to solve this subproblem:

$$W_{t}^{i+1}=\arg\min_{W}\frac{1}{2}\left\|W-W_{t}^{i}+\eta_{i}\nabla f(W_{t}^{i})\right\|_{F}^{2}+\lambda_{t}\left\|W\right\|_{1,\mathrm{off}}, \tag{12}$$

where $\eta_{i}>0$ is the step size parameter in Eq. (12), which can be determined by the backtracking rule. The gradient $\nabla f(W_{t}^{i})$ of $f$ at $W_{t}^{i}$ can be expressed as:

$$\nabla f(W_{t}^{i})=L_{0}L_{0}^{T}W_{t}^{i}L_{0}L_{0}^{T}-L_{0}M_{t}L_{0}^{T}.$$

With the gradient of $f(W_{t}^{i})$ , the optimal $W_{t}^{i+1}$ depends on the proximity operator of the $\ell_{1,\mathrm{off}}$ -norm, i.e., soft thresholding operator:

[TABLE]

where $\odot$ denotes elementwise multiplication. Notice that FISTA maintains two sequences $\{W_{t}^{i}\}$ and $\{V_{t}^{i}\}$, in which $\{W_{t}^{i}\}$ are the approximate solutions and $\{V_{t}^{i}\}$ are the search points. The proximal method for solving $W_{t}$ is summarized in Algorithm 1.
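The iteration described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' MATLAB implementation: it replaces the backtracking rule with a constant stepsize $1/\|L_{0}\|_{2}^{4}$ (the reciprocal of the Lipschitz constant of $\nabla f$), uses the standard FISTA momentum sequence, and assumes that $\|W\|_{1,\mathrm{off}}$ is the sum of absolute off-diagonal entries, so its proximal operator soft-thresholds the off-diagonal entries and leaves the diagonal untouched. All function names are hypothetical.

```python
import numpy as np

def soft_threshold_off(A, tau):
    """Prox of tau * ||.||_{1,off}: shrink off-diagonal entries, keep the diagonal."""
    S = np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)
    S[np.diag_indices_from(S)] = np.diag(A)  # diagonal is not penalized (assumption)
    return S

def grad_f(W, L0, M_star):
    """Gradient of f(W) = 0.5 * ||L0^T W L0 - M*||_F^2 with respect to W."""
    return L0 @ (L0.T @ W @ L0 - M_star) @ L0.T

def fista_solve_W(L0, M_star, lam, n_iter=200):
    """FISTA with a constant stepsize; L0 has shape (d, d_hat), M_star (d_hat, d_hat)."""
    d = L0.shape[0]
    step = 1.0 / np.linalg.norm(L0, 2) ** 4  # 1 / Lipschitz constant of grad_f
    W = np.zeros((d, d))   # approximate solution W^i
    V = W.copy()           # search point V^i
    a = 1.0                # momentum parameter
    for _ in range(n_iter):
        W_next = soft_threshold_off(V - step * grad_f(V, L0, M_star), step * lam)
        a_next = (1.0 + np.sqrt(1.0 + 4.0 * a * a)) / 2.0
        V = W_next + ((a - 1.0) / a_next) * (W_next - W)
        W, a = W_next, a_next
    return W
```

A constant stepsize keeps the sketch short; a backtracking line search, as used in the paper, avoids computing the spectral norm of $L_{0}$.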

IV-C Solve $L_{0}$ with Given $\mathcal{T}_{t}$ ’s and $W_{t}$ ’s

In order to compute the lifelong dictionary $L_{0}$, we modify the formulation in Eq. (7) to remove the minimization over $W_{t}$. We also remove the second term, which is used to keep the new similarity / distance matrix close to the current one. We accomplish this by exploiting both the side-information $\mathcal{T}_{t}$ (generated according to the adopted base metric learning model) and the $W_{t}$ of the learned tasks. In the following, we adopt the gradient descent method to solve for $L_{0}$ in Eq. (7). The gradient of $\ell_{t}$ with respect to $L_{0}$ is:

[TABLE]

where $\triangle_{t}$ is calculated by the chosen SingleTaskLearner function from the side-information $\mathcal{T}_{t}$. The LML framework is summarized in Algorithm 2, where SingleTaskLearner is instantiated by one of the base metric models.
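The chain-rule part of this gradient can be sketched as follows, assuming the metric is parameterized as $M_{t}=L_{0}^{T}W_{t}L_{0}$ (as in $f(W_{t})$ in Section IV-B) and that $\triangle_{t}$ is the gradient of the base loss with respect to $M_{t}$. Any further terms that Eq. (14) may contain, such as a regularizer on $L_{0}$, are deliberately left out, and the function names and the learning rate `lr` are illustrative only.

```python
import numpy as np

def dictionary_gradient(L0, W, Delta):
    """Chain-rule gradient of loss(M) w.r.t. L0, where M = L0^T W L0.

    L0:    (d, d_hat) lifelong dictionary
    W:     (d, d) task-specific weight matrix
    Delta: (d_hat, d_hat) gradient of the base loss with respect to M
    """
    return W @ L0 @ Delta.T + W.T @ L0 @ Delta

def dictionary_step(L0, W, Delta, lr=0.01):
    """One plain gradient-descent update of the dictionary (hypothetical step size)."""
    return L0 - lr * dictionary_gradient(L0, W, Delta)
```

When both $W_{t}$ and $\triangle_{t}$ are symmetric, the expression reduces to $2W_{t}L_{0}\triangle_{t}$, whose two matrix products cost $O(\hat{d}d^{2}+\hat{d}^{2}d)$ operations.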

IV-D Computational Complexity

For the complexity of our proposed algorithm, the main computational cost of each update in Algorithm 2 comes from two subproblems: the optimization problem in Eq. (11) and the one in Eq. (14).

For the problem in Eq. (11), each update of the LML system begins by running the base metric learning model to compute $M_{t}$. We assume that this step has complexity $O(\xi(\hat{d},n))$, where $\hat{d}$ is the number of features, $n=\sum_{t=1}^{m}n_{t}$, and $n_{t}$ is the number of triplets in $\mathcal{T}_{t}$. Next, updating $W_{t}$ requires solving an instance of the lasso with the $\left\|W\right\|_{1,\mathrm{off}}$ regularizer. Each iteration of this problem begins with the computation of the gradient with respect to $W_{t}$, whose computational complexity is $O(d^{2}\hat{d}+2d^{3}+2\hat{d}^{3})$. Therefore, the cost of achieving $\epsilon$-accuracy is $O((d^{2}\hat{d}+2d^{3}+2\hat{d}^{3}+\xi(\hat{d},n))/\sqrt{\epsilon})$, where the number of iterations, $\mathcal{O}(1/\sqrt{\epsilon})$, is determined by the convergence property of the Accelerated Gradient Method, with

[TABLE]

In other words, the convergence rate of Algorithm 1 achieves $\mathcal{O}(1/T^{2})$, as shown in [40]. Moreover, the cost of multiplying two matrices can be further reduced with the Coppersmith-Winograd algorithm.
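The link between the $\mathcal{O}(1/T^{2})$ rate and the $\mathcal{O}(1/\sqrt{\epsilon})$ iteration count can be made explicit with the standard FISTA guarantee for a constant stepsize $1/L_{f}$. Here $F=f+g$, $L_{f}$ (the Lipschitz constant of $\nabla f$), $W_{t}^{(T)}$ (the iterate after $T$ steps) and $W_{t}^{\star}$ (a minimizer) are notation introduced only for this sketch:

```latex
\begin{aligned}
F\!\left(W_{t}^{(T)}\right) - F\!\left(W_{t}^{\star}\right)
  &\le \frac{2 L_{f}\,\bigl\|W_{t}^{(0)} - W_{t}^{\star}\bigr\|_{F}^{2}}{(T+1)^{2}},\\
\frac{2 L_{f}\,\bigl\|W_{t}^{(0)} - W_{t}^{\star}\bigr\|_{F}^{2}}{(T+1)^{2}} \le \epsilon
  &\iff T \ge \bigl\|W_{t}^{(0)} - W_{t}^{\star}\bigr\|_{F}\sqrt{\frac{2 L_{f}}{\epsilon}} - 1 .
\end{aligned}
```

So $T=\mathcal{O}(1/\sqrt{\epsilon})$ iterations suffice for $\epsilon$-accuracy, and multiplying by the per-iteration cost gives the total stated above.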

The optimization algorithm for solving $L_{0}$ involves the gradient of each triplet in $\mathcal{T}_{t}$ , and the computational complexity is $O(\hat{d}^{2}d+\hat{d}d^{2})$ .

Finally, the overall complexity of each update in Algorithm 2 is $O(\hat{d}^{2}d+\hat{d}d^{2}+(d^{2}\hat{d}+2d^{3}+2\hat{d}^{3}+\xi(\hat{d},n))/\sqrt{\epsilon})$ .

IV-E Discussion

In this section, we briefly review the learning method that is most related to our proposed learning algorithm. Perhaps the most relevant work to ours in the context of multi-task metric learning is SCML [37], which frames metric learning as learning a sparse combination of locally discriminative metrics that are generated from the training data via clustering. However, the motivations of SCML and our LML are significantly different:

• SCML aims to cast metric learning as learning a sparse combination of basis elements taken from a basis set $B=\{b_{i}\}_{i=1}^{K}$, where the $b_{i}$’s are $\hat{d}$-dimensional column vectors. Instead of fixing the number of metric tasks, our LML focuses on transferring knowledge from previously learned tasks to the new metric task using the shared basis, i.e., lifelong task learning.

• In SCML, the metric matrix is represented using basis matrices induced by an $\ell_{1}$-norm constraint, and the SCML formulation supports only batch learning. Our LML encourages communication among different basis elements via an $\ell_{1,\mathrm{off}}$-norm constraint, and the resulting formulation can integrate online sample learning by adopting an online metric learning model as the SingleTaskLearner. Furthermore, we have also conducted experiments on the effect of the $\ell_{1,\mathrm{off}}$-norm in the experiment section.

• The optimization algorithm in SCML can only find a local solution using all the training samples. The proposed algorithm for Eq. (10) can learn a new metric task without access to historical data, because only the gradient information of previous training data is used in the next iteration in Eq. (14).

V Experiments

In this section, we carry out empirical comparisons with state-of-the-art single and multi-task metric learning models. Table I lists the base metric learning models used within our lifelong metric learning framework, which rely on two different metric functions: ${s_{M_{t}}(x_{i},x_{j})=x_{i}^{T}M_{t}x_{j}}$ and ${d_{M_{t}}(x_{i},x_{j})=(x_{i}-x_{j})^{T}M_{t}(x_{i}-x_{j})}$, where $x_{i}$ and $x_{j}$ belong to the same class, while $x_{i}$ and $x_{k}$ are from different classes. The experiments are then conducted on a series of real datasets.
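The two metric functions, together with the OASIS triplet hinge loss listed in Table I, can be written as a minimal NumPy sketch. The function names are illustrative, and $M_{t}$ is passed as a plain array:

```python
import numpy as np

def similarity(M, xi, xj):
    """Bilinear similarity s_M(xi, xj) = xi^T M xj."""
    return float(xi @ M @ xj)

def distance(M, xi, xj):
    """Squared Mahalanobis-style distance d_M(xi, xj) = (xi - xj)^T M (xi - xj)."""
    diff = xi - xj
    return float(diff @ M @ diff)

def oasis_triplet_loss(M, xi, xj, xk):
    """Hinge loss max(0, 1 - s_M(xi, xj) + s_M(xi, xk)) for one triplet.

    xi and xj share a class; xk comes from a different class.
    """
    return max(0.0, 1.0 - similarity(M, xi, xj) + similarity(M, xi, xk))
```

The loss is zero once the same-class pair is more similar than the different-class pair by a margin of at least 1.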

V-A Comparison Algorithms and Evaluation

In our experiments, we compare our LML framework with single metric learning models and multi-task metric learning models. The single metric learning models include:

1) Euclidean distance (stEuc): the standard Euclidean distance in feature space; 2) OASIS (stOASIS) [9]: the classical online metric learning model given in Table I, run for $2\times 10^{4}$ iterations in our paper; 3) LMNN (stLMNN) [1]: Large Margin Nearest Neighbor classification, which learns a Mahalanobis distance for $k$-nearest neighbor classification; 4) SCML-global (stSCML) [37]: which simply combines the local basis elements into a higher-rank global metric; 5) LMNN-union (uLMNN): the LMNN metric obtained on the union of the training data of all tasks (i.e., “pooling” all the training data and ignoring the multi-task aspect); 6) SCML-union (uSCML): the SCML metric obtained on the union of the training data of all metric tasks.

For the multi-task metric learning models, the comparison models include:

• multi-task LMNN (mtLMNN) [11]: a common metric defined by $M_{0}$ picks up general trends across multiple datasets, and $M_{t}$ specializes the metric further for each particular task.

• multi-task SCML (mtSCML) [37]: this multi-task metric learning model assumes that all learned metrics can be expressed as combinations of the same basis subset $B$, though with different weights for each task.

For classical lifelong multi-task learning, we adopt the following comparison model:

• Lifelong multi-task (ELLA) [38]: whose formulation is realized by the following objective function:

[TABLE]

where $(x_{i}^{(t)},y_{i}^{(t)})$ is the $i$-th labeled training sample of the $t$-th task and $\mathcal{L}$ is a known loss function. Specifically, ELLA maintains a sparsely shared basis for all regression or logistic task models and transfers knowledge from this basis to learn the new $t$-th task.

All the models are implemented in MATLAB, and the code is available at the supplement website. Notice that all the parameters of the models are tuned in $\{10,1,0.1,0.01,0.001\}$ and selected via 5-fold cross validation. Although our model allows a different weight $\lambda_{t}$ for each task, throughout this paper we use a single value $\lambda=\lambda_{t}>0$ and tune only the two parameters $\gamma$ and $\lambda$. All the experiments are performed on a computer with 12 GB of RAM and an Intel i7 CPU.

V-B Real Datasets

According to whether the label set is consistent across tasks, we categorize the real datasets into two scenarios: label-consistent and label-inconsistent. In the following, we demonstrate the effectiveness of our proposed LML framework on both kinds of datasets.

Label-consistent datasets: the label set is shared by all the metric tasks. Such datasets can be roughly categorized into two cases: the same metric task, and different metric tasks with the same label set. Depending on whether the tasks are the same or not, we adopt two datasets in this paper. As shown in Table IV, Sentiment [41] consists of Amazon reviews on four product types (kitchen appliances, DVDs, books and electronics). We randomly split the dataset into training ($800$ samples), validation ($400$ samples) and testing ($400$ samples) sets. The Isolet dataset (http://www.cad.zju.edu.cn/home/dengcai/Data/MLData.html), a popular dataset for multi-task learning, consists of 5 disjoint subsets called isolet 1-5. We randomly split each task of the dataset into training ($10\%$ samples), validation ($20\%$ samples) and testing ($70\%$ samples) sets. Moreover, we set the basis number of stSCML to 100 and 500 for Sentiment and Isolet, respectively.

The experimental results averaged over five random repetitions are presented in Table II, and we can conclude that:

• Compared with the other competing methods, our proposed LML framework outperforms the state-of-the-art methods, with average errors of $20.3$ and $21.7$ and improvements of $2.6\%$ and $0.6\%$ in terms of classification error on the Sentiment and Isolet datasets, respectively, which verifies the effectiveness of our LML framework in a lifelong learning manner. Furthermore, the performance of our LML framework is also better than that of the existing lifelong learning model (ELLA), because the lifelong dictionary in the LML framework keeps on learning step by step.

• For the real datasets, Table II also compares the time consumption of our LML framework with that of the other single / multi-task metric models. Our LML framework is more efficient than most state-of-the-art methods because it does not need to retrain all the previous tasks. However, our LML framework is slightly slower than the multi-task metric model on Isolet, although it is faster than LMNN. This is because we use high-dimensional transformed features.

• The similarity metric function outperforms the distance metric function on the Sentiment dataset, which implies that the similarity metric may be important for different tasks; meanwhile, the distance metric outperforms the similarity metric on the Isolet dataset, which implies that the distance metric may be important for this metric task.

Label-inconsistent datasets: the label set of the incoming metric task is different from those of the learned metric tasks. The USPS dataset (http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html) consists of 7291 $16\times 16$ gray-scale images of the digits $0$-$9$. The features are $256$-d grayscale values. We split the classification problem into 4 tasks: $\{0,1\},\{2,3\},\{4,5,6\}$ and $\{7,8,9\}$. Therefore, the number of classes in each task is 2, 2, 3 and 3, respectively. We use a randomly selected $10\%$ of all the samples as training samples and the remainder for testing. From the results presented in Table III, we can see that our LML framework achieves the second best performance with an average error of $2.68$, only $0.02$ worse than the best one, mtSCML, and outperforms the other state-of-the-art methods by a large margin. The small gap to mtSCML arises because our LML trains the model using only the data from the one corresponding task, whereas mtSCML uses the data of all tasks together for model training. That is also why ours is more efficient than mtSCML, as shown in Table II.
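The task construction for USPS can be illustrated with a short sketch that groups sample indices by the four digit sets given above. The function name and the list-of-labels input format are assumptions made for illustration; loading the actual dataset is omitted.

```python
# Digit groups defining the four label-inconsistent USPS tasks.
TASK_DIGITS = [(0, 1), (2, 3), (4, 5, 6), (7, 8, 9)]

def split_into_tasks(labels):
    """Return, for each task, the indices of the samples whose digit belongs to it."""
    digit_to_task = {d: t for t, digits in enumerate(TASK_DIGITS) for d in digits}
    tasks = [[] for _ in TASK_DIGITS]
    for idx, y in enumerate(labels):
        tasks[digit_to_task[y]].append(idx)
    return tasks
```

Each task then sees only its own two or three digit classes, so no two tasks share a label.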

V-C Evaluating Lifelong Metric Learning Framework

In this subsection, we conduct comparisons on the proposed lifelong metric learning formulation, and study how the learned tasks impact its generalization performance.

V-C1 Effect of the $\left\|W\right\|_{1,\mathrm{off}}$ -norm Regularization

In order to study how the $\left\|W\right\|_{1,\mathrm{off}}$-norm regularization affects the performance of a single metric task, we compare the stSCML method with our proposed framework in Eq. (5) on the Isolet dataset. Specifically, we remove the regularization term of $L_{0}$ in Eq. (5), i.e., $\gamma\left\|L_{0}\right\|_{F}^{2}$, and employ FISTA [40] to efficiently optimize the resulting convex problem. We again randomly split each task of the Isolet dataset into training ($10\%$ samples), validation ($20\%$ samples) and testing ($70\%$ samples) sets, and the performance (averaged over 5 random repetitions) is presented in Fig. 3. In general, our model in Eq. (5) outperforms stSCML on all the single learned tasks except for task Isolet2. This observation verifies that the correlation information among different transformed features helps to improve the learning efficacy, i.e., it shows the effectiveness of the $\left\|W\right\|_{1,\mathrm{off}}$-norm.

V-C2 Effect of the Dimension of Transformed Features

In this subsection, we utilize the Sentiment dataset to evaluate how the dimension of the transformed features $d$ affects the performance of our LML framework (Ours+OASIS) in terms of classification error. Specifically, we again randomly split the dataset into training (800 samples), validation (400 samples), and testing (400 samples) sets. By varying the number of transformed features $d$ from 40 to 200, we present the performance (averaged over 5 random repetitions) in Fig. 4. Notice that the average classification error changes with the number of transformed features, which verifies that all the metric tasks should be embedded in a low-dimensional subspace, namely the lifelong dictionary in our LML framework. In addition, the average classification error is minimal when $d=120$, i.e., the performance of our LML is best. Beyond that point, the classification performance degrades as $d$ increases. This is because the larger $d$ is, the more redundant feature information can be involved in the lifelong dictionary $L$.

V-C3 Effect of the Number of Learned Tasks

In this subsection, we also adopt the Sentiment dataset to study how the number of learned tasks $t$ affects the classification performance of our LML framework. After splitting the dataset into training (800 samples), validation (400 samples) and testing (400 samples) sets, we set the sequence of the $t$ learned tasks as Books, DVD, Electronics and Kitchen, and present the classification performance (averaged over 5 random repetitions) in Fig. 5. As each metric task is added step by step, the error of our LML decreases, i.e., the performance of our LML framework improves gradually, which justifies that our LML framework can accumulate knowledge continuously and achieve lifelong learning like “human learning”. In addition, the performance of the early learned tasks improves more markedly than that of the succeeding tasks, i.e., the early tasks benefit from the accumulated knowledge.

VI CONCLUSION

In this paper, we study how to add a new metric task into an existing metric system without retraining the whole system, which is too time consuming in most state-of-the-art online metric learning models. Specifically, we propose the lifelong metric learning (LML) framework, which learns a “lifelong dictionary” as a shared basis for all metric models, based on the assumption that all metric tasks are retained in a low-dimensional common subspace. When a new metric task arrives, our LML transfers knowledge through the shared lifelong dictionary to learn the new metric model with sparsity regularization, and redefines the basis metrics with knowledge from the new metric task. After converting this convex problem into two subproblems via Online Passive Aggressive optimization, we adopt the proximal gradient method to solve our proposed LML framework. Through extensive experiments carried out on several multi-task datasets, we verify that our proposed framework is well suited to the lifelong learning problem and exhibits prominent performance in both effectiveness and efficiency.

Bibliography

[1] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207–244, 2009.
[2] Z.-J. Zha, T. Mei, M. Wang, Z. Wang, and X.-S. Hua, “Robust distance metric learning with auxiliary knowledge,” in International Joint Conference on Artificial Intelligence, 2009, pp. 1327–1332.
[3] M. S. Baghshah and S. B. Shouraki, “Semi-supervised metric learning using pairwise constraints,” in International Joint Conference on Artificial Intelligence, vol. 9. Citeseer, 2009, pp. 1217–1222.
[4] F. Wang and J. Sun, “Survey on distance metric learning and dimensionality reduction in data mining,” Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2015.
[5] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 209–216.
[6] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in CVPR. IEEE, 2009, pp. 413–420.
[7] Y. Luo, T. Liu, D. Tao, and C. Xu, “Decomposition-based transfer distance metric learning for image classification,” IEEE Transactions on Image Processing, vol. 23, no. 9, pp. 3789–3801, 2014.
[8] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online metric learning and fast similarity search,” in Advances in Neural Information Processing Systems, 2009, pp. 761–768.
