Heterogeneous Multi-task Metric Learning across Multiple Domains

Yong Luo; Yonggang Wen; Dacheng Tao

arXiv:1904.04081·stat.ML·April 9, 2019

TL;DR

This paper introduces a novel heterogeneous multi-task metric learning framework that effectively learns metrics across multiple diverse domains by leveraging high-order statistical information, improving transfer learning in real-world applications.

Contribution

The paper proposes a new HMTML framework that handles multiple heterogeneous domains simultaneously, utilizing high-order covariance to improve metric learning across diverse data sources.

Findings

01. HMTML outperforms existing methods in text categorization.

02. HMTML achieves better scene classification accuracy.

03. HMTML improves social image annotation results.

Tables

TABLE I: Average accuracy and macroF1 score of all domains of the compared methods at their best numbers (of common factors) on the RMLC dataset. In each domain, the number of labeled training samples for each category varies from 5 to 15.

Methods	Accuracy (5)	Accuracy (10)	Accuracy (15)	MacroF1 (5)	MacroF1 (10)	MacroF1 (15)
EU	0.382 $\pm$ 0.023	0.442 $\pm$ 0.012	0.491 $\pm$ 0.012	0.369 $\pm$ 0.030	0.423 $\pm$ 0.012	0.479 $\pm$ 0.022
LMNN	0.412 $\pm$ 0.020	0.461 $\pm$ 0.017	0.514 $\pm$ 0.017	0.400 $\pm$ 0.029	0.454 $\pm$ 0.021	0.495 $\pm$ 0.015
ITML	0.489 $\pm$ 0.038	0.561 $\pm$ 0.032	0.585 $\pm$ 0.026	0.486 $\pm$ 0.028	0.556 $\pm$ 0.024	0.580 $\pm$ 0.024
RDML	0.431 $\pm$ 0.014	0.500 $\pm$ 0.029	0.565 $\pm$ 0.013	0.412 $\pm$ 0.019	0.466 $\pm$ 0.023	0.550 $\pm$ 0.017
DAMA	0.544 $\pm$ 0.021	0.599 $\pm$ 0.029	0.602 $\pm$ 0.009	0.533 $\pm$ 0.013	0.583 $\pm$ 0.027	0.589 $\pm$ 0.012
MTDA	0.486 $\pm$ 0.031	0.598 $\pm$ 0.015	0.666 $\pm$ 0.008	0.493 $\pm$ 0.023	0.585 $\pm$ 0.016	0.647 $\pm$ 0.012
HMTML	0.559 $\pm$ 0.005	0.645 $\pm$ 0.013	0.686 $\pm$ 0.003	0.543 $\pm$ 0.015	0.626 $\pm$ 0.010	0.666 $\pm$ 0.007

TABLE II: Average accuracy and macroF1 score of all domains of the compared methods at their best numbers (of common factors) on the Scene-15 dataset. In each domain, the number of labeled training examples for each category varies from 4 to 8.

Methods	Accuracy (4)	Accuracy (6)	Accuracy (8)	MacroF1 (4)	MacroF1 (6)	MacroF1 (8)
EU	0.321 $\pm$ 0.018	0.337 $\pm$ 0.012	0.349 $\pm$ 0.010	0.309 $\pm$ 0.017	0.326 $\pm$ 0.012	0.337 $\pm$ 0.011
LMNN	0.326 $\pm$ 0.014	0.360 $\pm$ 0.006	0.392 $\pm$ 0.011	0.310 $\pm$ 0.010	0.342 $\pm$ 0.005	0.377 $\pm$ 0.011
ITML	0.336 $\pm$ 0.017	0.375 $\pm$ 0.009	0.389 $\pm$ 0.011	0.324 $\pm$ 0.017	0.365 $\pm$ 0.009	0.379 $\pm$ 0.011
RDML	0.322 $\pm$ 0.005	0.341 $\pm$ 0.004	0.358 $\pm$ 0.005	0.308 $\pm$ 0.009	0.332 $\pm$ 0.008	0.347 $\pm$ 0.007
DAMA	0.328 $\pm$ 0.009	0.377 $\pm$ 0.016	0.411 $\pm$ 0.010	0.320 $\pm$ 0.007	0.358 $\pm$ 0.012	0.391 $\pm$ 0.009
MTDA	0.350 $\pm$ 0.008	0.381 $\pm$ 0.006	0.394 $\pm$ 0.008	0.325 $\pm$ 0.011	0.351 $\pm$ 0.006	0.366 $\pm$ 0.008
HMTML	0.366 $\pm$ 0.008	0.406 $\pm$ 0.011	0.427 $\pm$ 0.004	0.334 $\pm$ 0.009	0.381 $\pm$ 0.012	0.401 $\pm$ 0.004

TABLE III: Average accuracy and macroF1 score of all domains of the compared methods at their best numbers (of common factors) on the NUS animal subset. In each domain, the number of labeled training instances for each concept varies from 4 to 8.

Methods	Accuracy (4)	Accuracy (6)	Accuracy (8)	MacroF1 (4)	MacroF1 (6)	MacroF1 (8)
EU	0.161 $\pm$ 0.011	0.172 $\pm$ 0.008	0.177 $\pm$ 0.009	0.153 $\pm$ 0.013	0.167 $\pm$ 0.009	0.175 $\pm$ 0.007
LMNN	0.169 $\pm$ 0.006	0.176 $\pm$ 0.001	0.185 $\pm$ 0.004	0.161 $\pm$ 0.008	0.173 $\pm$ 0.004	0.177 $\pm$ 0.005
ITML	0.172 $\pm$ 0.012	0.180 $\pm$ 0.008	0.188 $\pm$ 0.006	0.171 $\pm$ 0.012	0.177 $\pm$ 0.009	0.188 $\pm$ 0.007
RDML	0.164 $\pm$ 0.009	0.172 $\pm$ 0.009	0.184 $\pm$ 0.005	0.156 $\pm$ 0.005	0.160 $\pm$ 0.011	0.176 $\pm$ 0.006
DAMA	0.165 $\pm$ 0.008	0.171 $\pm$ 0.012	0.179 $\pm$ 0.003	0.168 $\pm$ 0.006	0.171 $\pm$ 0.004	0.175 $\pm$ 0.005
MTDA	0.164 $\pm$ 0.019	0.178 $\pm$ 0.009	0.188 $\pm$ 0.010	0.157 $\pm$ 0.022	0.177 $\pm$ 0.011	0.191 $\pm$ 0.010
HMTML	0.184 $\pm$ 0.010	0.191 $\pm$ 0.004	0.195 $\pm$ 0.004	0.178 $\pm$ 0.010	0.189 $\pm$ 0.003	0.198 $\pm$ 0.007

Equations

$$d_{A}(\mathbf{x}_{i},\mathbf{x}_{j})=(\mathbf{x}_{i}-\mathbf{x}_{j})^{T}A(\mathbf{x}_{i}-\mathbf{x}_{j}),$$

$$\mathcal{B}(i_{1},\ldots,i_{m-1},j_{m},i_{m+1},\ldots,i_{M})=\sum_{i_{m}=1}^{I_{m}}\mathcal{A}(i_{1},i_{2},\ldots,i_{M})\,U(j_{m},i_{m}).$$

$$\mathcal{B}=\mathcal{A}\times_{1}U_{1}\times_{2}U_{2}\ldots\times_{M}U_{M}.$$

$$\mathcal{B}(i_{1},\ldots,i_{m-1},i_{m+1},\ldots,i_{M})=\sum_{i_{m}=1}^{I_{m}}\mathcal{A}(i_{1},i_{2},\ldots,i_{M})\,\mathbf{u}(i_{m}).$$

$$\|\mathcal{A}\|_{F}^{2}=\langle\mathcal{A},\mathcal{A}\rangle=\sum_{i_{1}=1}^{I_{1}}\sum_{i_{2}=1}^{I_{2}}\ldots\sum_{i_{M}=1}^{I_{M}}\mathcal{A}(i_{1},i_{2},\ldots,i_{M})^{2}.$$

$$\min_{\{A_{m}\}_{m=1}^{M}}F(\{A_{m}\})=\sum_{m=1}^{M}\Psi(A_{m})+\gamma R(A_{1},A_{2},\ldots,A_{M}),\quad\mathrm{s.t.}\ A_{m}\succeq 0,\ m=1,2,\ldots,M,$$

$$\min_{U_{1},U_{2}}\frac{1}{P}\sum_{p=1}^{P}\|\mathbf{v}_{1}^{p}-\mathbf{v}_{2}^{p}\|_{2}^{2}+\sum_{m=1}^{2}\gamma_{m}\|U_{m}\|_{1},\quad\mathrm{s.t.}\ U_{1},U_{2}\succeq 0,$$

$$\min_{U_{1},U_{2}}\frac{1}{P}\sum_{p=1}^{P}\|\mathbf{w}_{1}^{p}-G\mathbf{w}_{2}^{p}\|_{2}^{2}+\sum_{m=1}^{2}\gamma_{m}\|U_{m}\|_{1},\quad\mathrm{s.t.}\ U_{1},U_{2}\succeq 0,$$

$$\min_{\{U_{m}\}_{m=1}^{M}}\frac{1}{P}\sum_{p=1}^{P}\|\mathbf{w}_{1}^{p}-\mathcal{G}\bar{\times}_{2}(\mathbf{w}_{2}^{p})^{T}\ldots\bar{\times}_{M}(\mathbf{w}_{M}^{p})^{T}\|_{2}^{2}+\sum_{m=1}^{M}\gamma_{m}\|U_{m}\|_{1},\quad\mathrm{s.t.}\ U_{m}\succeq 0,\ m=1,\ldots,M,$$

$$\min_{\{U_{m}\}_{m=1}^{M}}F(\{U_{m}\})=\sum_{m=1}^{M}\frac{1}{N_{m}^{\prime}}\sum_{k=1}^{N_{m}^{\prime}}g\left(y_{mk}(1-\delta_{mk}^{T}U_{m}U_{m}^{T}\delta_{mk})\right)+\frac{\gamma}{P}\sum_{p=1}^{P}\|\mathbf{w}_{1}^{p}-\mathcal{G}\bar{\times}_{2}(\mathbf{w}_{2}^{p})^{T}\ldots\bar{\times}_{M}(\mathbf{w}_{M}^{p})^{T}\|_{2}^{2}+\sum_{m=1}^{M}\gamma_{m}\|U_{m}\|_{1},\quad\mathrm{s.t.}\ U_{m}\succeq 0,\ m=1,2,\ldots,M.$$

$$\|\mathbf{w}_{1}^{p}-\mathcal{G}\bar{\times}_{2}(\mathbf{w}_{2}^{p})^{T}\ldots\bar{\times}_{M}(\mathbf{w}_{M}^{p})^{T}\|_{2}^{2}=\|\mathbf{w}_{1}^{p}\circ\mathbf{w}_{2}^{p}\ldots\circ\mathbf{w}_{M}^{p}-\mathcal{G}\|_{F}^{2},$$

$$\min_{\{U_{m}\}_{m=1}^{M}}F(\{U_{m}\})=\sum_{m=1}^{M}\frac{1}{N_{m}^{\prime}}\sum_{k=1}^{N_{m}^{\prime}}g\left(y_{mk}(1-\delta_{mk}^{T}U_{m}U_{m}^{T}\delta_{mk})\right)+\frac{\gamma}{P}\sum_{p=1}^{P}\|\mathcal{W}^{p}-\mathcal{E}_{r}\times_{1}U_{1}\times_{2}U_{2}\ldots\times_{M}U_{M}\|_{F}^{2}+\sum_{m=1}^{M}\gamma_{m}\|U_{m}\|_{1},\quad\mathrm{s.t.}\ U_{m}\succeq 0,\ m=1,2,\ldots,M,$$

$$\mathcal{G}=\mathcal{E}_{r}\times_{1}U_{1}\times_{2}U_{2}\ldots\times_{M}U_{M}=\mathcal{B}\times_{m}U_{m},$$

$$\min_{U_{m}}F(U_{m})=\Phi(U_{m})+\Omega(U_{m}),\quad\mathrm{s.t.}\ U_{m}\succeq 0,$$

$$l^{\sigma}(u_{ij})=\max_{Q\in\mathcal{Q}}\langle u_{ij},q_{ij}\rangle-\frac{\sigma}{2}q_{ij}^{2},$$

$$q_{ij}=\mathrm{median}\left\{\frac{u_{ij}}{\sigma},-1,1\right\}.$$

$$l^{\sigma}=\begin{cases}-u_{ij}-\frac{\sigma}{2},&u_{ij}<-\sigma;\\ u_{ij}-\frac{\sigma}{2},&u_{ij}>\sigma;\\ \frac{u_{ij}^{2}}{2\sigma},&\mathrm{otherwise}.\end{cases}$$

$$\frac{\partial l^{\sigma}(U)}{\partial U}=Q,$$

$$\frac{\partial\Phi(U)}{\partial U}=\frac{1}{N^{\prime}}\sum_{k}\frac{2y_{k}(\delta_{k}\delta_{k}^{T})U}{1+\exp(\rho z_{k})}+\frac{2\gamma}{P}\sum_{p}\left(UBB^{T}-W^{p}B^{T}\right),$$

$$\frac{\partial F^{\sigma}(U_{m})}{\partial U_{m}}=\frac{1}{N_{m}^{\prime}}\sum_{k}\frac{2y_{mk}(\delta_{mk}\delta_{mk}^{T})U_{m}}{1+\exp(\rho z_{mk})}+\frac{2\gamma}{P}\sum_{p}\left(U_{m}B_{(m)}B_{(m)}^{T}-W_{(m)}^{p}B_{(m)}^{T}\right)+\gamma_{m}Q_{m}.$$

$$U_{m}^{t+1}=\pi\left[U_{m}^{t}-\mu_{t}\nabla F^{\sigma}(U_{m}^{t})\right],$$

$$F^{\sigma}(U_{m}^{t+1})-F^{\sigma}(U_{m}^{t})\leq\kappa\nabla F^{\sigma}(U_{m}^{t})^{T}(U_{m}^{t+1}-U_{m}^{t}),$$

Full text

Heterogeneous Multi-task Metric Learning across Multiple Domains

Yong Luo, Yonggang Wen, and Dacheng Tao

Y. Luo and Y. Wen are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, e-mail: [email protected], [email protected]. D. Tao is with the Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, 81 Broadway Street, Ultimo, NSW 2007, Australia, e-mail: [email protected].

©2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

Distance metric learning (DML) plays a crucial role in diverse machine learning algorithms and applications. When the labeled information in target domain is limited, transfer metric learning (TML) helps to learn the metric by leveraging the sufficient information from other related domains. Multi-task metric learning (MTML), which can be regarded as a special case of TML, performs transfer across all related domains. Current TML tools usually assume that the same feature representation is exploited for different domains. However, in real-world applications, data may be drawn from heterogeneous domains. Heterogeneous transfer learning approaches can be adopted to remedy this drawback by deriving a metric from the learned transformation across different domains. But they are often limited in that only two domains can be handled. To appropriately handle multiple domains, we develop a novel heterogeneous multi-task metric learning (HMTML) framework. In HMTML, the metrics of all different domains are learned together. The transformations derived from the metrics are utilized to induce a common subspace, and the high-order covariance among the predictive structures of these domains is maximized in this subspace. There do exist a few heterogeneous transfer learning approaches that deal with multiple domains, but the high-order statistics (correlation information), which can only be exploited by simultaneously examining all domains, is ignored in these approaches. Compared with them, the proposed HMTML can effectively explore such high-order information, thus obtaining more reliable feature transformations and metrics. Effectiveness of our method is validated by the extensive and intensive experiments on text categorization, scene classification, and social image annotation.

Index Terms:

Heterogeneous domain, distance metric learning, multi-task, tensor, high-order statistics

I Introduction

The objective of distance metric learning (DML) is to find a measure to appropriately estimate the distance or similarity between data. DML is critical in various research fields, e.g., k-means clustering, k-nearest neighbor (kNN) classification, the sophisticated support vector machine (SVM) [1, 2] and learning to rank [3]. For instance, the kNN based models can be comparable or superior to other well-designed classifiers by learning proper distance metrics [4, 5]. Besides, it is demonstrated in [1] that learning the Mahalanobis metric for the RBF kernel in SVM without model selection consistently outperforms the traditional SVM-RBF, in which the hyperparameter is determined by cross validation.

Recently, several transfer metric learning (TML) methods [6, 7] have been proposed for DML in the case where the labeled data in the target domain (the domain of interest) is insufficient, while the labeled information in certain related but different source domains is abundant. In this scenario, the data distributions of the target and source domains may differ substantially, so traditional DML algorithms usually do not perform well. TML [6, 7] tries to reduce the distribution gap and help learn the target metric by utilizing the labeled data from the source domains. In particular, multi-task metric learning (MTML) [7] aims to simultaneously improve the metric learning of all (source and target) domains, since the labeled data for each of them is assumed to be scarce.

There is a main drawback in most of the current TML algorithms, i.e., samples of related domains are assumed to have the same feature dimension or share the same feature space. This assumption, however, is invalid in many applications. A typical example is the classification of multilingual documents. In this application, the feature representation of a document written in one language is different from that written in other languages since different vocabularies are utilized. Also, in some face recognition and object recognition applications, images may be collected under different environmental conditions, hence their representations have different dimensions. Besides, in natural image classification and multimedia retrieval, we often extract different kinds of features (such as global wavelet texture and local SIFT [8]) to represent the samples, which can also have different modalities (such as text, audio and image). Each feature space or modality can be regarded as a domain.

In the literature, various heterogeneous transfer learning [9, 10, 11] algorithms have been developed to deal with heterogeneous domains. In these approaches, a widely adopted strategy is to reduce the difference of heterogeneous domains [11] by transforming their representations into a common subspace. A metric can be derived from the transformation learned for each domain. Although they have achieved success in some applications, these approaches suffer from a major limitation: only two domains (one source and one target domain) can be handled. In practice, however, the number of domains is usually more than two. For instance, the news from the Reuters multilingual collection is written in five different languages. Besides, it is common to utilize various kinds of features (such as global, local, and biologically inspired) in visual analysis-based applications.

To remedy these drawbacks, we propose a novel heterogeneous multi-task metric learning (HMTML) framework, which is able to handle an arbitrary number of domains. In the proposed HMTML, the metrics of all domains are learned in a single optimization problem, where the empirical losses w.r.t. the metric for each domain are minimized. Meanwhile, we derive feature transformations from the metrics and use them to project the predictive structures of different domains into a common subspace. In this paper, all different domains are assumed to have the same application [10, 12], such as article classification with the same categories. Consequently, the predictive structures, which are parameterized by the weight vector of classifiers, should be close to each other in the subspace. Then a connection between different metrics is built by minimizing the divergence between the transformed predictive structures. Compared to the learning of different metrics separately, more reliable metrics can be obtained since different domains help each other in our method. This is particularly important for those domains with little label information.

There do exist a few approaches [10, 12] that can learn transformations and derive metrics for more than two domains. These approaches, however, only explore the statistics (correlation information) between pairs of representations in either a one-vs-one [10] or a centralized [12] way, while the high-order statistics, which can only be obtained by examining all domains simultaneously, are ignored. Our method is more advantageous than these approaches in that we directly analyze the covariance tensor over the prediction weights of all domains. This encodes the high-order correlation information in the learned transformations, and thus better performance can be expected. Experiments are conducted extensively on three popular applications, i.e., text categorization, scene classification, and social image annotation. We compare the proposed HMTML with not only the Euclidean (EU) baseline and representative DML (single domain) algorithms [4, 13, 14], but also two representative heterogeneous transfer learning approaches [10, 12] that can deal with more than two domains. The experimental results demonstrate the superiority of our method.

This paper differs from our conference works [15, 16] significantly since: 1) in [15], we need large amounts of unlabeled corresponding data to build domain connection. In this paper, we utilize the predictive structures to bridge different domains and thus only limited labeled data are required. Therefore, the critical regularization term that enables knowledge transfer is quite different from that in [15]; 2) in [16], we aim to learn features for different domains, instead of the distance metrics in this paper. The optimization problem is thus quite different since we directly optimize w.r.t. the distance metrics in this paper.

The rest of the paper is structured as follows. Some closely related works are summarized in Section II. In Section III, we first give an overall description of the proposed HMTML, and then present its formulation together with some analysis. Experimental results are reported and analyzed in Section IV, and we conclude this paper in Section V.

II Related Work

II-A Distance metric learning

The goal of distance metric learning (DML) [17] is to learn an appropriate distance function over the input space, so that the relationships between data are appropriately reflected. Most conventional metric learning methods, which are often called “Mahalanobis metric learning”, can be regarded as learning a linear transformation of the input data [18, 19]. The first work of Mahalanobis metric learning was done by Xing et al. [20], where a constrained convex optimization problem with no regularization was proposed. Some other representative algorithms include the neighborhood component analysis (NCA) [21], large margin nearest neighbors (LMNN) [4], information theoretic metric learning (ITML) [13], and regularized distance metric learning (RDML) [14].

To capture the nonlinear structure in the data, one can extend the linear metric learning methods to learn nonlinear transformations by adopting the kernel trick or “Kernel PCA (KPCA)” trick [22]. Alternatively, neural networks can be utilized to learn arbitrarily complex nonlinear mapping for metric learning [23]. Some other representative nonlinear metric learning approaches include the gradient-boosted LMNN (GB-LMNN) [24], Hamming distance metric learning (HDML) [25], and support vector metric learning (SVML) [1].

Recently, transfer metric learning (TML) has attracted intensive attention as a way to tackle the labeled data deficiency issue in the target domain [26, 7] or in all given related domains [27, 28, 7]. The latter is often called multi-task metric learning (MTML), and is the focus of this paper. The authors of [27] propose mt-LMNN, which is an extension of LMNN to the multi-task setting [29]. This is realized by representing the metric of each task as a summation of a shared Mahalanobis metric and a task-specific metric. The level of information to be shared across tasks is controlled by the trade-off hyper-parameters of the metric regularizations. Yang et al. [28] generalize mt-LMNN by utilizing von Neumann divergence based regularization to preserve geometry between the learned metrics. An implicit assumption of these methods is that the data samples of different domains lie in the same feature space. Thus these approaches cannot handle heterogeneous features. To remedy this drawback, we propose heterogeneous MTML (HMTML) inspired by heterogeneous transfer learning.

II-B Heterogeneous transfer learning

Developments in transfer learning [30] across heterogeneous feature spaces can be grouped into two categories: heterogeneous domain adaptation (HDA) [10, 11] and heterogeneous multi-task learning (HMTL) [12]. In HDA, there is usually only one target domain that has limited labeled data, and the aim is to utilize sufficient labeled data from related source domains to help the learning in the target domain. In HMTL, by contrast, the labeled data in all domains are scarce, and thus we treat different domains equally and make them help each other.

Most HDA methods only have two domains, i.e., one source and one target domain. The main idea in these methods is to either map the heterogeneous data into a common feature space by learning a feature mapping for each domain [9, 31], or map the data from the source domain to the target domain by learning an asymmetric transformation [32, 11]. The former is equivalent to Mahalanobis metric learning since each learned mapping could be used to derive a metric directly. Wang and Mahadevan [10] presented an HDA method based on manifold alignment (DAMA). This method manages several domains, and the feature mappings are learned by utilizing the labels shared by all domains to align each pair of manifolds. Compared with HDA, there are far fewer works on HMTL. One representative approach is the multi-task discriminant analysis (MTDA) [12], which extends linear discriminant analysis (LDA) to learn multiple tasks simultaneously. In MTDA, a common intermediate structure is assumed to be shared by the learned latent representations of different domains. MTDA can also deal with more than two domains, but is limited in that only the pairwise correlations (between each latent representation and the shared representation) are exploited. Therefore, the high-order correlations between all domains are ignored in both DAMA and MTDA. We develop the following tensor based heterogeneous multi-task metric learning framework to rectify this shortcoming.

III Heterogeneous multi-task metric learning

In DAMA [10] and MTDA [12], a linear transformation is learned for each domain and only pairwise correlation information is explored. In contrast, the tensor based heterogeneous MTML (HMTML) developed in this paper exploits the high-order tensor correlations for metric learning. Fig. 1 is an overview of the proposed method. Here, we take multilingual text classification as an example to illustrate the main idea. The number of labeled samples for each of the $M$ heterogeneous domains (e.g., “English”, “Italian”, and “German”) is assumed to be small. For the $m$ ’th domain, empirical losses w.r.t. the metric $A_{m}$ are minimized on the labeled set $\mathcal{D}_{m}$ . The obtained metrics may be unreliable due to the labeled data deficiency issue in each domain. Hence we let the different domains help each other to learn improved metrics by sharing information across them. This is performed by first constructing multiple prediction (such as binary classification) problems in the $m$ ’th domain, and a set of prediction weight vectors $\{\mathbf{w}_{m}^{p}\}_{p=1}^{P}$ is obtained by training the problems on $\mathcal{D}_{m}$ . Here, $\mathbf{w}_{m}^{p}$ is the weight vector of the $p$ ’th base classifier, and we assume each of the $P$ base classifiers is linear, i.e., $f^{p}(\mathbf{x}_{m})=(\mathbf{w}_{m}^{p})^{T}\mathbf{x}_{m}$ . We follow [11] to generate the $P$ prediction problems, where the well-known Error Correcting Output Codes (ECOC) scheme is adopted. Then we decompose the metric $A_{m}$ as $A_{m}=U_{m}U_{m}^{T}$ , and the obtained $U_{m}$ is used to transform the weight vectors into a common subspace, i.e., $\{\mathbf{v}_{m}^{p}=U_{m}^{T}\mathbf{w}_{m}^{p}\}_{p=1}^{P}$ . As illustrated in Section I, the application is the same for different domains. Therefore, after being transformed into the common subspace, the weight vectors should be close to each other, i.e., all $\{\mathbf{v}_{1}^{p},\mathbf{v}_{2}^{p},\ldots,\mathbf{v}_{M}^{p}\}$ should be similar. Finally, by maximizing the tensor based high-order covariance between all $\{\mathbf{v}_{m}^{p}\}$ , we obtain improved $U_{m}^{\ast}$ , where additional information from other domains is involved. A more reliable metric can then be obtained since $A_{m}^{\ast}=U_{m}^{\ast}(U_{m}^{\ast})^{T}$ .

In short, ECOC generates a binary “codebook” to encode each class as a binary string, and trains multiple binary classifiers according to the “codebook”. The test sample is classified using all the trained classifiers and the results are concatenated to obtain a binary string, which is decoded for final prediction [33]. In this paper, the classification weights of all the trained binary classifiers are used to build domain connections. It should be noted that the learned weights may not be reliable if given limited labeled samples. Fortunately, we can construct sufficient base classifiers using the ECOC scheme, and thus robust transformations can be obtained even if some learned base classifiers are inaccurate or incorrect [11]. The reasons why the decomposition $U_{m}$ of the metric $A_{m}$ can be used to transform the parameters of classifiers are mainly based on two points: 1) Mahalanobis metric learning can be formulated as learning a linear transformation [18]. In the literature of distance metric learning (DML), most methods focus on learning the Mahalanobis distance, which is often denoted as

$$d_{A}(\mathbf{x}_{i},\mathbf{x}_{j})=(\mathbf{x}_{i}-\mathbf{x}_{j})^{T}A(\mathbf{x}_{i}-\mathbf{x}_{j}),$$

where $A$ is the metric, which is a positive semi-definite matrix and can be factorized as $A=UU^{T}$ . By applying some simple algebraic manipulations, we have $d_{A}(\mathbf{x}_{i},\mathbf{x}_{j})=\|U^{T}\mathbf{x}_{i}-U^{T}\mathbf{x}_{j}\|_{2}^{2}$ ; 2) according to the multi-task learning methods presented in [34, 35], the transformation learned in the parameter space can also be interpreted as defining a mapping in the feature space. Therefore, the linear transformation $U_{m}$ can be used to transform the parameters of classifiers. The technical details of the proposed method are given below, and we start by briefly introducing the notations and concepts of multilinear algebra that are frequently used in this paper.
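The link between a Mahalanobis metric and a linear map can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the sizes `d` and `r`, the random data, and the helper names are assumptions made only for illustration. It takes a factor $U$ of size $d\times r$, forms $A=UU^{T}$, and compares the quadratic form with the squared Euclidean distance between the mapped points $U^{T}\mathbf{x}$.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, A):
    """Squared Mahalanobis distance (x_i - x_j)^T A (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ A @ diff)

def transformed_sq(x_i, x_j, U):
    """Squared Euclidean distance after mapping each point x to U^T x."""
    diff = U.T @ x_i - U.T @ x_j
    return float(diff @ diff)

rng = np.random.default_rng(0)
d, r = 6, 3                       # illustrative sizes (assumed)
U = rng.random((d, r))            # d x r factor, so A = U U^T is PSD
A = U @ U.T
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

d_metric = mahalanobis_sq(x_i, x_j, A)   # distance under the metric A
d_linear = transformed_sq(x_i, x_j, U)   # distance in the r-dim subspace
```

The two values agree up to floating-point error, which is why learning $U$ (here with $r<d$ columns) is one way to learn the metric $A$.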

III-A Notations

If $\mathcal{A}$ is an $M$ -th order tensor of size $I_{1}\times I_{2}\times\ldots\times I_{M}$ , and $U$ is a $J_{m}\times I_{m}$ matrix, then the $m$ -mode product of $\mathcal{A}$ and $U$ is signified as $\mathcal{B}=\mathcal{A}\times_{m}U$ , which is also an $M$ -th order tensor of size $I_{1}\times\ldots\times I_{m-1}\times J_{m}\times I_{m+1}\ldots\times I_{M}$ with the entry

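$$\mathcal{B}(i_{1},\ldots,i_{m-1},j_{m},i_{m+1},\ldots,i_{M})=\sum_{i_{m}=1}^{I_{m}}\mathcal{A}(i_{1},i_{2},\ldots,i_{M})\,U(j_{m},i_{m}).$$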

The product of $\mathcal{A}$ and a set of matrices $\{U_{m}\in\mathbb{R}^{J_{m}\times I_{m}}\}_{m=1}^{M}$ is given by

$$\mathcal{B}=\mathcal{A}\times_{1}U_{1}\times_{2}U_{2}\ldots\times_{M}U_{M}.$$

The mode- $m$ matricization of $\mathcal{A}$ is a matrix $A_{(m)}$ of size $I_{m}\times(I_{1}\ldots I_{m-1}I_{m+1}\ldots I_{M})$ . We can regard the $m$ -mode multiplication $\mathcal{B}=\mathcal{A}\times_{m}U$ as matrix multiplication in the form of $B_{(m)}=UA_{(m)}$ . For an $I_{m}$ -vector $\mathbf{u}$ , the contracted $m$ -mode product of $\mathcal{A}$ and $\mathbf{u}$ is denoted as $\mathcal{B}=\mathcal{A}\bar{\times}_{m}\mathbf{u}$ , which is an $(M-1)$ -th order tensor of size $I_{1}\times\ldots\times I_{m-1}\times I_{m+1}\ldots\times I_{M}$ . The elements are calculated by

$$\mathcal{B}(i_{1},\ldots,i_{m-1},i_{m+1},\ldots,i_{M})=\sum_{i_{m}=1}^{I_{m}}\mathcal{A}(i_{1},i_{2},\ldots,i_{M})\,\mathbf{u}(i_{m}).$$

Finally, the Frobenius norm of the tensor $\mathcal{A}$ is calculated as

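$$\|\mathcal{A}\|_{F}^{2}=\langle\mathcal{A},\mathcal{A}\rangle=\sum_{i_{1}=1}^{I_{1}}\sum_{i_{2}=1}^{I_{2}}\ldots\sum_{i_{M}=1}^{I_{M}}\mathcal{A}(i_{1},i_{2},\ldots,i_{M})^{2}.$$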
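The multilinear operations in this subsection map onto a few NumPy calls. The sketch below is illustrative only: the helper names `unfold`, `mode_product`, and `contracted_product` are hypothetical, modes are 0-based in the code (1-based in the text), and the column ordering used by `unfold` is one of several valid conventions.

```python
import numpy as np

def unfold(T, m):
    """Mode-m matricization: mode m becomes the rows (I_m x product of the rest)."""
    return np.moveaxis(T, m, 0).reshape(T.shape[m], -1)

def mode_product(T, U, m):
    """m-mode product T x_m U for a J_m x I_m matrix U (m is 0-based)."""
    out = np.tensordot(U, T, axes=(1, m))   # the new J_m axis comes first
    return np.moveaxis(out, 0, m)           # move it back to position m

def contracted_product(T, u, m):
    """Contracted m-mode product with an I_m-vector u; mode m is summed out."""
    return np.tensordot(T, u, axes=(m, 0))

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4, 5))              # third-order tensor (sizes assumed)
U = rng.normal(size=(2, 4))                 # acts on the mode of size 4
B = mode_product(A, U, 1)                   # size 3 x 2 x 5
b = contracted_product(A, rng.normal(size=4), 1)   # size 3 x 5
fro_sq = float(np.sum(A ** 2))              # squared Frobenius norm of A
```

With these helpers, `unfold(B, 1)` equals `U @ unfold(A, 1)`, matching the matricized form $B_{(m)}=UA_{(m)}$. For a second-order tensor (a matrix) $E$, the product with two matrices reduces to $U_{1}EU_{2}^{T}$.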

III-B Problem formulation

Suppose there are $M$ heterogeneous domains, and the labeled training set for the $m$ ’th domain is $\mathcal{D}_{m}=\{(\mathbf{x}_{mn},y_{mn})\}_{n=1}^{N_{m}}$ , where $\mathbf{x}_{mn}\in\mathbb{R}^{d_{m}}$ and its corresponding class label $y_{mn}\in\{1,2,\ldots,C\}$ . The label set is the same for the different heterogeneous domains [10, 12]. Then we have the following general formulation for the proposed HMTML

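$$\min_{\{A_{m}\}_{m=1}^{M}}F(\{A_{m}\})=\sum_{m=1}^{M}\Psi(A_{m})+\gamma R(A_{1},A_{2},\ldots,A_{M}),\quad\mathrm{s.t.}\ A_{m}\succeq 0,\ m=1,2,\ldots,M,\tag{5}$$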

where $\Psi(A_{m})=\frac{2}{N_{m}(N_{m}-1)}\sum_{i<j}L(A_{m};\mathbf{x}_{mi},\mathbf{x}_{mj},y_{mij})$ is the empirical loss w.r.t. $A_{m}$ in the $m$ ’th domain, $y_{mij}=\pm 1$ indicates $\mathbf{x}_{mi}$ and $\mathbf{x}_{mj}$ are similar/dissimilar to each other, and $R(A_{1},A_{2},\ldots,A_{M})$ is a carefully chosen regularizer that enables knowledge transfer across different domains. Here, $y_{mij}$ is obtained according to whether $\mathbf{x}_{mi}$ and $\mathbf{x}_{mj}$ belong to the same class or not. Following [14], we choose $L(A_{m};\mathbf{x}_{mi},\mathbf{x}_{mj},y_{mij})=g(y_{mij}[1-\|\mathbf{x}_{mi}-\mathbf{x}_{mj}\|_{A_{m}}^{2}])$ and adopt the generalized log loss (GL-loss) [36] for $g$ , i.e., $g(z)=\frac{1}{\rho}\mathrm{log}(1+\mathrm{exp}(-\rho z))$ . Here, $\|\mathbf{x}_{mi}-\mathbf{x}_{mj}\|_{A_{m}}^{2}=(\mathbf{x}_{mi}-\mathbf{x}_{mj})^{T}A_{m}(\mathbf{x}_{mi}-\mathbf{x}_{mj})$ . The GL-loss is a smooth version of the hinge loss, and the smoothness is controlled by the hyper-parameter $\rho$ , which is set as $3$ in this paper. For notational simplicity, we denote $\mathbf{x}_{mi}$ , $\mathbf{x}_{mj}$ and $y_{mij}$ as $\mathbf{x}_{mk}^{1}$ , $\mathbf{x}_{mk}^{2}$ and $y_{mk}$ respectively, where $k=1,2,\ldots,N_{m}^{\prime}=\frac{N_{m}(N_{m}-1)}{2}$ . We also set $\delta_{mk}=\mathbf{x}_{mk}^{1}-\mathbf{x}_{mk}^{2}$ so that $\|\mathbf{x}_{mk}^{1}-\mathbf{x}_{mk}^{2}\|_{A_{m}}^{2}=\delta_{mk}^{T}A_{m}\delta_{mk}$ , and the loss term becomes $\Psi(A_{m})=\frac{1}{N_{m}^{\prime}}\sum_{k=1}^{N_{m}^{\prime}}g\left(y_{mk}(1-\delta_{mk}^{T}A_{m}\delta_{mk})\right)$ .
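The empirical loss for one domain can be written in a few lines. This is a hedged sketch under stated assumptions rather than the authors' implementation: the function names and the toy data are hypothetical, and `np.logaddexp` is used only as a numerically stable way to evaluate $\log(1+\exp(-\rho z))$.

```python
import numpy as np

def gl_loss(z, rho=3.0):
    """Generalized log loss g(z) = log(1 + exp(-rho * z)) / rho."""
    return np.logaddexp(0.0, -rho * z) / rho

def empirical_loss(X, y, A, rho=3.0):
    """Average GL-loss over all N(N-1)/2 sample pairs of one domain."""
    n = X.shape[0]
    i, j = np.triu_indices(n, k=1)                    # all pairs with i < j
    delta = X[i] - X[j]                               # row k holds delta_k
    pair_label = np.where(y[i] == y[j], 1.0, -1.0)    # +1 similar, -1 dissimilar
    dist = np.einsum("kd,de,ke->k", delta, A, delta)  # delta_k^T A delta_k
    return float(np.mean(gl_loss(pair_label * (1.0 - dist), rho)))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 toy samples with 4 features (assumed)
y = rng.integers(0, 3, size=8)       # toy class labels
loss_identity = empirical_loss(X, y, np.eye(4))   # loss of the Euclidean metric
```

For strongly negative arguments the GL-loss approaches $-z$, as the hinge loss does, while it stays differentiable everywhere.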

To transfer information across different domains, $P$ binary classification problems are constructed and a set of classifiers $\{\mathbf{w}_{m}^{p}\}_{p=1}^{P}$ is learned for each of the $M$ domains using the labeled training data. This process can be carried out offline and thus does not affect the computational cost of the subsequent metric learning. Since the metric is positive semi-definite, it can be factorized as $A_{m}=U_{m}U_{m}^{T}$, and we propose to use $U_{m}$ to transform $\mathbf{w}_{m}^{p}$, which leads to $\mathbf{v}_{m}^{p}=U_{m}^{T}\mathbf{w}_{m}^{p}$. The divergence among $\{\mathbf{v}_{1}^{p},\mathbf{v}_{2}^{p},\ldots,\mathbf{v}_{M}^{p}\}$ is then minimized. In the following, we first derive the formulation for $M=2$ (two domains), and then generalize it to $M>2$.

When $M=2$ , we have the following formulation:

[TABLE]

where $U_{m}\in\mathbb{R}^{d_{m}\times r}$ is the transformation of the $m$'th domain, and $r$ is the number of common factors shared by the different domains. Here, $\{\gamma_{m}>0\}$ are trade-off hyper-parameters, $\|U_{m}\|_{1}=\sum_{i=1}^{d_{m}}\sum_{j=1}^{r}|u_{mij}|$ is the $l_{1}$-norm, and $\succeq$ indicates that all entries of $U_{m}$ are non-negative. The transformation $U_{m}$ is restricted to be sparse since the features in different domains usually have sparse correspondences. The non-negativity constraints narrow the hypothesis space and improve the interpretability of the results.

To generalize to $M>2$, we first rewrite (6) as

[TABLE]

where $G=U_{1}E_{r}U_{2}^{T}$, and $E_{r}$ is an identity matrix of size $r$. Using tensor notation, we have $G=E_{r}\times_{1}U_{1}\times_{2}U_{2}$, so the counterpart of (7) for $M>2$ is given by

[TABLE]

where $\mathcal{G}=\mathcal{E}_{r}\times_{1}U_{1}\times_{2}U_{2}\ldots\times_{M}U_{M}$ is a transformation tensor and $\mathcal{E}_{r}\in\mathbb{R}^{r\times r\times\ldots\times r}$ is an identity tensor (the diagonal elements are $1$, and all other entries are $0$). A specific optimization problem for HMTML can be obtained by choosing the regularizer in (5) as the objective of (8), i.e.,

[TABLE]

Because (9) is inconvenient to optimize directly, we reformulate it using the following theorem.

Theorem 1

If $\|\mathbf{w}_{m}^{p}\|_{2}^{2}=1,p=1,\ldots,P;m=1,\ldots,M$ , then we have:

[TABLE]

where $\circ$ is the outer product.

The proof is given in the supplementary material.

By substituting (10) into (9) and replacing $\mathcal{G}$ with $\mathcal{E}_{r}\times_{1}U_{1}\times_{2}U_{2}\ldots\times_{M}U_{M}$ , we obtain the following reformulation of (9):

[TABLE]

where $\mathcal{W}^{p}=\mathbf{w}_{1}^{p}\circ\mathbf{w}_{2}^{p}\ldots\circ\mathbf{w}_{M}^{p}$ is the covariance tensor of the prediction weights of all different domains. Intuitively, a latent subspace shared by all domains can be found by minimizing the second term in (11). In this subspace, the representations of different domains are close to each other and knowledge is transferred. Hence different domains can help each other to learn an improved transformation $U_{m}$, and in turn a better distance metric $A_{m}$.
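To make the tensor notation concrete, the following NumPy sketch builds the covariance tensor $\mathcal{W}^{p}$ and the transformation tensor $\mathcal{G}$ for three toy domains and evaluates the averaged squared Frobenius distance between them. The helper names `outer_all` and `transformation_tensor`, the dimensions, and the random weights are assumptions made for illustration. It relies on the fact that an identity (super-diagonal) core gives $\mathcal{G}=\sum_{k=1}^{r}U_{1}[:,k]\circ\ldots\circ U_{M}[:,k]$.

```python
import numpy as np
from functools import reduce

def outer_all(vectors):
    """w_1 o w_2 o ... o w_M as an M-way array."""
    return reduce(np.multiply.outer, vectors)

def transformation_tensor(Us):
    """G = E_r x_1 U_1 ... x_M U_M, i.e. a sum of r rank-one tensors."""
    r = Us[0].shape[1]
    return sum(outer_all([U[:, k] for U in Us]) for k in range(r))

# Toy setting (assumed): M = 3 domains, r = 2 common factors, P = 6 classifiers.
rng = np.random.default_rng(0)
dims, r, P = (5, 4, 3), 2, 6
Us = [rng.random((d, r)) for d in dims]
# ws[m][p]: unit-norm weight vector of the p-th base classifier in domain m
ws = [[v / np.linalg.norm(v) for v in rng.normal(size=(P, d))] for d in dims]

G = transformation_tensor(Us)
divergence = np.mean([
    np.sum((outer_all([ws[m][p] for m in range(len(dims))]) - G) ** 2)
    for p in range(P)
])
```

For $M=2$ the same construction reduces to the matrix $U_{1}U_{2}^{T}$, which matches $G=U_{1}E_{r}U_{2}^{T}$ in the two-domain case. Note that $\mathcal{G}$ has $\prod_{m}d_{m}$ entries, so forming it explicitly is only practical for small feature dimensions.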

III-C Optimization algorithm

Problem (11) can be solved using an alternating optimization strategy. That is, only one variable $U_{m}$ is updated at a time and all the other $U_{m^{\prime}}$ , $m^{\prime}\neq m$ are fixed. This updating procedure is conducted iteratively for each variable. According to [37], we have

$\mathcal{G}=\mathcal{B}\times_{m}U_{m},$

where $\mathcal{B}=\mathcal{E}_{r}\times_{1}U_{1}\ldots\times_{m-1}U_{m-1}\times_{m+1}U_{m+1}\ldots\times_{M}U_{M}$. By applying the matricizing property of the tensor-matrix product, we have $G_{(m)}=U_{m}B_{(m)}$. Besides, it is easy to verify that $\|\mathcal{W}^{p}-\mathcal{G}\|_{F}^{2}=\|W_{(m)}^{p}-G_{(m)}\|_{F}^{2}$. Therefore, the sub-problem of (11) w.r.t. $U_{m}$ becomes:

$\min_{U_{m}}\ F(U_{m})=\Phi(U_{m})+\Omega(U_{m}),\quad\mathrm{s.t.}\ U_{m}\succeq 0,\qquad(12)$

where $\Phi(U_{m})=\frac{1}{N_{m}^{\prime}}\sum_{k=1}^{N_{m}^{\prime}}g\left(y_{mk}(1-\delta_{mk}^{T}U_{m}U_{m}^{T}\delta_{mk})\right)+\frac{\gamma}{P}\sum_{p=1}^{P}\|W_{(m)}^{p}-U_{m}B_{(m)}\|_{F}^{2}$, and $\Omega(U_{m})=\gamma_{m}\|U_{m}\|_{1}$. We propose to solve problem (12) efficiently by utilizing the projected gradient method (PGM) presented in [38]. However, the term $\Omega(U_{m})$ is non-differentiable, so we first smooth it according to [39]. For notational clarity, we omit the subscript $m$ in the following derivation. According to [39], the smoothed version of the $l_{1}$-norm $l(u_{ij})=|u_{ij}|$ can be given by

[TABLE]

where $\mathcal{Q}=\{Q:-1\leq q_{ij}\leq 1,Q\in\mathbb{R}^{d\times r}\}$ and $\sigma$ is the smoothing hyper-parameter, which we set to $0.5$ empirically in our implementation according to the comprehensive study of the smoothed $l_{1}$-norm in [40]. By setting the derivative of the objective function of (13) w.r.t. $q_{ij}$ to zero and then projecting $q_{ij}$ onto $\mathcal{Q}$, we obtain the following solution,

[TABLE]

By substituting the solution (14) back into (13), we have the piece-wise approximation of $l$ , i.e.,

[TABLE]

To utilize the PGM for optimization, we have to compute the gradient of the smoothed $l_{1}$ -norm to determine the descent direction. We summarize the results in the following theorem.

Theorem 2

The gradient of smoothed $l(U)=\|U\|_{1}=\sum_{i=1}^{d}\sum_{j=1}^{r}l(u_{ij})$ is

[TABLE]

where $Q$ is the matrix defined in (13).
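As an illustration of the smoothing step, the sketch below assumes the standard Nesterov form $l^{\sigma}(u)=\max_{-1\leq q\leq 1}\,uq-\frac{\sigma}{2}q^{2}$, whose maximizer is $u/\sigma$ clipped to $[-1,1]$. The exact expressions of (13)-(16) are not reproduced here, so both this form and the function name `smoothed_l1` should be read as assumptions. Under this form the gradient of the smoothed norm is exactly the matrix of maximizers $Q$.

```python
import numpy as np

def smoothed_l1(U, sigma=0.5):
    """Smoothed l1-norm of U and its gradient (assumed Nesterov smoothing)."""
    Q = np.clip(U / sigma, -1.0, 1.0)             # maximizer, projected onto [-1, 1]
    value = np.sum(U * Q - 0.5 * sigma * Q ** 2)  # plug the maximizer back in
    return value, Q                               # Q is d(value)/dU

# Piecewise behaviour: quadratic near zero, |u| - sigma/2 away from zero.
val_inner, _ = smoothed_l1(np.array([[0.25]]))    # |u| <= sigma
val_outer, _ = smoothed_l1(np.array([[-1.0]]))    # |u| >  sigma
```

With $\sigma=0.5$, the approximation differs from $|u|$ by at most $\sigma/2=0.25$ per entry, and the gap shrinks as $\sigma\to 0$ at the price of a less smooth objective.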

In addition, we summarize the gradient of $\Phi(U)$ as follows,

Theorem 3

The gradient of $\Phi(U)$ w.r.t. $U$ is

[TABLE]

where $z_{k}=y_{k}(1-\delta_{k}^{T}UU^{T}\delta_{k})$ .
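The following sketch re-derives a gradient of $\Phi(U)$ from its definition by the chain rule and is meant to be checked against finite differences; it is not copied from the theorem. It uses $g^{\prime}(z)=-1/(1+\exp(\rho z))$ and $\partial z_{k}/\partial U=-2y_{k}\delta_{k}\delta_{k}^{T}U$. The function names, argument layout (`Ws` holds the matricized $W_{(m)}^{p}$, `B` is $B_{(m)}$), and the toy data are assumptions.

```python
import numpy as np

def phi(U, delta, y, Ws, B, gamma, rho=3.0):
    """Phi(U): mean GL-loss over pairs plus (gamma / P) * sum_p ||W^p - U B||_F^2."""
    z = y * (1.0 - np.sum((delta @ U) ** 2, axis=1))
    loss = np.mean(np.logaddexp(0.0, -rho * z)) / rho
    div = np.mean([np.sum((W - U @ B) ** 2) for W in Ws])
    return loss + gamma * div

def phi_grad(U, delta, y, Ws, B, gamma, rho=3.0):
    """Gradient of phi w.r.t. U, derived by the chain rule."""
    z = y * (1.0 - np.sum((delta @ U) ** 2, axis=1))
    sig = np.exp(-np.logaddexp(0.0, rho * z))        # 1 / (1 + exp(rho z)), stable
    coef = 2.0 * y * sig                             # per-pair weight
    grad_loss = (delta * coef[:, None]).T @ (delta @ U) / len(y)
    grad_div = 2.0 * (U @ (B @ B.T) - np.mean(Ws, axis=0) @ B.T)
    return grad_loss + gamma * grad_div

# Toy problem (assumed sizes): d = 4, r = 2, 10 pairs, P = 3.
rng = np.random.default_rng(0)
U = rng.random((4, 2))
delta = rng.normal(size=(10, 4))
y = rng.choice([-1.0, 1.0], size=10)
Ws = rng.normal(size=(3, 4, 6))
B = rng.random((2, 6))
grad = phi_grad(U, delta, y, Ws, B, gamma=0.1)
```

Only $B_{(m)}B_{(m)}^{T}$ and the average of $W_{(m)}^{p}B_{(m)}^{T}$ enter the divergence part of the gradient, so both can be cached while $U_{m}$ is being updated.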

The proofs of both theorems are given in the supplementary material. Combining them, the gradient of the smoothed objective $F^{\sigma}(U_{m})$ is

[TABLE]

Based on the obtained gradient, we apply the improved PGM presented in [38] to minimize the smoothed primal $F^{\sigma}(U_{m})$ , i.e.,

$U_{m}^{t+1}=\pi\left[U_{m}^{t}-\mu_{t}\nabla F^{\sigma}(U_{m}^{t})\right],\qquad(19)$

where the operator $\pi[x]$ projects all the negative entries of $x$ to zero, and $\mu_{t}$ is the step size that must satisfy the following condition:

[TABLE]

where the hyper-parameter $\kappa$ is chosen to be $0.01$ following [38]. The step size can be determined using Algorithm 1, which is taken from [38] (Algorithm 4 therein), and the convergence of the algorithm is guaranteed according to [38]. The stopping criterion we use here is $|F^{\sigma}(U_{m}^{t+1})-F^{\sigma}(U_{m}^{t})|/|F^{\sigma}(U_{m}^{t+1})-F^{\sigma}(U_{m}^{0})|<\epsilon$, where the initialization $U_{m}^{0}$ is set to the result of the previous iteration of the alternating update over all $\{U_{m}\}_{m=1}^{M}$.
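A simplified projected gradient loop with backtracking is sketched below. It is not the exact Algorithm 4 of [38]: the sufficient-decrease test $F(U^{t+1})-F(U^{t})\leq\kappa\langle\nabla F(U^{t}),U^{t+1}-U^{t}\rangle$, the shrink factor `beta`, and the function name `projected_gradient` are assumptions chosen to illustrate the projection $\pi[\cdot]$ and the relative-change stopping rule on a toy non-negative least squares problem.

```python
import numpy as np

def projected_gradient(f, grad, U0, kappa=0.01, beta=0.1, eps=1e-6, max_iter=200):
    """Minimize f over entrywise non-negative U (simplified backtracking PGM)."""
    U = np.maximum(U0, 0.0)
    f_init = f_old = f(U)
    for _ in range(max_iter):
        g, mu = grad(U), 1.0
        while True:
            U_new = np.maximum(U - mu * g, 0.0)       # pi[.]: negative entries -> 0
            f_new = f(U_new)
            # assumed sufficient-decrease test; give up shrinking for tiny mu
            if f_new - f_old <= kappa * np.sum(g * (U_new - U)) or mu < 1e-12:
                break
            mu *= beta
        # relative-change stopping rule, written without a division
        done = abs(f_new - f_old) <= eps * abs(f_new - f_init)
        U, f_old = U_new, f_new
        if done:
            break
    return U

# Toy objective (assumed): ||U - T||_F^2, whose constrained minimizer is max(T, 0).
T = np.array([[1.0, -2.0], [0.5, 3.0]])
U_star = projected_gradient(lambda U: np.sum((U - T) ** 2),
                            lambda U: 2.0 * (U - T),
                            U0=np.ones_like(T), eps=1e-10)
```

In the HMTML setting, `f` and `grad` would be the smoothed objective $F^{\sigma}(U_{m})$ and its gradient, with all other $U_{m^{\prime}}$ held fixed.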

Finally, the solutions of (11) are obtained by alternately updating each $U_{m}$ using Algorithm 1 until the stopping criterion $|OBJ_{k+1}-OBJ_{k}|/|OBJ_{k}|<\epsilon$ is reached, where $OBJ_{k}$ is the objective value of (11) in the $k$'th iteration step. The objective value of (12) does not increase at each step of the alternating procedure, i.e., $F(U_{m}^{k+1},\{U_{m^{\prime}}^{k}\}_{m^{\prime}\neq m})\leq F(\{U_{m}^{k}\})$, which implies $F(\{U_{m}^{k+1}\})\leq F(\{U_{m}^{k}\})$. Since the objective is non-negative and hence bounded below, the sequence of objective values converges, and the convergence of the proposed HMTML algorithm is guaranteed. Once the solutions $\{U_{m}^{\ast}\}_{m=1}^{M}$ have been obtained, we can conduct subsequent learning, such as multi-class classification in each domain using the learned metric $A_{m}^{\ast}=U_{m}^{\ast}{U_{m}^{\ast}}^{T}$. More implementation details can be found in the supplementary material.
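The matricization identity $G_{(m)}=U_{m}B_{(m)}$ that underlies each sub-problem, together with the invariance of the Frobenius norm under unfolding, can be checked numerically. The sketch below uses three toy domains and one particular unfolding convention (move mode $m$ to the front, then flatten); the helper `unfold` and all sizes are assumptions. The identity holds for any convention as long as the same one is applied to $\mathcal{G}$, $\mathcal{B}$, and $\mathcal{W}^{p}$.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows index the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

rng = np.random.default_rng(1)
U1, U2, U3 = (rng.random((d, 2)) for d in (5, 4, 3))   # r = 2 common factors

# G = E_r x_1 U1 x_2 U2 x_3 U3 with an identity (super-diagonal) core.
G = np.einsum("ia,ja,ka->ijk", U1, U2, U3)
# B leaves the second mode untransformed, so that mode has size r.
B = np.einsum("ia,ka->iak", U1, U3)

lhs = unfold(G, 1)            # G_(2), shape (4, 15)
rhs = U2 @ unfold(B, 1)       # U_2 B_(2), shape (4, 2) @ (2, 15)

W = rng.normal(size=G.shape)  # stand-in for a covariance tensor W^p
norm_tensor = np.sum((W - G) ** 2)
norm_unfolded = np.sum((unfold(W, 1) - rhs) ** 2)
```

Because the sub-problem only sees $B_{(m)}$ through products with $U_{m}$, each alternating step reduces to a matrix problem of size $d_{m}\times r$.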

III-D Complexity analysis

To analyze the time complexity of the proposed HMTML algorithm, we first present the computational cost of optimizing each $U_{m}$ as described in Algorithm 1. In each iteration of Algorithm 1, we must first determine the descent direction according to the gradient calculated using (18). Then an appropriate step size is obtained by exhaustively checking whether the condition (20) is satisfied. In each check, we need to calculate the updated objective value $F^{\sigma}(U_{m}^{t+1})$. The main cost of the objective value calculation is spent on $\sum_{p=1}^{P}\|W_{(m)}^{p}-U_{m}B_{(m)}\|_{F}^{2}$, which can be rewritten as $\mathrm{tr}\big(P\,U_{m}^{T}U_{m}B_{(m)}B_{(m)}^{T}-2\sum_{p=1}^{P}(W_{(m)}^{p}B_{(m)}^{T})U_{m}^{T}\big)$, where the constant part $\mathrm{tr}(\sum_{p=1}^{P}(W_{(m)}^{p})^{T}W_{(m)}^{p})$ is omitted. To accelerate computation, we pre-calculate $B_{(m)}B_{(m)}^{T}$ and $(\sum_{p}W_{(m)}^{p})B_{(m)}^{T}$, which are independent of $U_{m}$; their time costs are $O(r^{2}\prod_{m^{\prime}\neq m}d_{m^{\prime}})$ and $O(r\prod_{m=1}^{M}d_{m})$ respectively. After the pre-calculation, the time complexities of calculating $(U_{m}^{T}U_{m})(B_{(m)}B_{(m)}^{T})$ and $\sum_{p=1}^{P}(W_{(m)}^{p}B_{(m)}^{T})U_{m}^{T}$ become $O(r^{2}d_{m}+r^{2.807})$ and $O(r^{2}d_{m})$ respectively, where the complexity of $O(r^{2.807})$ comes from the multiplication of two square matrices of size $r$ using the Strassen algorithm [41]. It is easy to derive that the computational cost of the remaining parts of the objective function is $O(rd_{m}N_{m}^{\prime})$. Hence, the time cost of calculating the objective value is $O(rd_{m}N_{m}^{\prime}+r^{2}d_{m}+r^{2.807})$ after the pre-calculation.
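The pre-computation trick can be verified on random data. The sketch below expands $\sum_{p=1}^{P}\|W^{p}-UB\|_{F}^{2}=P\,\mathrm{tr}(U^{T}UBB^{T})-2\,\mathrm{tr}\big((\sum_{p}W^{p})B^{T}U^{T}\big)+\sum_{p}\|W^{p}\|_{F}^{2}$ and compares it with the direct evaluation; note the factor $P$ on the quadratic term, which arises because that term does not depend on $p$. Variable names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, r, n = 4, 6, 3, 20             # n plays the role of prod_{m' != m} d_{m'}
U = rng.random((d, r))
B = rng.random((r, n))
Ws = rng.normal(size=(P, d, n))      # matricized covariance tensors W^p_(m)

# Quantities that do not depend on U and can be cached once per outer step.
BBt = B @ B.T                        # r x r
WBt = Ws.sum(axis=0) @ B.T           # d x r
const = np.sum(Ws ** 2)              # sum_p ||W^p||_F^2

# tr(X Y) for symmetric X, Y equals the sum of their entrywise product.
fast = P * np.sum((U.T @ U) * BBt) - 2.0 * np.sum(WBt * U) + const
direct = sum(np.sum((W - U @ B) ** 2) for W in Ws)
```

After caching, each evaluation touches only $d\times r$ and $r\times r$ matrices, which is what removes the dependence on $\prod_{m^{\prime}\neq m}d_{m^{\prime}}$ from the inner loop.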

Similarly, the main cost of gradient calculation is spent on $\sum_{p}(U_{m}B_{(m)}B_{(m)}^{T}-W_{(m)}^{p}B_{(m)}^{T})$, and after pre-calculating $B_{(m)}B_{(m)}^{T}$ and $(\sum_{p}W_{(m)}^{p})B_{(m)}^{T}$, we can derive that the time cost of calculating the gradient is $O(rd_{m}N_{m}^{\prime}+r^{2}d_{m})$. Therefore, the computational cost of optimizing $U_{m}$ is $O[r\prod_{m^{\prime}\neq m}d_{m^{\prime}}(r+d_{m})+T_{2}((rd_{m}N_{m}^{\prime}+r^{2}d_{m})+T_{1}(rd_{m}N_{m}^{\prime}+r^{2}d_{m}+r^{2.807}))]$, where $T_{1}$ is the number of checks needed to find the step size, and $T_{2}$ is the number of iterations needed to reach the stopping criterion. Considering that the optimal rank $r\ll d_{m}$, we can simplify the cost as $O[r\prod_{m=1}^{M}d_{m}+T_{2}T_{1}(rd_{m}N_{m}^{\prime}+r^{2}d_{m})]$. Finally, supposing the number of iterations for alternately updating all $\{U_{m}\}_{m=1}^{M}$ is $\Gamma$, we obtain the time complexity of the proposed HMTML, i.e., $O(\Gamma M[r\prod_{m=1}^{M}d_{m}+T_{2}T_{1}(r\bar{d}_{m}\bar{N}_{m}^{\prime}+r^{2}\bar{d}_{m})])$, where $\bar{N}_{m}^{\prime}$ and $\bar{d}_{m}$ are the average number of sample pairs and the average feature dimension over all domains respectively. This is linear w.r.t. $M$ and $\prod_{m=1}^{M}d_{m}$, and quadratic in $r$ and $\bar{N}_{m}$ ($\bar{N}_{m}^{\prime}$ is quadratic w.r.t. $\bar{N}_{m}$). Besides, it is common that $\Gamma<10$, $T_{2}<20$, and $T_{1}<50$, so the overall cost remains moderate in practice.

IV Experiments

In this section, the effectiveness and efficiency of the proposed HMTML are verified with experiments in various applications, i.e., document categorization, scene classification, and image annotation. In the following, we first present the datasets to be used and experimental setups.

IV-A Datasets and features

Document categorization: We adopt the Reuters multilingual collection (RMLC) [42] dataset with six populous categories. The news articles in this dataset are translated into five languages. TF-IDF features are provided to represent each document. In our experiments, three languages (i.e., English, Spanish, and Italian) are chosen and each language is regarded as one domain. The provided representations are preprocessed by applying principal component analysis (PCA). After the preprocessing, $20\%$ of the energy is preserved. The resulting dimensions of the document representations are $245$, $213$, and $107$ respectively for the three different domains. The data preprocessing is mainly conducted to: 1) find the high-level patterns which are comparable for transfer; 2) avoid overfitting; 3) speed up the computation during the experiments. The main reasons for preserving only $20\%$ of the energy are: 1) the amount of energy preserved does not affect the comparison between the proposed approach and the other methods; 2) a high feature dimension in each domain may lead to overfitting. There are $18758$, $24039$, and $12342$ instances in the three different domains respectively. In each domain, the sizes of the training and test sets are the same.

Scene classification: We employ a popular natural scene dataset: Scene-15 [43]. The dataset consists of $4,585$ images that belong to $15$ categories. The features we use are the popular global feature GIST [44], local binary pattern (LBP) [45], and pyramid histogram of oriented gradient (PHOG) [46], whose dimensions are $20$, $59$, and $40$ respectively. A detailed description of how to extract these features can be found in [47]. Since the images usually lie in a nonlinear feature space, we preprocess the different features using kernel PCA (KPCA) [48]. The resulting dimensions are the same as those of the original features.

Image annotation: A challenging dataset, NUS-WIDE (NUS) [49], which consists of $269,648$ natural images, is adopted. We conduct experiments on a subset of $16,519$ images with $12$ animal concepts: bear, bird, cat, cow, dog, elk, fish, fox, horse, tiger, whale, and zebra. In this subset, three kinds of features are extracted for each image: $500$-D bag of visual words [43] based on SIFT descriptors [8], $144$-D color auto-correlogram, and $128$-D wavelet texture. We refer to [49] for a detailed description of the features. Similar to the scene classification, KPCA is adopted for preprocessing, and the resulting dimensions are all $100$.

In both scene classification and image annotation, we treat each feature space as a domain. In each domain, half of the samples are used for training, and the rest for testing.

IV-B Experimental setup

The methods included for comparison are:

•

EU: directly using the simple Euclidean metric and original feature representations to compute the distance between samples.

•

LMNN [4]: the large margin nearest neighbor DML algorithm presented in [4]. Distance metrics for different domains are learned separately. The number of local target neighbors is chosen from the set $\{1,2,\ldots,15\}$.

•

ITML [13]: the information-theoretic metric learning algorithm presented in [13]. The trade-off hyper-parameter is tuned over the set $\{10^{i}|i=-5,-4,\ldots,4\}$ .

•

RDML [14]: an efficient and competitive distance metric learning algorithm presented in [14]. The trade-off hyper-parameter is tuned over the set $\{10^{i}|i=-5,-4,\ldots,4\}$.

•

DAMA [10]: constructing mappings $\{U_{m}\}$ to link multiple heterogeneous domains using manifold alignment. The hyper-parameter is determined according to the strategy presented in [10].

•

MTDA [12]: applying supervised dimension reduction to heterogeneous representations (domains) simultaneously, where a multi-task extension of linear discriminant analysis is developed. The learned transformation $U_{m}=W_{m}H$ consists of a domain specific part $W_{m}\in\mathbb{R}^{d_{m}\times d^{\prime}}$ , and a common part $H\in\mathbb{R}^{d^{\prime}\times r}$ shared by all domains. Here, $d^{\prime}>r$ is the intermediate dimensionality. As presented in [12], the shared structure $H$ represents “some characteristics of the application itself in the common lower-dimensional space”. The hyper-parameter $d^{\prime}$ is set as $100$ since the model is not very sensitive to the hyper-parameter according to [12].

•

HMTML: the proposed heterogeneous multi-task metric learning method. We adopt linear SVMs to obtain the weight vectors of the base classifiers for knowledge transfer. The hyper-parameters $\{\gamma_{m}\}$ are set to the same value since the different features have been normalized. Both $\gamma$ and $\gamma_{m}$ are optimized over the set $\{10^{i}|i=-5,-4,\ldots,4\}$. The number of base classifiers $P$ is not hard to determine. According to [50, 51], the code length is suggested to be $15\log C$ if the sparse random design coding technique in ECOC is used. From the experimental results shown in [11], the classification accuracy first increases with $P$ and may decrease when there are too many base classifiers. Since the optimal $P$ is achieved around (slightly larger than) $15\log C$, we empirically set $P=10\lceil 1.5\log C\rceil$ in our method.

The single domain metric learning algorithms (LMNN, ITML, and RDML) only utilize the limited labeled samples in each domain, and do not make use of any additional information from other domains. In DAMA and MTDA, after learning $U_{m}$ , we derive the metric for each domain as $A_{m}=U_{m}U_{m}^{T}$ .

The task in each domain is to perform multi-class classification. The penalty hyper-parameters of the base SVM classifiers are empirically set to $1$. For all compared methods, any type of classifier can be utilized for the final classification in each domain after learning the distance metrics for all domains. In the distance metric learning literature, it is common to use the nearest neighbour ($1$NN) classifier to directly evaluate the performance of the learned metric [6, 7, 52], so we mainly adopt it in this paper. Some results of using more sophisticated SVM classifiers can be found in the supplementary material.
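Evaluating a learned metric with $1$NN can be sketched as follows. Since $A_{m}=U_{m}U_{m}^{T}$, the squared distance $(\mathbf{x}-\mathbf{x}^{\prime})^{T}A_{m}(\mathbf{x}-\mathbf{x}^{\prime})$ equals the squared Euclidean distance between $U_{m}^{T}\mathbf{x}$ and $U_{m}^{T}\mathbf{x}^{\prime}$, so one can project once and search in the $r$-dimensional space. The function name `one_nn_predict` and the toy data are assumptions.

```python
import numpy as np

def one_nn_predict(U, X_train, y_train, X_test):
    """1NN under the metric A = U U^T, computed in the projected space."""
    Z_train, Z_test = X_train @ U, X_test @ U      # rows are U^T x
    diff = Z_test[:, None, :] - Z_train[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)                 # pairwise squared distances
    return y_train[np.argmin(d2, axis=1)]

# Toy data (assumed): three training points with distinct labels.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0]])
y_train = np.array([0, 1, 2])
X_test = np.array([[0.9, 0.1], [0.1, 4.0]])

pred_euclid = one_nn_predict(np.eye(2), X_train, y_train, X_test)
# A rank-one factor that ignores the second feature changes the neighbours.
pred_rank1 = one_nn_predict(np.array([[1.0], [0.0]]), X_train, y_train, X_test)
```

The second call shows how a low-rank $U$ changes which training point is nearest, which is the effect the learned metric is meant to exploit.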

In the KPCA preprocessing, we adopt the Gaussian kernel, i.e., $k(\mathbf{x}_{i},\mathbf{x}_{j})=\mathrm{exp}(-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}/(2\omega^{2}))$ , where the hyper-parameter $\omega$ is empirically set as the mean of the distances between all training sample pairs.
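The kernel construction can be sketched in a few lines. The function name `gaussian_kernel` is hypothetical, and "all training sample pairs" is interpreted here as all distinct pairs $i<j$, which is an assumption; including self-pairs would slightly lower $\omega$.

```python
import numpy as np

def gaussian_kernel(X):
    """Gaussian kernel matrix with omega set to the mean pairwise distance."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    i, j = np.triu_indices(len(X), k=1)
    omega = np.sqrt(d2[i, j]).mean()          # mean distance over distinct pairs
    return np.exp(-d2 / (2.0 * omega ** 2)), omega

X = np.random.default_rng(0).normal(size=(8, 3))   # toy training samples
K, omega = gaussian_kernel(X)
```

Tying $\omega$ to the data scale in this way keeps the kernel values away from the degenerate regimes where all entries are close to $0$ or close to $1$.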

In all experiments given below, we adopt the classification accuracy and macroF1 [53] score to evaluate the performance. The average performance of all domains is calculated for comparison. The experiments are run ten times by randomly choosing different sets of labeled instances. Both the mean values and standard deviations are reported.
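For reference, the two evaluation criteria can be computed as in the sketch below. The function name `macro_f1` is hypothetical, and assigning an F1 of zero to a class that has neither true nor predicted instances is an assumption; [53] may handle that corner case differently.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)  # assumed convention
    return float(np.mean(scores))

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
accuracy = float(np.mean(y_true == y_pred))
score = macro_f1(y_true, y_pred, n_classes=2)
```

Unlike accuracy, macroF1 weights every class equally, so it is more sensitive to poor performance on rare categories.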

If unspecified, the hyper-parameters are determined using leave-one-out cross validation on the labeled set. For example, when the number of labeled samples is $5$ for each class in each domain, one sample is chosen for model evaluation and the remaining four samples are used for model training. This leads to $C$ test examples and $4C$ training examples in each domain, where $C$ is the number of classes. This procedure is repeated five times by regarding each of the five labeled samples as the test example in turn. The classification accuracies of all domains from all five runs are averaged for hyper-parameter determination. The choice of an optimal number of common factors for multiple heterogeneous features is still an open problem [54], and we do not study it in this paper. Therefore, for DAMA, MTDA, and the proposed HMTML, we do not tune the hyper-parameter $r$, which is the number of common factors (or dimensionality of the common subspace) used to explain the original data of all domains. The performance comparisons are performed on a set of varied $r\in\{1,2,5,8,10,20,30,50,80,100\}$. The maximum value of $r$ is $20$ in scene classification since the lowest feature dimension is $20$.
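The validation protocol described above can be sketched as a fold generator. The function name `leave_one_per_class_folds` is hypothetical, and the sketch assumes every class has exactly `n_per_class` labeled samples; fold $f$ holds out the $f$'th labeled sample of every class.

```python
import numpy as np

def leave_one_per_class_folds(y, n_per_class):
    """Yield (train_idx, val_idx); each fold holds out one sample per class."""
    per_class = [np.flatnonzero(y == c) for c in np.unique(y)]
    for f in range(n_per_class):
        val = np.array([idx[f] for idx in per_class])
        train = np.setdiff1d(np.arange(len(y)), val)
        yield train, val

# Toy labels (assumed): C = 3 classes with 5 labeled samples each.
y = np.repeat(np.arange(3), 5)
folds = list(leave_one_per_class_folds(y, n_per_class=5))
```

With $C=3$ and five samples per class, each of the five folds has $C=3$ validation samples and $4C=12$ training samples, matching the counts given in the text.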

IV-C Overall classification results

Document categorization: In the training set, the labeled instances for each category are chosen randomly, and their number varies in the set $\{5,10,15\}$. This is used to evaluate the performance of all compared approaches w.r.t. the number of labeled samples. The results of selecting more labeled samples are given in the supplementary material. Fig. 2 and Fig. 3 show the accuracies and macroF1 values respectively for different $r$. The performance of all compared approaches at their best numbers (of common factors) can be found in Table I. From these results, we can draw several conclusions: 1) when the number of labeled samples increases, the performance of all approaches improves; 2) although only a limited number of labeled samples is given in each domain, ITML and RDML can greatly improve the performance. This indicates that distance metric learning (DML) is an effective tool in this application; 3) the performance of all heterogeneous transfer learning methods (DAMA, MTDA, and HMTML) is much better than that of the single-domain DML algorithms (LMNN, ITML, and RDML). This indicates that it is useful to leverage information from other domains in DML. In addition, it can be seen from the results that the optimal number $r$ is often smaller than $30$. Hence we may only need $30$ factors to distinguish the different categories in this dataset; 4) although effective and discriminative, MTDA depends heavily on label information. Therefore, when the labeled samples are scarce, MTDA is worse than DAMA. The topology preservation in DAMA is helpful when the labeled samples are insufficient; 5) the proposed HMTML also relies heavily on the label information to learn the base classifiers, which are used to connect different domains. If the labeled samples are scarce, some learned classifiers may be unreliable. However, owing to the task construction strategy presented in [11], robust transformations can be obtained even if some of the learned classifiers are incorrect or inaccurate. Therefore, our method outperforms DAMA even when the labeled instances are scarce; 6) overall, the developed HMTML is superior to both DAMA and MTDA at most numbers (of common factors). A plausible interpretation is that the factors learned by our method are more expressive than those of the other compared methods. The main reason is that the high-order statistics of all domains are examined, whereas only the pairwise correlations are exploited in DAMA and MTDA; 7) consistent performance has been achieved under different criteria (accuracy and macroF1 score). Specifically, compared with the competitive MTDA, the improvements of the proposed method are considerable. For example, when $5$, $10$, and $15$ labeled samples are utilized, the relative improvements are $10.0\%$, $7.1\%$, and $3.1\%$ respectively under the macroF1 criterion.

In addition, we conduct experiments on more domains to verify that the proposed method is able to handle an arbitrary number of heterogeneous domains. See the supplementary material for the results.

Scene classification: The number of labeled examples for each category varies in the set $\{4,6,8\}$. The performance w.r.t. the number $r$ is shown in Fig. 4 and Fig. 5. The average performance is summarized in Table II. We can see from the results that: 1) when the number of labeled examples is $4$, both LMNN and RDML are only comparable to directly using the Euclidean distance (EU). ITML and DAMA are only slightly better than the EU baseline. This may be because the learned metrics are linear, while the images usually lie in a nonlinear feature space. Although the KPCA preprocessing can help to exploit the nonlinearity to some extent, it is not optimized w.r.t. the metric; 2) compared with the EU baseline, the heterogeneous transfer learning approaches do not improve as much as in document categorization. A likely explanation is that the different domains correspond to different types of representations in scene classification. This is much more challenging than the classification of multilingual documents, where the same kind of feature (TF-IDF) with various vocabularies is adopted. In this application, the statistical properties of the various types of visual representations differ greatly from each other. Therefore, the common factors shared by these different domains are difficult to discover by only exploring the pairwise correlations between them. However, much better performance can be obtained by the presented HMTML since the correlations of all domains are exploited simultaneously, especially when a small number of labeled samples is given. Thus the superiority of our method in alleviating the labeled data deficiency issue is verified.

Image annotation: The number of labeled instances for each concept varies in the set $\{4,6,8\}$. We show the annotation accuracies and macroF1 scores of the compared methods in Fig. 6 and Fig. 7 respectively, and summarize the results at their best numbers (of common factors) in Table III. It can be observed from the results that: 1) none of the single-domain metric learning algorithms performs well. This is mainly due to the challenging setting in which each kind of feature is regarded as a domain. Besides, in this application, the different animal concepts are hard to distinguish since they have high inter-class similarity (e.g., “cat” is similar to “tiger”) or large intra-class variability (such as “bird”). More labeled instances are necessary for these algorithms to work, and we refer to the supplementary material for demonstration; 2) DAMA is comparable to the EU baseline, and MTDA only obtains satisfactory accuracies when enough (e.g., $8$) labeled instances are provided. In contrast, the proposed HMTML still outperforms the other approaches in most cases, and the tendencies of the macroF1 score curves are the same as those of the accuracy curves.

IV-D Further empirical study

In this section, we conduct some further empirical studies on the RMLC dataset. More results can be found in the supplementary material.

IV-D1 An investigation of the performance for the individual domains

Fig. 8 compares all different approaches on the individual domains. From the results, we can conclude that: 1) RDML improves the performance in each domain, and the improvements are similar for different domains, since they do not communicate with each other. In contrast, the transfer learning approaches improve much more than RDML in the domains where less discriminative information is contained in the original representations, such as IT (Italian) and EN (English). These results validate the successful transfer of knowledge among different domains, which is helpful in metric learning; 2) in the well-performing SP (Spanish) domain, the results of DAMA and MTDA are almost the same as those of RDML, and sometimes even a little worse. This phenomenon indicates that in DAMA and MTDA, the discriminative domain can benefit little from the relatively non-discriminative domains. In contrast, the proposed HMTML still achieves significant improvements in this case. This demonstrates that the high-order correlation information between all domains is well discovered. Exploring this kind of information is much better than only exploring the correlation information between pairs of domains (as in DAMA and MTDA).

IV-D2 A self-comparison analysis

To see how the different design choices affect the performance, we conduct a self-comparison of the proposed HMTML. In particular, we compare HMTML with the following sub-models:

•

HMTML (loss=0): dropping the log-loss term in (11), and this amounts to solving just the optimization problem described in (8).

•

HMTML (reg=0): setting the hyper-parameter $\gamma=0$ in (11). Thus the sparsity of the factors is encouraged, whereas the base classifiers are not used.

•

HMTML (F-norm): replacing the $l_{1}$ -norm with the Frobenius norm in (11). The optimization becomes easier since the gradient of the Frobenius norm can be directly obtained without smoothing.

•

HMTML (no constr.): dropping the non-negative constraints in (11). That is, the operator $\pi[x]$ in (19) is not performed in the optimization.

The results are shown in Fig. 9, where $10$ labeled instances are selected for each category. From the results, we can see that solving only the proposed tensor-based divergence minimization problem (8) can achieve satisfactory performance when the number of common factors is large enough. Both the accuracy and macroF1 score at the best number of common factors are higher than those of HMTML (reg=0), which jointly minimizes the empirical losses of all domains without minimizing their divergence in the subspace. HMTML (reg=0) is better when the number of common factors is small, and its performance is steadier. Therefore, both the log-loss and divergence minimization terms are indispensable in the proposed model. HMTML (F-norm) is worse than the proposed HMTML that adopts the $l_{1}$-norm. This demonstrates the benefits of enforcing sparsity in metric learning, as suggested in the literature [55, 56, 57]. HMTML (no constr.) is worse than HMTML at most dimensions, although their best performance is comparable. This may be because the non-negative constraints help to narrow the hypothesis space and thus control the model complexity.

IV-D3 Sensitivity analysis w.r.t. different initializations and optimization orders

We can only obtain a local minimum since the proposed formulation (11) is not jointly convex with respect to all the parameters. However, the proposed method can achieve satisfactory performance using only random initializations. Fig. 10 shows that the model is insensitive to different initializations of the parameter set $\{U_{m}\}$ and demonstrates the effectiveness of the obtained locally optimal solution.

In addition, Fig. 10 shows that the proposed model is also insensitive to the order in which the parameters $\{U_{m}\}$ are optimized, i.e., the optimization order does not noticeably affect the performance.

IV-D4 Empirical analysis of the computational complexity

The training time of different approaches can be found in Fig. 10. All the experiments are performed on a computer with the following specifications: $2.9$ GHz Intel Xeon (8 cores) with Matlab R2014b. The results indicate that the proposed HMTML is comparable to ITML when $r$ (number of common factors) is small, and slower than all other approaches when many common factors are required. The cost of HMTML is not strictly quadratic w.r.t. $r$ because the number of iterations needed for algorithm termination changes (for different $r$). Since the optimal number $r$ is usually not very large, the time cost of the proposed method is tolerable. It should be noted that the test time for different methods is almost the same because the sizes of the obtained metrics are the same.

V Conclusion

A novel approach for heterogeneous multi-task metric learning (HMTML) is proposed in this paper. In HMTML, we calculate the prediction weight covariance tensor of multiple heterogeneous domains. Then the high-order statistics among these different domains are discovered by analyzing the tensor. The proposed HMTML can successfully transfer the knowledge between different domains by finding a common subspace, and make them help each other in metric learning. In this subspace, the high-order correlations between all domains are exploited. An efficient optimization algorithm is developed for finding the solutions. It is demonstrated empirically that exploiting the high-order correlation information achieves better performance than the pairwise relationship exploration in traditional approaches.

It can be concluded from the experimental results on three challenging datasets that: 1) when the labeled data is scarce, learning the metric for each individual domain may deteriorate the performance. This issue can be alleviated by jointly learning the metrics for multiple heterogeneous domains. This is in line with the multi-task learning literature; 2) the shared knowledge of different heterogeneous domains can be exploited by heterogeneous transfer learning approaches. Each domain can benefit from such knowledge if the discovered common factors are appropriate. We show that it is particularly helpful to utilize the high-order statistics (correlation information) in discovering such factors.

The proposed algorithm is particularly suitable for the case when we want to improve the performance of multiple heterogeneous domains, where the labeled samples are scarce for each of them. Because the learned transformation is linear for each domain, the proposed method is very effective for text analysis based applications (such as document categorization), where the structure of the data distribution is usually linear. For the image data, we suggest adding a kernel PCA (KPCA) preprocessing, which may help improve the performance to some extent.

The relatively high computational cost may be the main drawback of the proposed HMTML compared with other heterogeneous transfer learning approaches. In the future, we will develop more efficient optimization algorithms; for example, parallel computing techniques will be introduced to accelerate the optimization. In addition, to deal with complicated domains, we intend to extend the presented HMTML to learn nonlinear metrics.
