Robust Mahalanobis Metric Learning via Geometric Approximation Algorithms

Diego Ihara; Neshat Mohammadi; Francesco Sgherzi; Anastasios Sidiropoulos

arXiv:1905.09989·cs.LG·March 3, 2020

TL;DR

This paper introduces a fast, parallelizable algorithm for robust Mahalanobis metric learning that effectively handles adversarial label noise, with theoretical guarantees and practical improvements demonstrated on various datasets.

Contribution

It presents a fully polynomial-time approximation scheme for robust Mahalanobis metric learning, leveraging linear programming tools and ensuring near-optimal performance despite adversarial label corruption.

Findings

01

Algorithm is nearly-linear time and fully parallelizable.

02

Effective in recovering metrics with adversarial label noise.

03

Experimental results show robustness on real, synthetic, and poisoned data.

Abstract

Learning Mahalanobis metric spaces is an important problem that has found numerous applications. Several algorithms have been designed for this problem, including Information Theoretic Metric Learning (ITML) [Davis et al. 2007] and Large Margin Nearest Neighbor (LMNN) classification [Weinberger and Saul 2009]. We study the problem of learning a Mahalanobis metric space in the presence of adversarial label noise. To that end, we consider a formulation of Mahalanobis metric learning as an optimization problem, where the objective is to minimize the number of violated similarity/dissimilarity constraints. We show that for any fixed ambient dimension, there exists a fully polynomial-time approximation scheme (FPTAS) with nearly-linear running time. This result is obtained using tools from the theory of linear programming in low dimensions. As a consequence, we obtain a fully-parallelizable…

Tables

Table 1: Average accuracy and standard deviation over 50 executions of ITML, LMNN and LPTML.

Data set	ITML	LMNN	LPTML ($t=2000$)
Iris	$0.96 \pm 0.01$	$0.96 \pm 0.02$	$0.94 \pm 0.04$
Soybean	$0.95 \pm 0.04$	$0.96 \pm 0.04$	$0.90 \pm 0.05$
Synthetic	$0.97 \pm 0.02$	$1.00 \pm 0.00$	$1.00 \pm 0.00$

Equations

$\rho(f(x),f(y))\leq u$

$\rho(f(x),f(y))\geq\ell$

$\|\mathbf{G}p-\mathbf{G}q\|_{2}\leq u$ (1)

$\|\mathbf{G}p-\mathbf{G}q\|_{2}\geq\ell$ (2)

$(p-q)^{T}\mathbf{A}(p-q)\leq u^{2}$ (3)

$(p-q)^{T}\mathbf{A}(p-q)\geq\ell^{2}$ (4)

$w(F)=\begin{cases}\inf_{\mathbf{A}\in{\cal A}_{F}}r^{T}\mathbf{A}r&\text{if }{\cal A}_{F}\neq\emptyset\\ \infty&\text{if }{\cal A}_{F}=\emptyset\end{cases}$

${\cal E}_{\mathbf{A}}=\{v\in\mathbb{R}^{d}:\|\mathbf{A}v\|_{2}=1\}$

Taxonomy

Topics: Face and Expression Recognition · Machine Learning and Algorithms · Advanced Image and Video Retrieval Techniques

Full text

Robust Mahalanobis Metric Learning via Geometric Approximation Algorithms

Diego Ihara Centurion, Neshat Mohammadi, Francesco Sgherzi, Anastasios Sidiropoulos

Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607

Authors sorted in alphabetical order.

Abstract

Learning Mahalanobis metric spaces is an important problem that has found numerous applications. Several algorithms have been designed for this problem, including Information Theoretic Metric Learning ( $\mathsf{ITML}$ ) [Davis et al. 2007] and Large Margin Nearest Neighbor ( $\mathsf{LMNN}$ ) classification [Weinberger and Saul 2009]. We study the problem of learning a Mahalanobis metric space in the presence of adversarial label noise. To that end, we consider a formulation of Mahalanobis metric learning as an optimization problem, where the objective is to minimize the number of violated similarity/dissimilarity constraints. We show that for any fixed ambient dimension, there exists a fully polynomial-time approximation scheme (FPTAS) with nearly-linear running time. This result is obtained using tools from the theory of linear programming in low dimensions. As a consequence, we obtain a fully-parallelizable algorithm that recovers a nearly-optimal metric space, even when a small fraction of the labels is corrupted adversarially. We also discuss improvements of the algorithm in practice, and present experimental results on real-world, synthetic, and poisoned data sets.

1 Introduction

Learning metric spaces is a fundamental computational primitive that has found numerous applications and has received significant attention in the literature. We refer the reader to [Kulis et al.(2013), Li and Tian(2018)] for detailed exposition and discussion of previous work. At a high level, the input to a metric learning problem consists of some universe of objects $X$ , together with some similarity information on subsets of these objects. Here, we focus on pairwise similarity and dissimilarity constraints. Specifically, we are given ${\cal S},{\cal D}\subset{X\choose 2}$ , which are sets of pairs of objects that are labeled as similar and dissimilar respectively. We are also given some $u,\ell>0$ , and we seek to find a mapping $f:X\to Y$ , into some target metric space $(Y,\rho)$ , such that for all $\{x,y\}\in{\cal S}$ ,

$$\rho(f(x),f(y))\leq u,$$

and for all $\{x,y\}\in{\cal D}$ ,

$$\rho(f(x),f(y))\geq\ell.$$

In the case of Mahalanobis metric learning, we have $X\subset\mathbb{R}^{d}$ , with $|X|=n$ , for some $d\in\mathbb{N}$ , and the mapping $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is linear. Specifically, we seek to find a matrix $\mathbf{G}\in\mathbb{R}^{d\times d}$ , such that for all $\{p,q\}\in{\cal S}$ , we have

$$\|\mathbf{G}p-\mathbf{G}q\|_{2}\leq u, \tag{1}$$

and for all $\{p,q\}\in{\cal D}$ , we have

$$\|\mathbf{G}p-\mathbf{G}q\|_{2}\geq\ell. \tag{2}$$

1.1 Our Contribution

In general, there might not exist any $\mathbf{G}$ that satisfies all constraints of type (1) and (2). We are thus interested in finding a solution that minimizes the fraction of violated constraints, which corresponds to maximizing the accuracy of the mapping. We develop a $(1+\varepsilon)$ -approximation algorithm for the optimization problem of computing a Mahalanobis metric space of maximum accuracy, which runs in near-linear time for any fixed ambient dimension $d\in\mathbb{N}$ . This algorithm is obtained using tools from geometric approximation algorithms and the theory of linear programming in small dimension. The following summarizes our result.
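The accuracy objective can be made concrete with a small sketch. The helper names (`apply_linear`, `mapped_dist`, `accuracy`) and the toy pairs are illustrative assumptions and are not taken from the authors' implementation; the code only counts which constraints of the two types a given matrix satisfies.

```python
import math

def apply_linear(G, p):
    """Return the matrix-vector product G p for a list-of-rows matrix G."""
    return [sum(g_ij * p_j for g_ij, p_j in zip(row, p)) for row in G]

def mapped_dist(G, p, q):
    """Euclidean distance between G p and G q."""
    return math.dist(apply_linear(G, p), apply_linear(G, q))

def accuracy(G, similar, dissimilar, u, ell):
    """Fraction of satisfied constraints: similar pairs need distance <= u,
    dissimilar pairs need distance >= ell."""
    ok = sum(mapped_dist(G, p, q) <= u for p, q in similar)
    ok += sum(mapped_dist(G, p, q) >= ell for p, q in dissimilar)
    return ok / (len(similar) + len(dissimilar))

# Hypothetical toy instance in R^2.
similar = [((0.0, 0.0), (0.0, 4.0)), ((1.0, 0.0), (1.0, 2.0))]
dissimilar = [((0.0, 0.0), (2.0, 0.0))]
identity = [[1.0, 0.0], [0.0, 1.0]]
squash_y = [[1.0, 0.0], [0.0, 0.1]]  # contracts the y axis
```

With `u = 1.0` and `ell = 1.5`, the identity map satisfies only the dissimilar pair, while `squash_y` satisfies all three constraints.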

Theorem 1.1.

For any $d\in\mathbb{N}$ , $\varepsilon>0$ , there exists a randomized algorithm for learning $d$ -dimensional Mahalanobis metric spaces, which given an instance that admits a mapping with accuracy $r^{*}$ , computes a mapping with accuracy at least $r^{*}-\varepsilon$ , in time $d^{O(1)}n(\log{n}/\varepsilon)^{O(d)}$ , with high probability.

The above algorithm can be extended to handle various forms of regularization. We also propose several modifications of our algorithm that lead to significant performance improvements in practice. The final algorithm is evaluated experimentally on both synthetic and real-world data sets, and in a data poisoning scenario, and is compared against the currently best-known algorithms for the problem.

1.2 Related Work

Several algorithms for learning Mahalanobis metric spaces have been proposed. Notable examples include the SDP based algorithm of Xing et al. [Xing et al.(2003)Xing, Jordan, Russell, and Ng], the algorithm of Globerson and Roweis for the fully supervised setting [Globerson and Roweis(2006)], Information Theoretic Metric Learning ( $\mathsf{ITML}$ ) by Davis et al. [Davis et al.(2007)Davis, Kulis, Jain, Sra, and Dhillon], which casts the problem as a particular optimization minimizing LogDet divergence, as well as Large Margin Nearest Neighbor ( $\mathsf{LMNN}$ ) by Weinberger et al. [Weinberger et al.(2006)Weinberger, Blitzer, and Saul], which attempts to learn a metric geared towards optimizing $k$ -NN classification. We refer the reader to the surveys [Kulis et al.(2013), Li and Tian(2018)] for a detailed discussion of previous work. Our algorithm differs from previous approaches in that it seeks to directly minimize the number of violated pairwise distance constraints, which is a highly non-convex objective, without resorting to a convex relaxation of the corresponding optimization problem.

1.3 Organization

The rest of the paper is organized as follows. Section 2 describes the main algorithm and the proof of Theorem 1.1. Section 3 discusses practical improvements used in the implementation of the algorithm. Section 4 presents the experimental evaluation.

2 Mahalanobis Metric Learning as an LP-Type Problem

In this Section we present an approximation scheme for Mahalanobis metric learning in $d$ -dimensional Euclidean space, with nearly-linear running time. We begin by recalling some prior results on the class of LP-type problems, which generalizes linear programming. We then show that linear metric learning can be cast as an LP-type problem.

2.1 LP-type Problems

Let us recall the definition of an LP-type problem. Let ${\cal H}$ be a set of constraints, and let $w:2^{\cal H}\to\mathbb{R}\cup\{-\infty,+\infty\}$ , such that for any $G\subset{\cal H}$ , $w(G)$ is the value of the optimal solution of the instance defined by $G$ . We say that $({\cal H},w)$ defines an LP-type problem if the following axioms hold:

(A1) Monotonicity. For any $F\subseteq G\subseteq{\cal H}$ , we have $w(F)\leq w(G)$ .

(A2) Locality. For any $F\subseteq G\subseteq{\cal H}$ , with $-\infty<w(F)=w(G)$ , and any $h\in{\cal H}$ , if $w(G)<w(G\cup\{h\})$ , then $w(F)<w(F\cup\{h\})$ .

More generally, we say that $({\cal H},w)$ defines an LP-type problem on some ${\cal H}^{\prime}\subseteq{\cal H}$ , when conditions (A1) and (A2) hold for all $F\subseteq G\subseteq{\cal H}^{\prime}$ .

A subset $B\subseteq{\cal H}$ is called a basis if $w(B)>-\infty$ and $w(B^{\prime})<w(B)$ for any proper subset $B^{\prime}\subsetneq B$ . A basic operation is defined to be one of the following:

(B0) Initial basis computation. Given some $G\subseteq{\cal H}$ , compute any basis for $G$ .

(B1) Violation test. For some $h\in{\cal H}$ and some basis $B\subseteq{\cal H}$ , test whether $w(B\cup\{h\})>w(B)$ (in other words, whether $B$ violates $h$ ).

(B2) Basis computation. For some $h\in{\cal H}$ and some basis $B\subseteq{\cal H}$ , compute a basis of $B\cup\{h\}$ .
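To make the axioms concrete, the following sketch checks them by brute force on a much simpler LP-type problem than the one studied here: the smallest interval enclosing a set of points on the line. The problem choice and the names `w`, `is_lp_type` and `is_basis` are assumptions made for illustration only.

```python
from itertools import combinations

NEG_INF = float("-inf")

def w(points):
    """Optimal value: width of the smallest interval containing the points."""
    return max(points) - min(points) if points else NEG_INF

def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def is_lp_type(H):
    """Brute-force check of monotonicity (A1) and locality (A2) on H."""
    all_sets = list(subsets(H))
    for G in all_sets:
        for F in all_sets:
            if not F <= G:
                continue
            if w(F) > w(G):  # (A1) fails
                return False
            if w(F) == w(G) and w(F) > NEG_INF:
                for h in H:
                    if w(G) < w(G | {h}) and not w(F) < w(F | {h}):
                        return False  # (A2) fails
    return True

def is_basis(B):
    """B is a basis if w(B) > -inf and every proper subset has smaller value."""
    B = frozenset(B)
    return w(B) > NEG_INF and all(w(S) < w(B) for S in subsets(B) if S != B)
```

In this toy problem a basis has at most two elements (the leftmost and rightmost points), so its combinatorial dimension is 2.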

2.2 An LP-type Formulation

We now show that learning Mahalanobis metric spaces can be expressed as an LP-type problem. We first note that we can rewrite (1) and (2) as

$$(p-q)^{T}\mathbf{A}(p-q)\leq u^{2}, \tag{3}$$

and

$$(p-q)^{T}\mathbf{A}(p-q)\geq\ell^{2}, \tag{4}$$

where $\mathbf{A}=\mathbf{G}^{T}\mathbf{G}$ is positive semidefinite.
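The rewriting step follows from expanding the squared norm. The short derivation below is included for convenience; it uses only the definition $\mathbf{A}=\mathbf{G}^{T}\mathbf{G}$ and the assumption $u,\ell>0$ stated in the Introduction.

```latex
\begin{align*}
\|\mathbf{G}p-\mathbf{G}q\|_{2}^{2}
  &= \bigl(\mathbf{G}(p-q)\bigr)^{T}\bigl(\mathbf{G}(p-q)\bigr) \\
  &= (p-q)^{T}\mathbf{G}^{T}\mathbf{G}\,(p-q) \\
  &= (p-q)^{T}\mathbf{A}\,(p-q),
\end{align*}
so that, since $u,\ell>0$,
\begin{align*}
\|\mathbf{G}p-\mathbf{G}q\|_{2}\leq u
  &\iff (p-q)^{T}\mathbf{A}\,(p-q)\leq u^{2}, \\
\|\mathbf{G}p-\mathbf{G}q\|_{2}\geq \ell
  &\iff (p-q)^{T}\mathbf{A}\,(p-q)\geq \ell^{2}.
\end{align*}
```

Both quadratic forms are linear in the entries of $\mathbf{A}$, so each constraint describes a half-space in the space of matrices.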

We define ${\cal H}=\{0,1\}\times{\mathbb{R}^{d}\choose 2}$ , where for each $(0,\{p,q\})\in{\cal H}$ , we have a constraint of type (3), and for every $(1,\{p,q\})\in{\cal H}$ , we have a constraint of type (4). Therefore, for any set of constraints $F\subseteq{\cal H}$ , we may associate the set of feasible solutions for $F$ with the set ${\cal A}_{F}$ of all positive semidefinite matrices $\mathbf{A}\in\mathbb{R}^{d\times d}$ , satisfying (3) and (4) for all constraints in $F$ .

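Let $w:2^{\cal H}\to\mathbb{R}\cup\{\infty\}$ , such that for all $F\subseteq{\cal H}$ , we have

$$w(F)=\begin{cases}\inf_{\mathbf{A}\in{\cal A}_{F}}r^{T}\mathbf{A}r&\text{if }{\cal A}_{F}\neq\emptyset\\ \infty&\text{if }{\cal A}_{F}=\emptyset\end{cases},$$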

where $r\in\mathbb{R}^{d}$ is a vector chosen uniformly at random from the unit sphere, that is, according to a rotationally-invariant probability measure. Such a vector can be chosen, for example, by first choosing some $r^{\prime}\in\mathbb{R}^{d}$ , where each coordinate is sampled independently from the normal distribution ${\cal N}(0,1)$ , and setting $r=r^{\prime}/\|r^{\prime}\|_{2}$ .
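The sampling recipe for $r$ translates directly into code. This is a minimal sketch using only the Python standard library; the function names are illustrative assumptions.

```python
import math
import random

def random_unit_vector(d, rng=random):
    """Sample r uniformly from the unit sphere in R^d by normalizing a
    vector of independent N(0, 1) coordinates."""
    while True:
        r_prime = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in r_prime))
        if norm > 0.0:  # guard against the probability-zero all-zero draw
            return [x / norm for x in r_prime]

def objective(A, r):
    """Evaluate the linear objective r^T A r for a list-of-rows matrix A."""
    n = len(r)
    return sum(r[i] * A[i][j] * r[j] for i in range(n) for j in range(n))
```

For a fixed `r`, `objective` is a linear function of the entries of `A`, which is the property the LP-type formulation relies on.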

Lemma 2.1.

When $w$ is chosen as above, the pair $({\cal H},w)$ defines an LP-type problem of combinatorial dimension $O(d^{2})$ , with probability 1. Moreover, for any $n>0$ , if each $r_{i}$ is chosen using $\Omega(\log n)$ bits of precision, then for each $F\subseteq{\cal H}$ , with $n=|F|$ , the assertion holds with high probability.

Proof.

Adding constraints can only shrink the feasible set, so the infimum cannot decrease, and it becomes $\infty$ if the instance turns infeasible. It follows that $w$ satisfies the monotonicity axiom (A1).

We next argue that the locality axiom (A2) also holds, with high probability. Let $F\subseteq G\subseteq{\cal H}$ , with $-\infty<w(F)=w(G)$ , and let $h\in{\cal H}$ , with $w(G)<w(G\cup\{h\})$ . Let $\mathbf{A}_{F}\in{\cal A}_{F}$ and $\mathbf{A}_{G}\in{\cal A}_{G}$ be some (not necessarily unique) infimizers of $r^{T}\mathbf{A}r$ , when $\mathbf{A}$ ranges in ${\cal A}_{F}$ and ${\cal A}_{G}$ respectively. The set ${\cal A}_{F}$ , viewed as a convex subset of $\mathbb{R}^{d^{2}}$ , is the intersection of the cone of positive semidefinite matrices with $n$ half-spaces, and thus ${\cal A}_{F}$ has at most $n$ facets. There are at least two distinct infimizers $\mathbf{A}_{G}\in{\cal A}_{G}$ only when the randomly chosen vector $r$ is orthogonal to a certain direction, which occurs with probability 0. When each entry of $r$ is chosen with $c\log n$ bits of precision, the probability that $r$ is orthogonal to any single hyperplane is at most $2^{-c\log n}=n^{-c}$ ; the assertion follows by a union bound over $n$ facets. This establishes that axiom (A2) holds with high probability.

It remains to bound the combinatorial dimension, $\kappa$ . Let $F\subseteq{\cal H}$ be a set of constraints. For each $\mathbf{A}\in{\cal A}_{F}$ , define the ellipsoid

$${\cal E}_{\mathbf{A}}=\{v\in\mathbb{R}^{d}:\|\mathbf{A}v\|_{2}=1\}.$$

For any $\mathbf{A},\mathbf{A}^{\prime}\in{\cal A}_{F}$ , with ${\cal E}_{\mathbf{A}}={\cal E}_{\mathbf{A}^{\prime}}$ , and $\mathbf{A}=\mathbf{G}^{T}\mathbf{G}$ , $\mathbf{A}^{\prime}=\mathbf{G}^{\prime T}\mathbf{G}^{\prime}$ , we have that for all $p,q\in\mathbb{R}^{d}$ , $\|\mathbf{G}p-\mathbf{G}q\|_{2}^{2}=(p-q)^{T}\mathbf{A}(p-q)=(p-q)^{T}\mathbf{A}^{\prime}(p-q)=\|\mathbf{G}^{\prime}p-\mathbf{G}^{\prime}q\|_{2}^{2}$ . Therefore in order to specify a linear transformation $\mathbf{G}$ , up to an isometry, it suffices to specify the ellipsoid ${\cal E}_{\mathbf{A}}$ .

Each $\{p,q\}\in{\cal S}$ corresponds to the constraint that the point $(p-q)/u$ must lie in ${\cal E}_{\mathbf{A}}$ . Similarly each $\{p,q\}\in{\cal D}$ corresponds to the constraint that the point $(p-q)/\ell$ must lie either on the boundary or the exterior of ${\cal E}_{\mathbf{A}}$ . Any ellipsoid in $\mathbb{R}^{d}$ is uniquely determined by specifying at most $(d+3)d/2=O(d^{2})$ distinct points on its boundary (see [Welzl(1991), Chazelle(2000)]). Therefore, each optimal solution can be uniquely specified as the intersection of at most $O(d^{2})$ constraints, and thus the combinatorial dimension is $O(d^{2})$ . ∎
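One way to see where the count $(d+3)d/2$ comes from is a parameter count for a general ellipsoid $\{x:(x-c)^{T}\mathbf{M}(x-c)=1\}$ with symmetric $\mathbf{M}$ and center $c$. This is an illustrative remark under that parametrization, not part of the cited proofs.

```latex
\[
\underbrace{\frac{d(d+1)}{2}}_{\text{entries of symmetric } \mathbf{M}}
\;+\;
\underbrace{d}_{\text{coordinates of } c}
\;=\;
\frac{d(d+3)}{2},
\qquad
\text{e.g. } d=2:\; 3+2=5 .
\]
```

For $d=2$ this matches the classical fact that five points in general position determine a conic. The ellipsoids ${\cal E}_{\mathbf{A}}$ used here are centered at the origin, so the bound is not tight for them, but it is still $O(d^{2})$.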

Lemma 2.2.

Any initial basis computation (B0), any violation test (B1), and any basis computation (B2) can be performed in time $d^{O(1)}$ .

Proof.

The violation test (B1) can be performed by solving one SDP to compute $w(B)$ , and another to compute $w(B\cup\{h\})$ . By Lemma 2.1 the combinatorial dimension is $O(d^{2})$ , thus each SDP has $O(d^{2})$ constraints, and can be solved in time $d^{O(1)}$ .

The basis computation step (B2) can be performed by starting with the set of constraints $B\cup\{h\}$ , and iteratively removing every constraint whose removal does not decrease the optimum cost, until we arrive at a minimal set, which is a basis. In total, we need to solve at most $d$ SDPs, each of size $O(d^{2})$ , which can be done in total time $d^{O(1)}$ .

Finally, by the choice of $w$ , any set containing a single constraint in ${\cal S}$ is a valid initial basis. ∎

2.3 Algorithmic implications

Using the above formulation of Mahalanobis metric learning as an LP-type problem, we can obtain our approximation scheme. Our algorithm uses as a subroutine an exact algorithm for the problem (that is, for the special case where we seek to find a mapping that satisfies all constraints). We first present the exact algorithm and then show how it can be used to derive the approximation scheme.

An exact algorithm.

[Welzl(1991)] obtained a simple randomized linear-time algorithm for the minimum enclosing ball and minimum enclosing ellipsoid problems. This algorithm naturally extends to general LP-type problems (we refer the reader to [Har-Peled(2011), Chazelle(2000)] for further details).

With the interpretation of Mahalanobis metric learning as an LP-type problem given above, we thus obtain a linear time algorithm for the problem in $\mathbb{R}^{d}$ , for any constant $d\in\mathbb{N}$ . The resulting algorithm on a set of constraints $F\subseteq{\cal H}$ is implemented by the procedure $\mathsf{Exact\text{-}LPTML}(F;\emptyset)$ , which is presented in Algorithm 1. The procedure $\mathsf{Exact\text{-}LPTML}(F;B)$ takes as input sets of constraints $F,B\subseteq{\cal H}$ . It outputs a solution $\mathbf{A}\in\mathbb{R}^{d\times d}$ to the problem induced by the set of constraints $F\cup B$ , such that all constraints in $B$ are tight (that is, they hold with equality); if no such solution exists, then it returns $\mathsf{nil}$ . The procedure $\mathsf{Basic\text{-}LPTML}(B)$ computes $\mathsf{Exact\text{-}LPTML}(\emptyset;B)$ . The analysis of [Welzl(1991)] implies that when $\mathsf{Basic\text{-}LPTML}(B)$ is called, the cardinality of $B$ is at most the combinatorial dimension, which by Lemma 2.1 is $O(d^{2})$ . Thus the procedure $\mathsf{Basic\text{-}LPTML}$ can be implemented using one initial basis computation (B0) and $O(d^{2})$ basis computations (B2), which by Lemma 2.2 takes total time $d^{O(1)}$ .
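Algorithm 1 itself is not reproduced in this text. As an illustration of the recursion pattern with a set $F$ of remaining constraints and a set $B$ of tight constraints, here is a sketch of Welzl's algorithm for the analogous smallest enclosing circle problem in the plane. It is a stand-in for a different LP-type problem, not the authors' procedure; all names are hypothetical and the points are assumed to be in general position.

```python
import math
import random

def _circle_from(boundary):
    """Smallest circle having every point of `boundary` (at most 3) on it."""
    if not boundary:
        return None
    if len(boundary) == 1:
        return (boundary[0], 0.0)
    if len(boundary) == 2:
        (ax, ay), (bx, by) = boundary
        return (((ax + bx) / 2, (ay + by) / 2),
                math.dist(boundary[0], boundary[1]) / 2)
    (ax, ay), (bx, by), (cx, cy) = boundary
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    det = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / det
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / det
    return ((ux, uy), math.dist((ux, uy), boundary[0]))

def _inside(circle, p, eps=1e-9):
    """Violation test: does the current optimum already contain p?"""
    return circle is not None and math.dist(circle[0], p) <= circle[1] + eps

def exact_enclosing(points, boundary=()):
    """Smallest circle enclosing `points` with all of `boundary` on the
    circle; mirrors the (F; B) calling convention described in the text."""
    if not points or len(boundary) == 3:
        return _circle_from(list(boundary))
    *rest, p = points
    circle = exact_enclosing(rest, boundary)
    if _inside(circle, p):
        return circle
    # p violates the current optimum, so it must be tight in the new one.
    return exact_enclosing(rest, tuple(boundary) + (p,))

def smallest_enclosing_circle(points, seed=0):
    """Shuffle once, then run the recursion with an empty tight set."""
    pts = list(points)
    random.Random(seed).shuffle(pts)
    return exact_enclosing(pts)
```

The recursion depth grows with the number of points, so this sketch is only suitable for small inputs; an iterative move-to-front variant avoids that limitation.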

A $(1+\varepsilon)$ -approximation algorithm.

It is known that the above exact linear-time algorithm leads to a nearly-linear-time approximation scheme for LP-type problems. This is summarized in the following. We refer the reader to [Har-Peled(2011)] for a more detailed treatment.

Lemma 2.3 ([Har-Peled(2011)], Ch. 15).

Let ${\cal A}$ be some LP-type problem of combinatorial dimension $\kappa>0$ , defined by some pair $({\cal H},w)$ , and let $\varepsilon>0$ . There exists a randomized algorithm which given some instance $F\subseteq{\cal H}$ , with $|F|=n$ , outputs some basis $B\subseteq F$ , that violates at most $(1+\varepsilon)k$ constraints in $F$ , such that $w(B)\leq w(B^{\prime})$ , for any basis $B^{\prime}$ violating at most $k$ constraints in $F$ , in time $O\left(t_{0}+\left(n+n\min\left\{\frac{\log^{\kappa+1}n}{\varepsilon^{2\kappa}},\frac{\log^{\kappa+2}n}{k\varepsilon^{2\kappa+2}}\right\}\right)(t_{1}+t_{2})\right)$ , where $t_{0}$ is the time needed to compute an arbitrary initial basis of ${\cal A}$ (operation (B0)), and $t_{1}$ and $t_{2}$ are upper bounds on the time needed to perform the basic operations (B1) and (B2) respectively. The algorithm succeeds with high probability.

For the special case of Mahalanobis metric learning, the corresponding algorithm is given in Algorithm 2. The approximation guarantee for this algorithm is summarized in Theorem 1.1. We can now give the proof of our main result.

Proof of Theorem 1.1.

Follows immediately by Lemmas 2.2 and 2.3. ∎

Regularization.

We now argue that the LP-type algorithm described above can be extended to handle certain types of regularization on the matrix $\mathbf{A}$ . In methods based on convex optimization, introducing regularizers that are convex functions can often be done easily. In our case, we cannot directly introduce a regularizing term in the objective function that is implicit in Algorithm 2. More specifically, let $\mathsf{cost}(\mathbf{A})$ denote the total number of constraints of type (3) and (4) that $\mathbf{A}$ violates. Algorithm 2 approximately minimizes the objective function $\mathsf{cost}(\mathbf{A})$ . A natural regularized version of Mahalanobis metric learning is to instead minimize the objective function $\mathsf{cost}^{\prime}(\mathbf{A}):=\mathsf{cost}(\mathbf{A})+\eta\cdot\text{reg}(\mathbf{A})$ , for some $\eta>0$ , and regularizer $\text{reg}(\mathbf{A})$ . One typical choice is $\text{reg}(\mathbf{A})=\text{tr}(\mathbf{A}\mathbf{C})$ , for some matrix $\mathbf{C}\in\mathbb{R}^{d\times d}$ ; the case $\mathbf{C}=\mathbf{I}$ corresponds to the trace norm (see [Kulis et al.(2013)]). We can extend Algorithm 2 to handle any regularizer that can be expressed as a linear function on the entries of $\mathbf{A}$ , such as $\text{tr}(\mathbf{A})$ . The following summarizes the result.

Theorem 2.4.

Let $\text{reg}(\mathbf{A})$ be a linear function on the entries of $\mathbf{A}$ , with polynomially bounded coefficients. For any $d\in\mathbb{N}$ , $\varepsilon>0$ , there exists a randomized algorithm for learning $d$ -dimensional Mahalanobis metric spaces, which given an instance that admits a solution $\mathbf{A}_{0}$ with $\mathsf{cost}^{\prime}(\mathbf{A}_{0})=c^{*}$ , computes a solution $\mathbf{A}$ with $\mathsf{cost}^{\prime}(\mathbf{A})\leq(1+\varepsilon)c^{*}$ , in time $d^{O(1)}n(\log{n}/\varepsilon)^{O(d)}$ , with high probability.

Proof.

If $\eta<\varepsilon^{t}$ , for sufficiently large constant $t>0$ , since the coefficients in $\text{reg}(\mathbf{A})$ are polynomially bounded, it follows that the largest possible value of $\eta\cdot\text{reg}(\mathbf{A})$ is $O(\varepsilon)$ , and can thus be omitted without affecting the result. Similarly, if $\eta>(1/\varepsilon)n^{t^{\prime}}$ , for sufficiently large constant $t^{\prime}>0$ , since there are at most ${n\choose 2}$ constraints, it follows that the term $\mathsf{cost}(\mathbf{A})$ can be omitted from the objective. Therefore, we may assume w.l.o.g. that $\text{reg}(\mathbf{A}_{0})\in[\varepsilon^{O(1)},(1/\varepsilon)n^{O(1)}]$ . We can guess some $i=O(\log n+\log(1/\varepsilon))$ , such that $\text{reg}(\mathbf{A}_{0})\in((1+\varepsilon)^{i-1},(1+\varepsilon)^{i}]$ . We modify the SDP used in the proof of Lemma 2.2 by introducing the constraint $\text{reg}(\mathbf{A})\leq(1+\varepsilon)^{i}$ . Guessing the correct value of $i$ requires $O(\log n+\log(1/\varepsilon))$ executions of Algorithm 2, which implies the running time bound. ∎
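The guessing step in the proof amounts to trying a geometric grid of thresholds. The sketch below enumerates such a grid; the function name and the concrete range endpoints are assumptions for illustration, and the call to the modified SDP is deliberately left out.

```python
import math

def regularizer_guesses(lo, hi, eps):
    """Return thresholds (1 + eps)^i such that every value v in [lo, hi]
    satisfies T / (1 + eps) < v <= T for some returned threshold T."""
    base = 1.0 + eps
    i_min = math.floor(math.log(lo) / math.log(base))
    i_max = math.ceil(math.log(hi) / math.log(base))
    return [base ** i for i in range(i_min, i_max + 1)]
```

Each threshold `T` would be used as the right-hand side of the extra constraint `reg(A) <= T` in one run of the algorithm, and the best of the resulting solutions is kept.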

3 Practical Improvements and Parallelization

We now discuss some modifications of the algorithm described in the previous section that significantly improve its performance in practical scenarios, and have been integrated in our implementation.

Move-to-front and pivoting heuristics.

We use heuristics that have been previously used in algorithms for linear programming [Seidel(1990), Clarkson(1995)], minimum enclosing ball in $\mathbb{R}^{3}$ [Megiddo(1983)], minimum enclosing ball and ellipsoid in $\mathbb{R}^{d}$ , for any fixed $d\in\mathbb{N}$ [Welzl(1991)], as well as in fast implementations of minimum enclosing ball algorithms [Gärtner(1999)]. The move-to-front heuristic keeps an ordered list of constraints which gets reorganized as the algorithm runs; when the algorithm finds a violation, it moves the violating constraint to the beginning of the list of the current sub-problem. The pivoting heuristic further improves performance by choosing to add to the basis the constraint that is “violated the most”. For instance, for similarity constraints, we pick the one that is mapped to the largest distance greater than $u$ ; for dissimilarity constraints, we pick the one that is mapped to the smallest distance less than $\ell$ .
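Both heuristics are simple to express in code. The sketch below assumes each constraint is given as an identifier together with its current mapped distance; this data layout and the function names are hypothetical. How a similarity violation is ranked against a dissimilarity violation is not specified in the text, so the sketch reports one candidate per type.

```python
def pivot_candidates(similar, dissimilar, u, ell):
    """similar / dissimilar: lists of (constraint_id, mapped_distance).
    Returns the most violated constraint of each type, or None when no
    constraint of that type is violated."""
    viol_s = [(dist, cid) for cid, dist in similar if dist > u]
    viol_d = [(dist, cid) for cid, dist in dissimilar if dist < ell]
    worst_s = max(viol_s)[1] if viol_s else None  # largest distance above u
    worst_d = min(viol_d)[1] if viol_d else None  # smallest distance below ell
    return worst_s, worst_d

def move_to_front(constraints, violator):
    """Move `violator` to the head of the ordered constraint list in place."""
    constraints.remove(violator)
    constraints.insert(0, violator)
    return constraints
```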

Approximate counting.

The main loop of Algorithm 2 involves counting the number of violated constraints in each iteration. In problems involving a large number of constraints, we use approximate counting by only counting the number of violations within a sample of $O(\log 1/\varepsilon)$ constraints.
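Approximate counting can be sketched as a sample-and-rescale estimate. The function below is an illustrative assumption about how such an estimate might be computed; the sample size is a parameter rather than the specific bound quoted above.

```python
import random

def estimate_violations(constraints, is_violated, sample_size, rng=random):
    """Estimate how many constraints are violated by testing a uniform
    sample without replacement and rescaling to the full set."""
    n = len(constraints)
    if sample_size >= n:
        return sum(1 for c in constraints if is_violated(c))
    sample = rng.sample(constraints, sample_size)
    hits = sum(1 for c in sample if is_violated(c))
    return hits * n / sample_size
```

The estimate trades accuracy for speed: a larger `sample_size` lowers the variance, and a sample covering every constraint reproduces the exact count.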

Early termination.

A bottleneck of Algorithm 2 stems from the fact that the inner loop needs to be executed for $\log^{O(d^{2})}n$ iterations. In practice, we have observed that a significantly smaller number of iterations is needed to achieve high accuracy. We denote by $\mathsf{LPTML}_{t}$ the version of the algorithm that performs a total of $t$ iterations of the inner loop.

Parallelization.

Algorithm 2 consists of several executions of the algorithm $\mathsf{Exact\text{-}LPTML}$ on independently sampled sub-problems. Therefore, Algorithm 2 can trivially be parallelized by distributing a different set of sub-problems to each machine, and returning the best solution found overall.
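The sample, solve, and keep-the-best pattern can be illustrated on a one-dimensional analogue in which the unknown is a single scale factor `g`. Everything in this sketch is an assumption made for illustration: the toy problem, the function names, and the use of a thread pool in place of separate machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def solve_sample(sample, u, ell):
    """Exact solver for the 1-D toy problem. Each constraint is
    (label, gap) with gap > 0: label 0 (similar) needs g * gap <= u,
    label 1 (dissimilar) needs g * gap >= ell. Returns a feasible g or None."""
    lo = max((ell / gap for label, gap in sample if label == 1), default=0.0)
    hi = min((u / gap for label, gap in sample if label == 0),
             default=float("inf"))
    if lo > hi:
        return None
    return lo if hi == float("inf") else (lo + hi) / 2

def violations(g, constraints, u, ell):
    """Number of constraints violated by the scale factor g."""
    return sum((label == 0 and g * gap > u) or (label == 1 and g * gap < ell)
               for label, gap in constraints)

def one_trial(args):
    """Solve one independently sampled sub-problem and score it globally."""
    constraints, u, ell, sample_size, seed = args
    sample = random.Random(seed).sample(constraints, sample_size)
    g = solve_sample(sample, u, ell)
    if g is None:
        return (len(constraints) + 1, None)  # worse than any real solution
    return (violations(g, constraints, u, ell), g)

def parallel_best(constraints, u, ell, sample_size, trials, workers=4):
    """Run the trials concurrently and keep the fewest violations."""
    jobs = [(constraints, u, ell, sample_size, seed) for seed in range(trials)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_trial, jobs))
    return min(results, key=lambda r: r[0])
```

Because the trials share no state, the thread pool could be swapped for `concurrent.futures.ProcessPoolExecutor` or a cluster framework without changing `one_trial`.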

4 Experimental Evaluation

We have implemented Algorithm 2, incorporating the practical improvements described in Section 3, and performed experiments on synthetic and real-world data sets. Our $\mathsf{LPTML}$ implementation and documentation can be found in our repository. We now describe the experimental setting and discuss the main findings.

4.1 Experimental Setting

Classification task.

Each data set used in the experiments consists of a set of labeled points in $\mathbb{R}^{d}$ . The label of each point indicates its class, and there is a constant number of classes. The set of similarity constraints ${\cal S}$ (resp. dissimilarity constraints ${\cal D}$ ) is formed by uniformly sampling pairs of points in the same class (resp. from different classes). We use various algorithms to learn a Mahalanobis metric for a labeled input point set in $\mathbb{R}^{d}$ , given these constraints. The values $u$ and $\ell$ are chosen as the $90$ th and $10$ th percentiles of all pairwise distances. We used 2-fold cross-validation: at the training phase we learn a Mahalanobis metric, and in the testing phase we use $k$ -NN classification, with $k=4$ , to evaluate the performance.
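The threshold selection and the evaluation step can be sketched as follows. The percentile convention (linear interpolation between order statistics), the tie-breaking in the vote, and all function names are assumptions; the assignment of the 90th percentile to `u` and the 10th to `ell` follows the order in which the sentence above lists them.

```python
import math
from itertools import combinations

def percentile(sorted_vals, q):
    """q-th percentile (0..100), interpolating between order statistics."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def thresholds(points, q_u=90, q_ell=10):
    """Return (u, ell) as percentiles of all pairwise Euclidean distances."""
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    return percentile(dists, q_u), percentile(dists, q_ell)

def transform(G, p):
    """Apply the learned linear map G (list of rows) to the point p."""
    return [sum(a * b for a, b in zip(row, p)) for row in G]

def knn_predict(G, train, labels, x, k=4):
    """Majority vote among the k training points nearest to x after
    mapping by G; ties are broken arbitrarily."""
    gx = transform(G, x)
    order = sorted(range(len(train)),
                   key=lambda i: math.dist(transform(G, train[i]), gx))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)
```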

Data sets.

We have tested our algorithm on the following synthetic and real-world data sets:

1. Real-world: We have tested the performance of our implementation on the Iris, Wine, Ionosphere and Soybean data sets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php).

2. Synthetic: Next, we consider a synthetic data set that is constructed by first sampling a set of $100$ points from a mixture of two Gaussians in $\mathbb{R}^{2}$ , with identity covariance matrices, and with means $(-3,0)$ and $(3,0)$ respectively; we then apply a linear transformation that stretches the $y$ axis by a factor of $40$ . This linear transformation reduces the accuracy of $k$ -NN on the underlying Euclidean metric with $k=4$ from 1 to 0.68.

3. Data poisoning: We modify the above synthetic data set by introducing a small fraction of points in an adversarial manner, before applying the linear transformation. Figure 2(b) depicts the noise added as five points labeled as one of the classes, and sampled from a Gaussian with identity covariance matrix and mean $(-100,0)$ (Figure 2(a)).
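The synthetic and poisoned data sets described in items 2 and 3 can be generated along these lines. This is a sketch under stated assumptions: equal mixture weights, alternating class assignment, and poison points carrying the label of the class centered at $(3,0)$ are choices made here, since the text does not specify them.

```python
import random

def make_synthetic(n=100, stretch=40.0, n_poison=0, seed=0):
    """Two unit-covariance Gaussians in R^2 with means (-3, 0) and (3, 0),
    optional poison points near (-100, 0), then the y axis is stretched."""
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n):
        label = i % 2  # assumed: equal mixture weights
        mean_x = -3.0 if label == 0 else 3.0
        points.append((rng.gauss(mean_x, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(label)
    for _ in range(n_poison):
        # Poison is added before the linear transformation, as in item 3.
        points.append((rng.gauss(-100.0, 1.0), rng.gauss(0.0, 1.0)))
        labels.append(1)  # assumed: poison carries the label of class 1
    stretched = [(x, stretch * y) for x, y in points]
    return stretched, labels
```

Calling `make_synthetic(n_poison=5)` yields 105 labeled points, the last five of which are the adversarial ones.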

Algorithms.

We compare the performance of our algorithm against $\mathsf{ITML}$ and $\mathsf{LMNN}$ . We used the implementations provided by the authors of these works, with minor modifications.

4.2 Results

Accuracy.

Algorithm 2 minimizes the number of violated pairwise distance constraints. It is interesting to examine the effect of this objective function on the accuracy of $k$ -NN classification. Figure 1 depicts this relationship for the Wine data set. We observe that, in general, as the number of iterations of the main loop of $\mathsf{LPTML}$ increases, the number of violated pairwise distance constraints decreases, and the accuracy of $k$ -NN increases. This phenomenon remains consistent when we first perform PCA to $d=4,8,12$ dimensions.

Comparison to $\mathsf{ITML}$ and $\mathsf{LMNN}$ .

We compared the accuracy obtained by $\mathsf{LPTML}_{t}$ , for $t=2000$ iterations, against $\mathsf{ITML}$ and $\mathsf{LMNN}$ .

Table 1 summarizes the findings on the real-world data sets and the synthetic data set without adversarial noise. We observe that $\mathsf{LPTML}$ achieves accuracy that is comparable to $\mathsf{ITML}$ and $\mathsf{LMNN}$ .

We observe that $\mathsf{LPTML}$ outperforms $\mathsf{ITML}$ and $\mathsf{LMNN}$ on the poisoned data set. This is due to the fact that the introduction of adversarial noise causes the relaxations used in $\mathsf{ITML}$ and $\mathsf{LMNN}$ to be biased towards contracting the $x$ -axis. In contrast, the noise does not “fool” $\mathsf{LPTML}$ because it only changes the optimal accuracy by a small amount. The results are summarized in Figure 3.

The effect of dimension.

The running time of $\mathsf{LPTML}$ grows with the dimension $d$ . This is caused mostly by the fact that the combinatorial dimension of the underlying LP-type problem is $O(d^{2})$ , and thus performing each basic operation requires solving an SDP with $O(d^{2})$ constraints. Figure 4 depicts the effect of dimensionality on the running time, for $t=100,\ldots,2000$ iterations of the main loop. The data set used is Wine after performing PCA to $d$ dimensions, for $d=2,\ldots,13$ .

Parallel implementation.

We implemented a massively parallel version of $\mathsf{LPTML}$ in the MapReduce model. The program maps different sub-problems of the main loop of $\mathsf{LPTML}$ to different machines. In the reduce step, we keep the result with the minimum number of constraint violations. The implementation uses the mrjob [Yelp and Contributors(2019)] package. For these experiments, we used Amazon cloud computing instances of type m4.xlarge, AMI 5.20.0 and configured with Hadoop. As expected, the training time decreases as the number of available processors increases (Figure 5). All technical details about this implementation can be found in the parallel section of the documentation of our code.

5 Conclusions

We have shown that the problem of learning a Mahalanobis metric space can be cast as an LP-type problem. This formulation allows us to obtain an efficient approximation scheme using tools from the theory of linear programming in low dimensions. Specifically, we present a near-linear time $(1+\varepsilon)$ -approximation algorithm that minimizes the number of violated constraints. Experimental evaluation demonstrates that when compared to prior work, our method is significantly more robust against small adversarial modifications of the input labelling. Our approach also leads to a fully parallelizable algorithm.

It is an interesting research direction to extend our approximation algorithm to other classes of metric learning problems. One such case is when the input is specified as a set of ordered triples $(x,y,z)$ , and the goal is to find a mapping $f$ with $\|f(x)-f(y)\|_{2}\leq\|f(x)-f(z)\|_{2}-m$ , for some margin $m>0$ (see [Weinberger and Saul 2009]). Another important direction is to obtain geometric approximation algorithms for non-linear metric learning primitives, such as mappings computed by small depth neural networks.

Bibliography

[Chazelle(2000)] B. Chazelle. The discrepancy method: randomness and complexity. Cambridge University Press, 2000.
[Clarkson(1995)] K. L. Clarkson. Las Vegas algorithms for linear and integer programming when the dimension is small. Journal of the ACM (JACM), 42(2):488–499, 1995.
[Davis et al.(2007)Davis, Kulis, Jain, Sra, and Dhillon] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th international conference on Machine learning, pages 209–216. ACM, 2007.
[Gärtner(1999)] B. Gärtner. Fast and robust smallest enclosing balls. In European Symposium on Algorithms, pages 325–338. Springer, 1999.
[Globerson and Roweis(2006)] A. Globerson and S. T. Roweis. Metric learning by collapsing classes. In Advances in neural information processing systems, pages 451–458, 2006.
[Har-Peled(2011)] S. Har-Peled. Geometric approximation algorithms. Number 173. American Mathematical Soc., 2011.
[Kulis et al.(2013)] B. Kulis et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.
[Li and Tian(2018)] D. Li and Y. Tian. Survey and experimental study on metric learning methods. Neural Networks, 105:447–462, 2018.
