Less but Better: Generalization Enhancement of Ordinal Embedding via Distributional Margin

Ke Ma; Qianqian Xu; Zhiyong Yang; Xiaochun Cao

arXiv:1812.01939 · cs.LG · December 6, 2018

TL;DR

This paper introduces a novel margin distribution learning approach called DMOE to improve the generalization of ordinal embedding, especially with limited comparison data, by optimizing margin distribution rather than just margin size.

Contribution

The paper proposes a new paradigm for ordinal embedding that focuses on margin distribution, with a specific objective function and an efficient optimization algorithm, enhancing generalization with fewer samples.

Findings

01

DMOE outperforms classical methods on simulated datasets.

02

The approach improves embedding quality with limited comparison data.

03

Experimental results validate the effectiveness of margin distribution optimization.

Abstract

In the absence of prior knowledge, ordinal embedding methods obtain a new representation for items in a low-dimensional Euclidean space via a set of quadruple-wise comparisons. These ordinal comparisons often come from human annotators, and the success of classical approaches depends on having sufficient comparisons. However, collecting a large amount of labeled data is known to be a hard task, and most existing work pays little attention to generalization ability with insufficient samples. Meanwhile, recent progress in large margin theory discloses that, rather than just maximizing the minimum margin, both the margin mean and variance, which characterize the margin distribution, are more crucial to the overall generalization performance. To address the issue of insufficient training samples, we propose a margin distribution learning paradigm for ordinal embedding, entitled Distributional…

Tables

Table 1: Performance comparison on the synthetic dataset with 200, 1000 and 10000 samples as training data, respectively.

(a) 200 samples

algorithm	min	median	max	std
GNMDS-$p$	0.419	0.447	0.476	0.016
STE-$p$	0.397	0.426	0.461	0.016
TSTE-$p$	0.440	0.468	0.498	0.014
DMOE	0.372	0.390	0.410	0.011

(b) 1000 samples

algorithm	min	median	max	std
GNMDS-$p$	0.318	0.341	0.359	0.009
STE-$p$	0.375	0.385	0.401	0.007
TSTE-$p$	0.426	0.441	0.466	0.011
DMOE	0.281	0.298	0.305	0.008

(c) 10000 samples

algorithm	min	median	max	std
GNMDS-$p$	0.143	0.147	0.154	0.007
STE-$p$	0.219	0.234	0.251	0.007
TSTE-$p$	0.238	0.257	0.271	0.011
DMOE	0.142	0.146	0.151	0.008

Table 2: Performance comparison on the music artists dataset with 200, 500, 1000 and 5000 samples as training data.

(a) 200 samples

algorithm	min	median	max	std
GNMDS-$p$	0.391	0.403	0.416	0.007
STE-$p$	0.444	0.455	0.475	0.008
TSTE-$p$	0.416	0.436	0.458	0.011
DMOE	0.372	0.385	0.400	0.007

(b) 1000 samples

algorithm	min	median	max	std
GNMDS-$p$	0.307	0.317	0.332	0.007
STE-$p$	0.397	0.415	0.429	0.007
TSTE-$p$	0.377	0.389	0.406	0.007
DMOE	0.281	0.291	0.307	0.007

(c) 5000 samples

algorithm	min	median	max	std
GNMDS-$p$	0.225	0.239	0.257	0.006
STE-$p$	0.252	0.275	0.294	0.011
TSTE-$p$	0.243	0.259	0.297	0.013
DMOE	0.216	0.227	0.244	0.007

Equations

$$\mathcal{Q}=\{q=(i,j,l,k)\mid i\neq j,\ l\neq k,\ i,j,l,k\in[n]\}$$

$$y_{q}=\begin{cases}+1, & \text{if } \zeta_{ij}>\zeta_{lk},\\ -1, & \text{if } \zeta_{ij}<\zeta_{lk}.\end{cases}$$

$$\textit{sign}(y_{q}\cdot\Delta_{q}\boldsymbol{D})>0,\quad\forall\ y_{q}\in\mathcal{Y}_{\mathcal{Q}},$$

$$\Delta_{q}\boldsymbol{D}=d^{2}_{ij}-d^{2}_{lk}=\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}_{2}-\|\boldsymbol{x}_{l}-\boldsymbol{x}_{k}\|^{2}_{2}.$$

$$\gamma_{q}=y_{q}\cdot\Delta_{q}\boldsymbol{D},$$

$$l(x)\ \begin{cases}>0, & \text{if } x<0,\\ \leq 0, & \text{if } x>0.\end{cases}$$

$$\min_{\boldsymbol{G}}\ \dots\quad\text{s.t.}\ \dots$$

$$\min_{\boldsymbol{G},\boldsymbol{\xi}}\ \dots\quad\text{s.t.}\ \dots,\quad\boldsymbol{G}\succeq 0,\ \textit{rank}(\boldsymbol{G})\leq p,$$

$$\hat{\gamma}=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}(\gamma_{q}-\bar{\gamma})^{2}.$$

$$\min_{\boldsymbol{G},\boldsymbol{\xi}}\ \dots\quad\text{s.t.}\ \dots,\quad\boldsymbol{G}\succeq 0,\ \textit{rank}(\boldsymbol{G})\leq p,$$

$$\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\gamma_{q}=\tilde{\gamma}_{0}.$$

$$|\gamma_{q}-\tilde{\gamma}_{0}|\leq\varepsilon_{q},\quad\forall\ q\in\mathcal{Q}.$$

$$\gamma_{q}\geq\tilde{\gamma}_{0}-\xi_{q}$$

$$\min_{\boldsymbol{G},\boldsymbol{\xi},\boldsymbol{\varepsilon}}\ \dots\quad\text{s.t.}\ \dots,\quad\xi_{q}\geq 0,\ \varepsilon_{q}\geq 0,\ \forall\ q\in\mathcal{Q},\quad\boldsymbol{G}\succeq 0,\ \textit{rank}(\boldsymbol{G})\leq p.$$

$$\ell_{\tilde{\gamma}_{0},\nu}(\gamma_{q})=\max(\tilde{\gamma}_{0}-\gamma_{q},0)+\nu\cdot\max(\gamma_{q}-\tilde{\gamma}_{0},0).$$

$$|\zeta|_{\tau}=\begin{cases}0, & \text{if } |\zeta|\leq\tau,\\ |\zeta|-\tau, & \text{otherwise}\end{cases}$$

$$\mathcal{G}_{\mu}=\{\boldsymbol{G}\in\mathbb{S}^{n}_{+}:\|\boldsymbol{G}\|_{\infty}\leq\mu,\ \|\boldsymbol{G}\|_{*}\leq\lambda\}$$

$$R(\hat{\boldsymbol{G}})-R(\boldsymbol{G}^{*})\leq\dots$$

$$R(\boldsymbol{G})=\mathbb{E}[\ell(\gamma_{q})]=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\big[p_{q}\,\ell(\gamma_{q})+(1-p_{q})\,\ell(-\gamma_{q})\big],$$

Code & Models

Repositories

alphaprime/DMOE (Official)

Taxonomy

Topics: Face and Expression Recognition · Domain Adaptation and Few-Shot Learning · Text and Document Classification Technologies

Full text

Less but Better:

Generalization Enhancement of Ordinal Embedding via Distributional Margin

Ke Ma1,2, Qianqian Xu3, Zhiyong Yang1,2, Xiaochun Cao1

1 State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences

2 School of Cyber Security, University of Chinese Academy of Sciences

3 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences

{make, yangzhiyong, caoxiaochun}@iie.ac.cn, [email protected] The corresponding authors.

Abstract

In the absence of prior knowledge, ordinal embedding methods obtain a new representation for items in a low-dimensional Euclidean space via a set of quadruple-wise comparisons. These ordinal comparisons often come from human annotators, and the success of classical approaches depends on having sufficient comparisons. However, collecting a large amount of labeled data is known to be a hard task, and most existing work pays little attention to generalization ability with insufficient samples. Meanwhile, recent progress in large margin theory discloses that, rather than just maximizing the minimum margin, both the margin mean and variance, which characterize the margin distribution, are more crucial to the overall generalization performance. To address the issue of insufficient training samples, we propose a margin distribution learning paradigm for ordinal embedding, entitled Distributional Margin based Ordinal Embedding (DMOE). Precisely, we first define the margin for the ordinal embedding problem. Secondly, we formulate a concise objective function which avoids maximizing the margin mean and minimizing the margin variance directly but exhibits a similar effect. Moreover, an Augmented Lagrange Multiplier based algorithm is customized to seek the optimal solution of DMOE effectively. Experimental studies on both simulated and real-world datasets are provided to show the effectiveness of the proposed algorithm.

The problem of analyzing a set of $n$ objects given similarity information is an inherent part in a broad variety of tasks in artificial intelligence (?; ?), machine learning (?; ?; ?; ?), information retrieval (?), data mining (?) and computer vision (?). Many algorithms are based on the assumption that ‘similar’ inputs should generate ‘close’ outputs. In a numerical setting of embedding, a similarity function (or, equivalently, a dissimilarity function) quantifies how ‘similar’ objects are to others. The required input is the distance or similarity matrix of items. We calculate a set of embedded points which aims to preserve such similarities as well as possible. However, in recent years a whole new branch of the literature has emerged, which is the comparison-based embedding. Instead of evaluating similarity directly, we collect the similarity comparisons as follows:

“Is the similarity between object $i$ and $j$ larger than the similarity between $l$ and $k$ ?”

The corresponding problem is ordinal embedding. These two types of supervision information, numerical similarities and relative comparisons, are all generated by human beings. Nevertheless, the latter one provides similarity estimates on a relative scale instead of the absolute scale. The comparison-based setting is a special case of the observation that humans are better at comparing two stimuli than at identifying a single one (?). Consequently, the relative comparison is a more reliable form for incorporating human knowledge with artificial intelligence tasks.

The ordinal embedding problem was first studied by (?; ?; ?; ?) in the psychometrics community. In recent years, it has drawn a lot of attention (?; ?; ?; ?; ?; ?; ?; ?; ?). One class of typical methods is margin-based ordinal embedding, which solves the problem under the classification framework. The well-known Generalized Non-Metric Multidimensional Scaling (GNMDS) (?) aims at finding a Gram matrix $\boldsymbol{G}$ such that the pairwise distances of embedded points satisfy the partial order constraints. Stochastic Triplet Embedding (STE/TSTE) (?) jointly penalizes the violated constraints and rewards the satisfied constraints via a logistic loss. Multi-view Triplet Embedding (MTE) (?) decomposes the STE objective function into different components and re-weights them for a better explanation. The other class of ordinal embedding methods uses nearest neighbor graphs to model the similarity comparisons. Structure Preserving Embedding (SPE) (?) and Local Ordinal Embedding (LOE) (?) embed unweighted nearest neighbor graphs into Euclidean spaces with convex and non-convex objective functions, respectively. The nearest neighbor adjacency matrix can be transformed into ordinal constraints, but such a matrix is not normally available in comparison-based scenarios. Because of this limitation, SPE and LOE are not suitable for ordinal embedding via quadruplet or triplet comparisons.

A common issue of the existing ordinal embedding methods is their dependence on large samples of similarity comparisons. (?; ?) show the consistency of the ordinal embedding problem. When the number of objects $n$ tends to infinity, the set of embedded points always converges to the set of original points, up to similarity transformations; the rate of convergence depends on the Hausdorff distance between the ground-truth points. Later, (?) show a finite-sample consistency result. Learning an embedding which predicts nearly as well as the true embedding needs $\boldsymbol{\Theta}(pn\log n)$ samples, where $p$ is the embedding dimension. This result relies on the strong condition that the triple-wise comparisons are generated from the classical Bradley-Terry-Luce (BTL) model (?; ?), an assumption that cannot be verified in actual applications. The theoretical results suggest that only a sufficient number of similarity comparisons can guarantee the prediction result. However, the cost of eliciting relative similarity comparisons from human beings can be prohibitive. The usual channels for collecting relative similarity comparisons, e.g., crowdsourcing and human computation, require passively waiting for participants and paying them to obtain the desired information. Without prior knowledge, the relative comparisons always involve all objects, and the number of possible comparisons could be $\boldsymbol{\Theta}(n^{4})$ . The cost of data collection presents ordinal embedding methods with a dilemma: insufficient samples limit the potential performance, while collecting enough samples for good results is cumbersome. Unfortunately, most traditional methods ignore the fact that generalization is the main concern in the ordinal embedding task with insufficient samples.

In this paper, we propose a new method, named Distributional Margin based Ordinal Embedding (DMOE), which tries to achieve strong generalization performance by optimizing the margin distribution in ordinal embedding problem. Inspired by the recent results in classification (?; ?), we define the margin of ordinal embedding and characterize the margin distribution by the first- and second-order statistics, and try to maximize the margin mean and minimize the margin variance simultaneously. For optimization, we propose an alternating direction method of multipliers (ADMM) for DMOE with semi-definite and low-rank constraints. Comprehensive experiments on the synthetic and real-world datasets show the superiority of our method to other ordinal embedding algorithms, verifying that the margin distribution is more crucial for generalization than minimum margin.

Problem Definition

Throughout the paper, scalars, vectors, matrices and sets are denoted as lowercase letters ( $x$ ), bold lower case letters ( $\boldsymbol{x}$ ), bold capital letters ( $\boldsymbol{X}$ ) and calligraphy uppercase letters ( $\mathcal{X}$ ). $x_{ij}$ denotes the $(i,j)$ entry of $\boldsymbol{X}$ . $[n]$ is the set of $\{1,\dots,n\}$ . $\mathbb{E}(\cdot)$ represents the expectation.

Suppose $\mathcal{O}=\{\boldsymbol{o}_{1},\dots,\boldsymbol{o}_{n}\}$ is a set of $n$ objects, we assume that a certain but unknown similarity function $\zeta:\mathcal{O}\times\mathcal{O}\rightarrow\mathbb{R}^{+}$ assigns similarity value $\zeta_{ij}$ for a pair of objects $(\boldsymbol{o}_{i},\boldsymbol{o}_{j})$ . With similarity function $\zeta$ , a quadruplet $q=(i,j,l,k)$ defines the corresponding ordinal constraint, and these constraints lead to the ordinal embedding problem.

Definition 1 (Ordinal Constraints).

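Given a set of quadruplets

$$\mathcal{Q}=\{q=(i,j,l,k)\mid i\neq j,\ l\neq k,\ i,j,l,k\in[n]\}$$

which is a subset of $[n]^{4}$ , the ordinal constraints $\mathcal{Y}_{\mathcal{Q}}=\{y_{q}|q\in\mathcal{Q}\}\subset\{-1,+1\}^{|\mathcal{Q}|}$ imply the similarity partial order of object pairs in $\mathcal{O}$ as

$$y_{q}=\begin{cases}+1, & \text{if } \zeta_{ij}>\zeta_{lk},\\ -1, & \text{if } \zeta_{ij}<\zeta_{lk}.\end{cases}$$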

Our goal here is to obtain a set of embedded points $\boldsymbol{X}$ which satisfy the ordinal constraints $\mathcal{Q}$ . Without prior knowledge, embedding $\mathcal{O}$ into a Euclidean space $\mathbb{R}^{p}$ is the most common situation which assumes that the squared Euclidean distances among embedded points are inversely proportional to the unknown similarity values. Specifically, a large distance of two embedded points $d^{2}_{ij}=\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}_{2}$ means the corresponding objects $\boldsymbol{o}_{i}$ and $\boldsymbol{o}_{j}$ would have small similarity value $\zeta_{ij}$ . This assumption connects the squared Euclidean distances of $\boldsymbol{X}$ and ordinal constraints $\mathcal{Q}$ . We further give the formal definition of ordinal embedding.

Definition 2 (Ordinal Embedding).

Suppose $\mathcal{Q}$ is a collection of quadruplets drawn independently and uniformly at random, and $\mathcal{Y}_{\mathcal{Q}}$ is the corresponding set of ordinal constraints on the object set $\mathcal{O}$ . Let $\boldsymbol{X}=\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}\in\mathbb{R}^{p\times n}$ be the desired embedding in the Euclidean space $\mathbb{R}^{p}$ where $p\ll n$ , and let $\boldsymbol{D}=\{d^{2}_{ij}\}\in\mathbb{R}^{n\times n}$ be the squared Euclidean distance matrix of the embedding $\boldsymbol{X}$ . Ordinal embedding is the problem of obtaining $\boldsymbol{X}$ with ordinal constraints $\mathcal{Y}_{\mathcal{Q}}$ on $\boldsymbol{D}$ such that

$$\textit{sign}(y_{q}\cdot\Delta_{q}\boldsymbol{D})>0,\quad\forall\ y_{q}\in\mathcal{Y}_{\mathcal{Q}},$$

where

$$\Delta_{q}\boldsymbol{D}=d^{2}_{ij}-d^{2}_{lk}=\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|^{2}_{2}-\|\boldsymbol{x}_{l}-\boldsymbol{x}_{k}\|^{2}_{2}.$$

Note that one cannot consistently estimate the underlying embedding $\boldsymbol{X}$ with only ordinal supervision and without direct observations. In the case when no direct measurements are available, say the metric information of $\mathcal{O}$ as the input, the underlying embedding $\boldsymbol{X}$ is only identifiable up to certain monotonic transformations, e.g., rotation, reflection, translation, and scaling. Therefore, the sign consistency is adopted as the goal of ordinal embedding.

By the above definition, the margin of instance $(\boldsymbol{X}_{q},y_{q})$ can be naturally defined as

$$\gamma_{q}=y_{q}\cdot\Delta_{q}\boldsymbol{D},$$

where $\boldsymbol{X}_{q}=\{\boldsymbol{x}_{i},\boldsymbol{x}_{j},\boldsymbol{x}_{l},\boldsymbol{x}_{k}\},\ \forall\ q\in\mathcal{Q}$ .
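The margin definition can be illustrated with a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the helper names `squared_distance` and `margin` and the toy one-dimensional embedding are assumptions introduced here for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def margin(X, q, y_q):
    """Margin gamma_q = y_q * (d_ij^2 - d_lk^2) for a quadruplet q = (i, j, l, k).

    X is a list of points (one per object); y_q is the label in {-1, +1}.
    """
    i, j, l, k = q
    delta = squared_distance(X[i], X[j]) - squared_distance(X[l], X[k])
    return y_q * delta


# Toy embedding of four objects on a line (hypothetical data).
X = [(0.0,), (3.0,), (1.0,), (2.0,)]
q = (0, 1, 2, 3)  # compares the pair (0, 1) against the pair (2, 3)

print(margin(X, q, +1))  # d_01^2 - d_23^2 = 9 - 1, so the margin is positive
print(margin(X, q, -1))  # the opposite label flips the sign of the margin
```

A positive margin means the embedding agrees in sign with the label, which is exactly the sign-consistency goal stated above.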

Despite the close relationship between $\boldsymbol{D}$ and $\boldsymbol{X}$ , $\Delta_{q}\boldsymbol{D}$ is a nonlinear function of $\boldsymbol{X}$ and it always leads to a non-convex optimization problem. Here we introduce the Gram matrix of $\boldsymbol{X}$ and construct a margin function as a linear function of Gram matrix. Firstly, a map is established to connect the distance matrix $\boldsymbol{D}$ and the Gram matrix $\boldsymbol{G}=\boldsymbol{X}^{\top}\boldsymbol{X}$ :

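$$\boldsymbol{D}=\textit{diag}(\boldsymbol{G})\boldsymbol{1}^{\top}-2\boldsymbol{G}+\boldsymbol{1}\,\textit{diag}(\boldsymbol{G})^{\top},$$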

where $\textit{diag}(\boldsymbol{G})$ is the column vector composed of the diagonal entries of $\boldsymbol{G}$ and $\boldsymbol{1}$ is the $n$ -dimension vector with all entries being $1$ . With a little abuse of $\Delta_{q}$ , the margin of instance $(\boldsymbol{X}_{q},y_{q})$ can be written as

$$\gamma_{q}=y_{q}\cdot\Delta_{q}\boldsymbol{G}.\tag{4}$$
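The linearity of the margin in the Gram matrix can be checked numerically. The sketch below assumes the standard identity $\boldsymbol{D}=\textit{diag}(\boldsymbol{G})\boldsymbol{1}^{\top}-2\boldsymbol{G}+\boldsymbol{1}\,\textit{diag}(\boldsymbol{G})^{\top}$ for the map between $\boldsymbol{G}$ and $\boldsymbol{D}$; the function names `gram_to_sqdist` and `delta_q` and the random test data are illustrative choices, not taken from the paper.

```python
import numpy as np


def gram_to_sqdist(G):
    """Squared-distance matrix D = diag(G) 1^T - 2 G + 1 diag(G)^T."""
    g = np.diag(G).reshape(-1, 1)  # column vector of the diagonal entries
    return g - 2.0 * G + g.T       # broadcasting plays the role of the 1 vectors


def delta_q(G, q):
    """Delta_q G = d_ij^2 - d_lk^2, written as a linear function of entries of G."""
    i, j, l, k = q
    d_ij = G[i, i] - 2.0 * G[i, j] + G[j, j]
    d_lk = G[l, l] - 2.0 * G[l, k] + G[k, k]
    return d_ij - d_lk


rng = np.random.default_rng(0)   # arbitrary seed
X = rng.normal(size=(3, 6))      # p = 3, n = 6; columns are embedded points
G = X.T @ X                      # Gram matrix of the embedding
D = gram_to_sqdist(G)

q = (0, 1, 2, 3)
direct = np.sum((X[:, 0] - X[:, 1]) ** 2) - np.sum((X[:, 2] - X[:, 3]) ** 2)
print(np.isclose(delta_q(G, q), direct), np.isclose(D[0, 1] - D[2, 3], direct))
```

Because `delta_q` only adds and subtracts entries of `G`, the margin is linear in the Gram matrix even though it is quadratic in the coordinates of the embedding.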

By the definition of (4), ordinal embedding can be formulated as the following convex optimization problem.

Definition 3 (The Margin-based Ordinal Embedding).

Let $l:\mathbb{R}^{+}\rightarrow\mathbb{R}$ be a loss function which satisfies

$$l(x)\ \begin{cases}>0, & \text{if } x<0,\\ \leq 0, & \text{if } x>0.\end{cases}$$

Given the ordinal constraints $\mathcal{Y}_{\mathcal{Q}}$ , the ordinal embedding problem can be formulated as a semi-definite program over the Gram matrix $\boldsymbol{G}$ :

[TABLE]

where $L(\boldsymbol{G},\mathcal{Y}_{\mathcal{Q}})=\frac{1}{|\mathcal{Q}|}\underset{q\in\mathcal{Q}}{\sum}\ \ l(\gamma_{q})$ .

We note that (6) is a semi-definite program (SDP), and $\boldsymbol{G}\succeq 0$ comes from the fact that $\boldsymbol{G}=\boldsymbol{X}^{\top}\boldsymbol{X}$ is a positive semi-definite matrix. Furthermore, the desired embedding dimension $p$ is a parameter of the ordinal embedding. It is well known that there exists a perfect embedding $\boldsymbol{X}$ estimated by any label set $\mathcal{Y}$ on the Euclidean distances in $\mathbb{R}^{n-2}$ , even for noisy constraints (?). However, the low-dimensional setting where $p\ll n$ is the main focus of this work. Finding the smallest $p$ for noisy ordinal constraints $\mathcal{Y}_{\mathcal{Q}}$ is a future direction worth pursuing. The choice of $p$ in the experiment section depends on the potential applications.

For example, the Generalized Non-metric Multidimensional Scaling (GNMDS) follows the SVM formulation to obtain the $\boldsymbol{G}$ by solving

[TABLE]

where $\gamma_{0}$ is a relaxed minimum margin and $\boldsymbol{\xi}=\{\xi_{q}\}_{q\in\mathcal{Q}}$ is the slack variable.

Distributional Margin based Embedding

The relaxed minimum margin $\gamma_{0}$ in GNMDS characterizes only the smallest margins among all instances $\{\boldsymbol{X}_{q},y_{q}\}_{q\in\mathcal{Q}}$ . In the margin theory of classification, it is known that maximizing the minimum margin of training examples is not sufficient to achieve satisfactory generalization performance (?). The margin distribution of training examples, rather than the minimum margin, is more crucial to generalization performance in classification (?; ?).

Formulation

The two most usual statistics for characterizing the margin distribution are the first- and second-order statistics, that is, the mean and the variance of margin. According to (4), the margin mean of training samples $\{(\boldsymbol{X}_{q},y_{q})\}$ is

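$$\bar{\gamma}=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\gamma_{q},\tag{8}$$

and the margin variance is

$$\hat{\gamma}=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}(\gamma_{q}-\bar{\gamma})^{2}.\tag{9}$$

Intuitively, we attempt to maximize the margin mean and minimize the margin variance simultaneously in the ordinal embedding problem (6).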
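To make the two statistics concrete, the following sketch computes the margin mean and the (population) margin variance for two hypothetical sets of margins. The function name `margin_statistics` and the numbers are illustrative assumptions, not data from the paper.

```python
def margin_statistics(margins):
    """Return (mean, variance) of a list of margins, with variance normalised by |Q|."""
    n = len(margins)
    mean = sum(margins) / n
    variance = sum((g - mean) ** 2 for g in margins) / n
    return mean, variance


# Two hypothetical margin sets with the same mean but different spread.
tight = [1.0, 1.0, 1.0, 1.0]
spread = [0.1, 0.1, 1.9, 1.9]

print(margin_statistics(tight))   # zero variance: every margin sits at the mean
print(margin_statistics(spread))  # same mean, larger variance, smaller minimum margin
```

The two sets share the same margin mean, yet the second has both a much smaller minimum margin and a larger variance, which is the kind of difference the margin distribution is meant to capture.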

First, a straightforward way to achieve our goal is to include the margin mean (8) and the margin variance (9) in (6) explicitly. Although (6) can adopt different loss functions, we focus on the SVM formulation (7) because the hinge loss is a natural form of margin. Considering the margin distribution, the optimization problem (7) can be formulated as

[TABLE]

where $\lambda_{1}$ and $\lambda_{2}$ are the trade-off parameters for balancing the impacts of $\bar{\gamma}$ and $\hat{\gamma}$ . It is apparent that GNMDS (7) is a degenerate case of (10) when $\lambda_{1}$ and $\lambda_{2}$ are both equal to $0$ . However, directly optimizing the margin distribution in (10) has an obvious drawback: tuning the parameters $\lambda_{1}$ and $\lambda_{2}$ is an obstacle to solving (10) efficiently. Therefore, a new lightweight formulation is proposed to optimize the margin distribution implicitly.

Recall that SVM fixes the minimum margin as $1$ by scaling the margin with the norm of the linear predictor. In a similar way, we can scale the margin of $(\boldsymbol{X}_{q},y_{q})$ in ordinal embedding (4) and set the margin mean to a constant. This does not result in a sub-optimal solution because the ordinal constraints $\mathcal{Y}_{\mathcal{Q}}$ can only determine an embedding $\boldsymbol{X}$ up to monotonic transformations. Without loss of generality, the mean of $\boldsymbol{\gamma}_{\mathcal{Q}}=\{\gamma_{q}|q\in\mathcal{Q}\}$ can be set to a constant, and an equality constraint is imposed:

$$\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\gamma_{q}=\tilde{\gamma}_{0}.\tag{11}$$

On the other hand, we want to minimize the variance of $\boldsymbol{\gamma}_{\mathcal{Q}}$ . By (11), the deviation of $\gamma_{q}$ from the margin mean $\tilde{\gamma}_{0}$ is $|\gamma_{q}-\tilde{\gamma}_{0}|$ , and we force the deviation to be smaller than $\varepsilon_{q}\geq 0$ as

$$|\gamma_{q}-\tilde{\gamma}_{0}|\leq\varepsilon_{q},\quad\forall\ q\in\mathcal{Q}.\tag{12}$$

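Thus, minimizing $\varepsilon_{q}$ is equivalent to minimizing the margin variance (9). Meanwhile, (12) implies the margin mean constraint (11).

Constraints like (12) in optimization problems always involve two inequalities, $\forall\ q\in\mathcal{Q}$ ,

$$\gamma_{q}-\tilde{\gamma}_{0}\leq\varepsilon_{q},\tag{13a}$$

$$\tilde{\gamma}_{0}-\gamma_{q}\leq\varepsilon_{q}.\tag{13b}$$

Note that the soft-margin constraint

$$\gamma_{q}\geq\tilde{\gamma}_{0}-\xi_{q}\tag{14}$$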

plays the same role as (13b). Replacing (13b) with (14) and adding (13) into (10), we arrive at the following formulation

[TABLE]

This optimization problem corresponds to dealing with such a loss function, $\forall\ q\in\mathcal{Q}$

$$\ell_{\tilde{\gamma}_{0},\nu}(\gamma_{q})=\max(\tilde{\gamma}_{0}-\gamma_{q},0)+\nu\cdot\max(\gamma_{q}-\tilde{\gamma}_{0},0).\tag{16}$$

The trade-off parameter $\nu$ in (15) can capture the asymmetry between the sign correctness and the dispersion of $\{\gamma_{q}\}_{q\in\mathcal{Q}}$ . When $\nu=1$ and the semi-definite and rank constraints are ignored, (15) is similar to support vector regression (SVR) (?). In SVR, the $\tau$ -insensitive loss

$$|\zeta|_{\tau}=\begin{cases}0, & \text{if } |\zeta|\leq\tau,\\ |\zeta|-\tau, & \text{otherwise}\end{cases}$$

produces two slack variables $\xi$ and $\xi^{*}$ for each training example to guard against outliers. When $\nu=1$ , the loss function (16) is exactly the loss function adopted by SVR with $\tau=0$ . In our formulation (15), $\xi_{q}$ and $\varepsilon_{q}$ impose constraints similar to those of SVR but have totally different meanings. All the training examples are used to learn the margin distribution in (15), whereas the optimal solution of SVR is spanned only by the support vectors, which are sparse in the training data. Figure 1 depicts the situations of the learned margin distribution graphically. Some theoretical results are provided at the end of this section.
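The relation between the loss (16) and the $\tau$-insensitive loss can be verified directly. This is a small sketch with hypothetical function names (`dmoe_loss`, `tau_insensitive`) and arbitrary example values for the target margin and $\nu$.

```python
def dmoe_loss(gamma_q, gamma_0, nu):
    """Loss (16): hinge below the target margin plus nu-weighted excess above it."""
    return max(gamma_0 - gamma_q, 0.0) + nu * max(gamma_q - gamma_0, 0.0)


def tau_insensitive(zeta, tau):
    """SVR tau-insensitive loss: zero inside the tube, linear outside."""
    return max(abs(zeta) - tau, 0.0)


gamma_0 = 1.0  # target margin mean (arbitrary example value)

# With nu < 1, exceeding the target is penalised less than falling short of it.
print(dmoe_loss(0.5, gamma_0, nu=0.5), dmoe_loss(1.5, gamma_0, nu=0.5))

# With nu = 1 the loss is the absolute deviation, i.e. the SVR loss with tau = 0.
for gamma_q in (-2.0, 0.25, 1.0, 3.5):
    assert dmoe_loss(gamma_q, gamma_0, nu=1.0) == tau_insensitive(gamma_q - gamma_0, 0.0)
```

Choosing $\nu<1$ makes the penalty asymmetric, so a margin below $\tilde{\gamma}_{0}$ costs more than a margin above it by the same amount.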

Theorem 1.

Suppose that

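$$\mathcal{G}_{\mu}=\{\boldsymbol{G}\in\mathbb{S}^{n}_{+}:\|\boldsymbol{G}\|_{\infty}\leq\mu,\ \|\boldsymbol{G}\|_{*}\leq\lambda\}$$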

and the true Gram matrix $\boldsymbol{G}^{*}\in\mathcal{G}_{\mu}$ . Let $\hat{\boldsymbol{G}}$ be a solution of (15). With probability at least $1-\delta$ , it holds that

[TABLE]

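where $R(\cdot)$ is the risk, defined for any $\boldsymbol{G}\in\mathcal{G}_{\mu}$ as

$$R(\boldsymbol{G})=\mathbb{E}[\ell(\gamma_{q})]=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\big[p_{q}\,\ell(\gamma_{q})+(1-p_{q})\,\ell(-\gamma_{q})\big],$$

with $p_{q}=\mathbb{P}(y_{q}=1)$ . Here the expectation is taken with respect to both the uniformly random selection of the quadruplet $q$ and its label $y_{q}$ .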

Theorem 1 says that $|\mathcal{Q}|$ must scale like $\Theta(pn\log n)$ which leads to the bounded error $R(\hat{\boldsymbol{G}})-R(\boldsymbol{G}^{*})$ . This result is consistent with known finite sample bounds (?). The details are provided in the supplementary materials. The generalization bound of margin distribution is a future direction.

Optimization

Since $\Delta_{q}$ is a linear operator on $\mathbb{S}^{n}_{+}$ , the positive semi-definite cone of $n\times n$ symmetric matrices, it can be represented by a symmetric $n\times n$ matrix. Given an ordinal constraint $q=(i,j,l,k)$ , $\boldsymbol{K}_{q}$ is the matrix form of $\Delta_{q}$ where

[TABLE]

and

[TABLE]

With the trick that

[TABLE]

we note

[TABLE]

where $\boldsymbol{y}_{\mathcal{Q}}=[y_{1},\dots,y_{|\mathcal{Q}|}]^{\top}\in\{-1,+1\}^{|\mathcal{Q}|}$ , $\boldsymbol{\Gamma}_{0}$ is a $|\mathcal{Q}|$ -dimensional vector with all entries equal to $\tilde{\gamma}_{0}$ , and $\odot$ is the Hadamard product. Furthermore, we introduce redundant variables to make the objective separable, so that the problem can be solved efficiently within the ALM framework. The optimization is converted into:

[TABLE]

where $\mathcal{T}=\{\boldsymbol{G},\boldsymbol{G}_{1},\boldsymbol{G}_{2},\boldsymbol{e}_{1},\boldsymbol{e}_{2}\}$ is the set of all the parameters to be solved and $\|\cdot\|_{*}$ is the nuclear norm which is the convex surrogate of matrix rank constraints. It is worth mentioning that (23) is a convex optimization problem as the feasible set of each constraint is a convex set and the objective function is convex. The Lagrange function of (23) can be written in the following form:

[TABLE]

with

[TABLE]

$\|\cdot\|$ is $\ell_{2}$ norm for vector and the Frobenius norm for matrix. In addition, $\boldsymbol{z}_{1},\boldsymbol{z}_{2}\in\mathbb{R}^{|\mathcal{Q}|}$ and $\boldsymbol{Z}_{3},\boldsymbol{Z}_{4}\in\mathbb{R}^{n\times n}$ are Lagrange multipliers. $\delta$ is the Dirac delta function whose function value would be infinity if the condition is not satisfied. Below are the solutions to each sub-problem.

**$\boldsymbol{e}_{1}$ sub-problem.** With the variables unrelated to $\boldsymbol{e}_{1}$ fixed, we have the sub-problem of $\boldsymbol{e}_{1}$ :

[TABLE]

where

[TABLE]

It’s worth noting that $(\cdot)_{+}$ is a piece-wise linear function. Thus, to seek the minimum of each element in $\boldsymbol{e}_{1}$ , we just need to pick the smaller value between $y_{q}e^{1}_{q}$ and [math]. The solution of (25) is

[TABLE]

where $\Omega:\mathbb{R}^{\mathcal{Q}}\rightarrow\mathbb{R}^{\mathcal{Q}}$ is an indicator function as $[\Omega(\boldsymbol{w})]_{q}=\mathbb{I}(w_{q}>0)\cdot w_{q},\ \boldsymbol{w}\in\mathbb{R}^{\mathcal{Q}}$ and $\bar{\Omega}$ is the complementary support of $\Omega$ . The definition of shrinkage operator on scalars is $\mathcal{S}_{\tau>0}[u]=\textit{sign}(u)(|u|-\tau)_{+}$ and it is an element-wise operator for vector and matrix.
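The two element-wise operators defined above are simple to state in code. The sketch below uses hypothetical names (`shrink`, `positive_part`) and only illustrates the scalar definitions $\mathcal{S}_{\tau}[u]=\textit{sign}(u)(|u|-\tau)_{+}$ and $[\Omega(\boldsymbol{w})]_{q}=\mathbb{I}(w_{q}>0)\cdot w_{q}$; it does not reproduce the full update (26).

```python
def shrink(u, tau):
    """Shrinkage (soft-thresholding) operator S_tau[u] = sign(u) * max(|u| - tau, 0)."""
    sign = (u > 0) - (u < 0)
    return sign * max(abs(u) - tau, 0.0)


def positive_part(w):
    """Element-wise Omega: keep entries that are strictly positive, zero the rest."""
    return [w_q if w_q > 0 else 0.0 for w_q in w]


# Applying the operators element-wise to small example vectors.
print([shrink(u, 1.0) for u in (3.0, 0.5, -3.0)])
print(positive_part([2.0, -1.0, 0.0, 0.5]))
```

Values whose magnitude is at most $\tau$ are mapped to zero by `shrink`, while larger values are pulled toward zero by exactly $\tau$.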

**$\boldsymbol{e}_{2}$ sub-problem.** Similarly, picking out the terms related to $\boldsymbol{e}_{2}$ gives the following sub-problem:

[TABLE]

where

[TABLE]

and the solution of the $\boldsymbol{e}_{2}$ sub-problem is obtained by replacing $\boldsymbol{s}^{(t)}_{1}$ with $\boldsymbol{s}^{(t)}_{2}$ in (26).

**$\boldsymbol{G}$ sub-problem.** Dropping the terms independent of $\boldsymbol{G}$ leads to the following problem:

[TABLE]

We have

[TABLE]

Denoting the right-hand side by $\boldsymbol{w}$ , we have

[TABLE]

and $\boldsymbol{G}^{(t+1)}$ is the matrix form of $\textit{vec}(\boldsymbol{G})^{(t+1)}$ .

**$\boldsymbol{G}_{1}$ sub-problem.** There are two terms in (24) involving $\boldsymbol{G}_{1}$ . The associated optimization problem of $\boldsymbol{G}_{1}$ is

[TABLE]

and solving this problem yields

[TABLE]

where

[TABLE]

and $\mathcal{S}_{\tau}(\cdot)$ is the shrinkage operator.

**$\boldsymbol{G}_{2}$ sub-problem.** Considering the potential asymmetry of $\boldsymbol{G}^{(t)}$ , we claim that $\boldsymbol{G}_{2}$ is the nearest symmetric positive semi-definite matrix to $\boldsymbol{G}^{(t)}$ in the Frobenius norm (?). By the following theorem, we show the explicit solution of $\boldsymbol{G}^{(t)}_{2}$ .

Theorem 2.

Suppose that $\boldsymbol{A}\in\mathbb{R}^{n\times n}$ , and let $\boldsymbol{B}=(\boldsymbol{A}+\boldsymbol{A}^{\top})/2$ , $\boldsymbol{C}=(\boldsymbol{A}-\boldsymbol{A}^{\top})/2$ be the symmetric and skew-symmetric parts of $\boldsymbol{A}$ respectively. If the polar decomposition of $\boldsymbol{B}$ is $\boldsymbol{B}=\boldsymbol{UH}$ , where $\boldsymbol{U}$ is an orthogonal matrix ( $\boldsymbol{UU}^{\top}=\boldsymbol{I}$ ) and $\boldsymbol{H}$ is a positive semi-definite matrix, then $\boldsymbol{X}_{F}=(\boldsymbol{B}+\boldsymbol{H})/2$ is the unique nearest positive semi-definite matrix to $\boldsymbol{A}$ in the Frobenius norm, and the distance $\rho_{F}(\boldsymbol{A})$ in the Frobenius norm from $\boldsymbol{A}$ to $\mathbb{S}^{n}_{+}$ is

[TABLE]

where $\sigma_{i}(\boldsymbol{B}),\ i=1,\dots,n$ are the eigenvalues of $\boldsymbol{B}$ .

Consequently, the explicit solution of $\boldsymbol{G}_{2}^{(t)}$ is

[TABLE]

where $\sqrt{\boldsymbol{A}}$ is the square root of $\boldsymbol{A}\in\mathbb{S}^{n}_{+}$ , $\boldsymbol{A}=\boldsymbol{VSV}^{-1}$ and $\sqrt{\boldsymbol{A}}=\boldsymbol{V}\boldsymbol{S}^{\frac{1}{2}}\boldsymbol{V}^{-1}$ , $\boldsymbol{S}$ is a diagonal matrix and $\boldsymbol{S}^{\frac{1}{2}}$ is element-wise square root of $\boldsymbol{S}$ .
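The construction in Theorem 2 can be sketched with NumPy. The sketch assumes that, for the symmetric matrix $\boldsymbol{B}$ with eigendecomposition $\boldsymbol{B}=\boldsymbol{V}\boldsymbol{\Lambda}\boldsymbol{V}^{\top}$, the positive semi-definite polar factor is $\boldsymbol{H}=\boldsymbol{V}|\boldsymbol{\Lambda}|\boldsymbol{V}^{\top}$; the function name `nearest_psd` and the example matrix are illustrative and not taken from the released code.

```python
import numpy as np


def nearest_psd(A):
    """Nearest symmetric PSD matrix to A in Frobenius norm, X_F = (B + H) / 2."""
    B = (A + A.T) / 2.0                  # symmetric part of A
    eigvals, V = np.linalg.eigh(B)       # B = V diag(eigvals) V^T
    H = (V * np.abs(eigvals)) @ V.T      # PSD polar factor of B
    return (B + H) / 2.0                 # equivalent to zeroing negative eigenvalues


A = np.array([[2.0, 1.0], [-3.0, -1.0]])  # asymmetric and indefinite example
X_F = nearest_psd(A)
print(np.linalg.eigvalsh(X_F))            # eigenvalues of the projection
```

Since $(\lambda+|\lambda|)/2=\max(\lambda,0)$, averaging $\boldsymbol{B}$ and $\boldsymbol{H}$ keeps the non-negative eigenvalues of $\boldsymbol{B}$ and sets the negative ones to zero.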

For clarity, the procedure of solving (23) is outlined in Algorithm 1. The algorithm terminates once the change of the objective value between two successive iterations is smaller than a threshold (in the experiments, $0.001$ is the default setting).

Empirical Study

In this section, we show the results of simulations and real-world data experiments to demonstrate the effectiveness of the proposed algorithm. As the existing margin-based ordinal embedding methods, such as GNMDS, STE and TSTE, use only triple-wise comparisons as the ordinal constraints, we also use triple-wise comparisons as the input of the proposed algorithm for a fair comparison. The set of triple-wise comparisons $\mathcal{T}=\{(i,j,k)\}$ is a special case of quadruplets in which $l=i$ in $q=(i,j,l,k)\in\mathcal{Q}$ . The $\Delta_{p}$ of triplet $t=(i,j,k)$ is also a symmetric $n\times n$ matrix indicated by $(i,j,k)$ as

[TABLE]

Replacing $\boldsymbol{K}_{q}$ with $\boldsymbol{K}_{t}$ in those sub-problems in Algorithm 1, the proposed DMOE method can handle the triplet set $\mathcal{T}$ as the ordinal constraints. The reproducible code can be found at https://github.com/alphaprime/DMOE.
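One way to build such matrices is sketched below. It is an assumption of this sketch, not a transcription of the paper's definition, that the matrix form is fixed by the requirement $\langle\boldsymbol{K}_{q},\boldsymbol{G}\rangle=d^{2}_{ij}-d^{2}_{lk}$; the function names `comparison_matrix` and `triplet_matrix` are hypothetical.

```python
import numpy as np


def comparison_matrix(n, q):
    """Symmetric K_q with <K_q, G> = d_ij^2 - d_lk^2 for q = (i, j, l, k)."""
    i, j, l, k = q
    K = np.zeros((n, n))
    # Each pair (a, b) contributes s * (G_aa - 2 G_ab + G_bb) to the inner product.
    for a, b, s in ((i, j, 1.0), (l, k, -1.0)):
        K[a, a] += s
        K[b, b] += s
        K[a, b] -= s
        K[b, a] -= s
    return K


def triplet_matrix(n, t):
    """K_t for a triplet t = (i, j, k), i.e. the quadruplet (i, j, i, k)."""
    i, j, k = t
    return comparison_matrix(n, (i, j, i, k))


rng = np.random.default_rng(0)  # arbitrary seed
X = rng.normal(size=(2, 5))     # p = 2, n = 5; columns are points
G = X.T @ X

K_t = triplet_matrix(5, (0, 1, 2))
d01 = np.sum((X[:, 0] - X[:, 1]) ** 2)
d02 = np.sum((X[:, 0] - X[:, 2]) ** 2)
print(np.isclose(np.sum(K_t * G), d01 - d02))  # inner product equals d_01^2 - d_02^2
```

Entries are accumulated with `+=` so that the shared index $i$ of a triplet is handled correctly: its diagonal contributions from the two pairs cancel.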

Simulation

**Settings.** The synthesized dataset consists of $100$ points $\{\boldsymbol{x}_{i}\}_{i=1}^{100}\subset\mathbb{R}^{10}$ , where $\boldsymbol{x}_{i}\sim\mathcal{N}(\boldsymbol{0},\frac{1}{20}\boldsymbol{I})$ and $\boldsymbol{I}\in\mathbb{R}^{10\times 10}$ is the identity matrix. The possible similarity triple comparisons are generated based on the Euclidean distances between $\{\boldsymbol{x}_{i}\}$ . We randomly sample $|\mathcal{T}|\in\{200, 500, 1000, 10000\}$ triplets as the training set, and the test set consists of all remaining triplets. The embedding dimension is fixed to $10$ .
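The data generation described above might be sketched as follows. The sketch makes several assumptions that the text does not state: a triplet $(i,j,k)$ is read as "$i$ is closer to $j$ than to $k$", every anchor is paired with every unordered pair of other points, and the seed and function names (`all_triplets`, `sample_split`) are arbitrary.

```python
import numpy as np
from itertools import combinations


def all_triplets(X):
    """Enumerate every triplet (i, j, k), ordered so that d_ij < d_ik.

    X holds one point per row. Ties are not handled specially; they occur
    with probability zero for Gaussian data.
    """
    n = X.shape[0]
    sq_norms = np.sum(X ** 2, axis=1)
    D = sq_norms[:, None] - 2.0 * X @ X.T + sq_norms[None, :]
    triplets = []
    for i in range(n):
        others = [m for m in range(n) if m != i]
        for j, k in combinations(others, 2):
            triplets.append((i, j, k) if D[i, j] < D[i, k] else (i, k, j))
    return triplets


def sample_split(triplets, n_train, rng):
    """Random split: n_train triplets for training, the rest held out for testing."""
    order = rng.permutation(len(triplets))
    train = [triplets[t] for t in order[:n_train]]
    test = [triplets[t] for t in order[n_train:]]
    return train, test


rng = np.random.default_rng(0)                             # arbitrary seed
X = rng.normal(scale=np.sqrt(1.0 / 20.0), size=(100, 10))  # x_i ~ N(0, I / 20)
triplets = all_triplets(X)
train, test = sample_split(triplets, 200, rng)
print(len(triplets), len(train), len(test))
```

Under this enumeration there are $100\cdot\binom{99}{2}$ candidate triplets, so even the largest training set in the simulation covers only a small fraction of them.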

**Evaluation Metrics.** We employ the generalization error to evaluate the generalization ability of the various algorithms. As the Gram matrix $\boldsymbol{G}$ learned from a partial set of triple comparisons $\mathcal{T}\subset[n]^{3}$ may generalize to unknown triplets, the generalization error of the learned embedding is the percentage of held-out triplets that are not satisfied by $\boldsymbol{G}$ .
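The metric can be computed directly from a Gram matrix. In this sketch the triplet convention ("$i$ is closer to $j$ than to $k$"), the function name `generalization_error`, and the toy one-dimensional example are assumptions made for illustration.

```python
def generalization_error(G, triplets):
    """Fraction of triplets (i, j, k) that G violates.

    A triplet is satisfied when d_ij^2 < d_ik^2, with squared distances read
    off the Gram matrix as d_ab^2 = G_aa - 2 G_ab + G_bb.
    """
    violated = 0
    for i, j, k in triplets:
        d_ij = G[i][i] - 2.0 * G[i][j] + G[j][j]
        d_ik = G[i][i] - 2.0 * G[i][k] + G[k][k]
        if not d_ij < d_ik:
            violated += 1
    return violated / len(triplets)


# Toy example: three points on a line at positions 0, 1 and 3.
x = [0.0, 1.0, 3.0]
G = [[a * b for b in x] for a in x]  # Gram matrix of the 1-D embedding

held_out = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (2, 1, 0)]
print(generalization_error(G, held_out))  # one of the four triplets is violated
```

The function indexes `G` as `G[i][j]`, so it accepts nested lists as well as NumPy arrays.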

**Competitors.** We compare the proposed algorithm with three well-known ordinal embedding methods: GNMDS (Agarwal et al. 2007), STE and TSTE (?). Note that we adopt the optimization strategy proposed by (?), which performs gradient descent with line search and, at each step, projects the Gram matrix onto the subspace spanned by the eigenvectors of its top $p$ eigenvalues (i.e. setting the smallest $n-p$ eigenvalues to $0$). We call the three competitors GNMDS-$p$, STE-$p$ and TSTE-$p$, respectively. The optimization problem of GNMDS is (7). STE replaces the hinge loss in (7) with the logistic loss and adopts a Gaussian kernel to predict the label:

[TABLE]

where $\gamma^{(t)}_{p}=\Delta_{p}\boldsymbol{G}^{(t)}$ . TSTE employs the heavy-tailed Student-t kernel:

[TABLE]
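For orientation, the sketch below evaluates the standard triplet probabilities of STE and t-STE as they are usually stated in the literature, written in terms of squared distances. This is an illustration under that assumption: the notation of the two equations above, which are expressed through $\gamma^{(t)}_{p}$, may differ, and the function names and the default `alpha` are hypothetical.

```python
import math


def ste_prob(d_ij_sq, d_ik_sq):
    """STE probability that i is closer to j than to k (Gaussian kernel).

    Equals exp(-d_ij^2) / (exp(-d_ij^2) + exp(-d_ik^2)), written as a
    logistic function of the difference of squared distances.
    """
    return 1.0 / (1.0 + math.exp(d_ij_sq - d_ik_sq))


def student_kernel(d_sq, alpha):
    """Heavy-tailed Student-t kernel with alpha degrees of freedom."""
    return (1.0 + d_sq / alpha) ** (-(alpha + 1.0) / 2.0)


def tste_prob(d_ij_sq, d_ik_sq, alpha=1.0):
    """t-STE probability that i is closer to j than to k."""
    k_ij = student_kernel(d_ij_sq, alpha)
    k_ik = student_kernel(d_ik_sq, alpha)
    return k_ij / (k_ij + k_ik)
```

Because the Student-t kernel decays polynomially rather than exponentially, t-STE assigns less extreme probabilities than STE to triplets whose two distances differ greatly.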

The regularization parameters of the competitors are tuned for the best performance under the different settings.
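The eigenvalue truncation used by the three competitors can be sketched with NumPy as follows, assuming the truncated eigenvalues are set to zero. The function name `project_top_p` is hypothetical, and the sketch does not clip negative eigenvalues among the retained ones, which an actual implementation may additionally do to keep the Gram matrix positive semidefinite.

```python
import numpy as np


def project_top_p(G, p):
    """Keep the p largest eigenvalues of symmetric G, zero the other n - p."""
    n = G.shape[0]
    if not 1 <= p <= n:
        raise ValueError("p must satisfy 1 <= p <= n")
    G_sym = (G + G.T) / 2.0           # guard against slight asymmetry
    w, V = np.linalg.eigh(G_sym)      # eigenvalues in ascending order
    w_trunc = w.copy()
    w_trunc[: n - p] = 0.0            # zero the n - p smallest eigenvalues
    # Rebuild the matrix as V diag(w_trunc) V^T.
    return (V * w_trunc) @ V.T
```

Applying this after each gradient step keeps the rank of the iterate at most $p$, which matches the desired embedding dimension.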

**Results.** From Figure 2 and Table 1, we observe the following. First, the generalization ability of all methods improves as the number of training samples increases, and the accompanying decrease in standard deviation indicates improved stability. Second, the proposed algorithm shows better generalization performance than the traditional methods in all four settings. Whereas GNMDS-$p$/STE-$p$/TSTE-$p$ need more training samples, our method achieves better results with fewer. This is our main motivation for optimizing the margin distribution instead of maximizing the minimum margin as the classic methods do. Third, the results of GNMDS-$p$ verify that maximizing only the minimum margin does not necessarily lead to better generalization performance, as STE-$p$ outperforms GNMDS-$p$ when training samples are few.

Music Artist Data

**Settings.** The music artist data was collected by (Ellis et al. 2002) via a web-based survey in which $1,032$ users provided $213,472$ triplets on the similarity of $412$ music artists. We use the data pre-processed by (?), which includes only $9,107$ triplets for $n=400$ artists. The number of training samples varies from $200$ to $5,000$, and the remaining triplets are treated as the test set. The desired embedding dimension is $d=9$, as these music artists can be classified by genre into $9$ categories.

**Results.** From the experimental results in Figure 3 and Table 2, we make the following observations. DMOE still shows better prediction results than GNMDS-$p$/STE-$p$/TSTE-$p$ with the same number of noisy training samples. To achieve the same generalization error, DMOE needs the fewest training samples, while STE-$p$/TSTE-$p$ need five times more training samples than DMOE. This real-world experiment verifies that the proposed method, DMOE, generalizes well for ordinal embedding with few training samples. Although this dataset contains noisy triplets, and the mean and variance are well known to be sensitive to noise, the proposed method shows a standard deviation of the same magnitude and its results are not damaged by potentially wrong training samples. Robustness remains an open problem in ordinal embedding, and we leave it as future work.

Conclusion

Classical ordinal embedding algorithms need a large amount of labeled data to predict unknown similarity relationships among items from the learned embedded points. As collecting high-quality, large-scale labeled data from humans is a hard task, generalization ability is the main challenge when only a small number of relative comparisons is available. Incorporating the margin distribution learning paradigm gives rise to a novel algorithm for ordinal embedding, namely DMOE. Comprehensive experiments on a synthetic dataset and a real-world dataset validate the superiority of our method over traditional methods, which need more training data to achieve the same generalization.

Acknowledgment

The research of Ke Ma and Xiaochun Cao is supported by the National Key R&D Program of China (Grant No. 2016YFB0800603), the Key Program of the Chinese Academy of Sciences (No. QYZDB-SSW-JSC003) and the National Natural Science Foundation of China (No. U1636214, U1605252, 61733007). The research of Qianqian Xu is supported in part by the National Natural Science Foundation of China (No. 61672514, 61390514, 61572042), the Beijing Natural Science Foundation (No. 4182079), the Youth Innovation Promotion Association CAS, and the CCF-Tencent Open Research Fund.

Bibliography

[Agarwal et al. 2007] Agarwal, S.; Wills, J.; Cayton, L.; Lanckriet, G. R.; Kriegman, D. J.; and Belongie, S. 2007. Generalized non-metric multidimensional scaling. International Conference on Artificial Intelligence and Statistics 11–18.
[Amid and Ukkonen 2015] Amid, E., and Ukkonen, A. 2015. Multiview triplet embedding: Learning attributes in multiple maps. International Conference on Machine Learning 1472–1480.
[Arias-Castro 2017] Arias-Castro, E. 2017. Some theory for ordinal embedding. Bernoulli 23(3):1663–1693.
[Borg and Groenen 2003] Borg, I., and Groenen, P. 2003. Modern multidimensional scaling: theory and applications. Journal of Educational Measurement 40(3):277–280.
[Bradley and Terry 1952] Bradley, R. A., and Terry, M. E. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39(3/4):324–345.
[Drucker et al. 1997] Drucker, H.; Burges, C. J.; Kaufman, L.; Smola, A. J.; and Vapnik, V. 1997. Support vector regression machines. In Advances in Neural Information Processing Systems, 155–161.
[Ellis et al. 2002] Ellis, D. P.; Whitman, B.; Berenzweig, A.; and Lawrence, S. 2002. The quest for ground truth in musical artist similarity. International Society for Music Information Retrieval Conference.
[Gao and Zhou 2013] Gao, W., and Zhou, Z. 2013. On the doubt about margin explanation of boosting. Artificial Intelligence 203:1–18.
