Less but Better: Generalization Enhancement of Ordinal Embedding via Distributional Margin
Ke Ma, Qianqian Xu, Zhiyong Yang, Xiaochun Cao

TL;DR
This paper introduces a novel margin distribution learning approach called DMOE to improve the generalization of ordinal embedding, especially with limited comparison data, by optimizing margin distribution rather than just margin size.
Contribution
The paper proposes a new paradigm for ordinal embedding that focuses on margin distribution, with a specific objective function and an efficient optimization algorithm, enhancing generalization with fewer samples.
Findings
DMOE outperforms classical methods on simulated datasets.
The approach improves embedding quality with limited comparison data.
Experimental results validate the effectiveness of margin distribution optimization.
Abstract
In the absence of prior knowledge, ordinal embedding methods obtain new representation for items in a low-dimensional Euclidean space via a set of quadruple-wise comparisons. These ordinal comparisons often come from human annotators, and sufficient comparisons induce the success of classical approaches. However, collecting a large number of labeled data is known as a hard task, and most of the existing work pay little attention to the generalization ability with insufficient samples. Meanwhile, recent progress in large margin theory discloses that rather than just maximizing the minimum margin, both the margin mean and variance, which characterize the margin distribution, are more crucial to the overall generalization performance. To address the issue of insufficient training samples, we propose a margin distribution learning paradigm for ordinal embedding, entitled Distributional…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8| algorithm | min | median | max | std |
|---|---|---|---|---|
| GNMDS- | 0.419 | 0.447 | 0.476 | 0.016 |
| STE- | 0.397 | 0.426 | 0.461 | 0.016 |
| TSTE- | 0.440 | 0.468 | 0.498 | 0.014 |
| DMOE | 0.372 | 0.390 | 0.410 | 0.011 |
| algorithm | min | median | max | std |
|---|---|---|---|---|
| GNMDS- | 0.419 | 0.447 | 0.476 | 0.016 |
| STE- | 0.397 | 0.426 | 0.461 | 0.016 |
| TSTE- | 0.440 | 0.468 | 0.498 | 0.014 |
| DMOE | 0.372 | 0.390 | 0.410 | 0.011 |
| min | median | max | std |
|---|---|---|---|
| 0.318 | 0.341 | 0.359 | 0.009 |
| 0.375 | 0.385 | 0.401 | 0.007 |
| 0.426 | 0.441 | 0.466 | 0.011 |
| 0.281 | 0.298 | 0.305 | 0.008 |
| min | median | max | std |
|---|---|---|---|
| 0.143 | 0.147 | 0.154 | 0.007 |
| 0.219 | 0.234 | 0.251 | 0.007 |
| 0.238 | 0.257 | 0.271 | 0.011 |
| 0.142 | 0.146 | 0.151 | 0.008 |
| algorithm | min | median | max | std |
|---|---|---|---|---|
| GNMDS- | 0.391 | 0.403 | 0.416 | 0.007 |
| STE- | 0.444 | 0.455 | 0.475 | 0.008 |
| TSTE- | 0.416 | 0.436 | 0.458 | 0.011 |
| DOME | 0.372 | 0.385 | 0.400 | 0.007 |
| algorithm | min | median | max | std |
|---|---|---|---|---|
| GNMDS- | 0.391 | 0.403 | 0.416 | 0.007 |
| STE- | 0.444 | 0.455 | 0.475 | 0.008 |
| TSTE- | 0.416 | 0.436 | 0.458 | 0.011 |
| DOME | 0.372 | 0.385 | 0.400 | 0.007 |
| min | median | max | std |
|---|---|---|---|
| 0.307 | 0.317 | 0.332 | 0.007 |
| 0.397 | 0.415 | 0.429 | 0.007 |
| 0.377 | 0.389 | 0.406 | 0.007 |
| 0.281 | 0.291 | 0.307 | 0.007 |
| min | median | max | std |
|---|---|---|---|
| 0.225 | 0.239 | 0.257 | 0.006 |
| 0.252 | 0.275 | 0.294 | 0.011 |
| 0.243 | 0.259 | 0.297 | 0.013 |
| 0.216 | 0.227 | 0.244 | 0.007 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFace and Expression Recognition · Domain Adaptation and Few-Shot Learning · Text and Document Classification Technologies
Less but Better:
Generalization Enhancement of Ordinal Embedding via Distributional Margin
Ke Ma1,2, Qianqian Xu3, Zhiyong Yang1,2, Xiaochun Cao1
1 State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences
2 School of Cyber Security, University of Chinese Academy of Sciences
3 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
{make, yangzhiyong, caoxiaochun}@iie.ac.cn, [email protected] The corresponding authors.
Abstract
In the absence of prior knowledge, ordinal embedding methods obtain new representation for items in a low-dimensional Euclidean space via a set of quadruple-wise comparisons. These ordinal comparisons often come from human annotators, and sufficient comparisons induce the success of classical approaches. However, collecting a large number of labeled data is known as a hard task, and most of the existing work pay little attention to the generalization ability with insufficient samples. Meanwhile, recent progress in large margin theory discloses that rather than just maximizing the minimum margin, both the margin mean and variance, which characterize the margin distribution, are more crucial to the overall generalization performance. To address the issue of insufficient training samples, we propose a margin distribution learning paradigm for ordinal embedding, entitled Distributional Margin based Ordinal Embedding (DMOE). Precisely, we first define the margin for ordinal embedding problem. Secondly, we formulate a concise objective function which avoids maximizing margin mean and minimizing margin variance directly but exhibits the similar effect. Moreover, an Augmented Lagrange Multiplier based algorithm is customized to seek the optimal solution of DMOE effectively. Experimental studies on both simulated and real-world datasets are provided to show the effectiveness of the proposed algorithm.
The problem of analyzing a set of objects given similarity information is an inherent part in a broad variety of tasks in artificial intelligence (?; ?), machine learning (?; ?; ?; ?), information retrieval (?), data mining (?) and computer vision (?). Many algorithms are based on the assumption that ‘similar’ inputs should generate ‘close’ outputs. In a numerical setting of embedding, a similarity function (or, equivalently, a dissimilarity function) quantifies how ‘similar’ objects are to others. The required input is the distance or similarity matrix of items. We calculate a set of embedded points which aims to preserve such similarities as well as possible. However, in recent years a whole new branch of the literature has emerged, which is the comparison-based embedding. Instead of evaluating similarity directly, we collect the similarity comparisons as follows:
“Is the similarity between object and larger than the similarity between and ?”
The corresponding problem is ordinal embedding. These two types of supervision information, numerical similarities and relative comparisons, are all generated by human beings. Nevertheless, the latter one provides similarity estimates on a relative scale instead of the absolute scale. The comparison-based setting is a special case of the observation that humans are better at comparing two stimuli than at identifying a single one (?). Consequently, the relative comparison is a more reliable form for incorporating human knowledge with artificial intelligence tasks.
The ordinal embedding problem was firstly studied by (?; ?; ?; ?) in the psychometric society. In recent years, it has drawn a lot of attention (?; ?; ?; ?; ?; ?; ?; ?; ?). One class of these typical methods is margin-based ordinal embedding which solves the problem under the classification framework. The well-known Generalized Non-Metric Multidimensional Scaling (GNMDS) (?) aims at finding a Gram matrix such that the pairwise distances of embedded points satisfy the partial order constraints. Stochastic Triplet Embedding (STE/TSTE) (?) is proposed to jointly penalize the violated constraints and reward the satisfied constraints via logistic loss. Multi-view Triplet Embedding (MTE) (?) decomposes the STE objective function as different components and re-weights them for a better explanation. The other class of ordinal embedding methods uses the nearest neighbor graphs to model the similarity comparisons. Structure Preserving Embedding (SPE) (?) and Local Ordinal Embedding (LOE) (?) embed unweighted nearest neighbor graphs to Euclidean spaces with convex and non-convex objective functions. The nearest neighbor adjacency matrix can be transformed into ordinal constraints, but it is not a standard equipment in comparison-based scenarios. With this limitation, SPE and LOE are not suitable for ordinal embedding via quadruplets or triple comparisons.
A common issue of the existed ordinal embedding methods is the dependence of large samples of similarity comparisons. (?; ?) show the consistency of ordinal embedding problem. When the number of the objects tends to infinity, the set of embedded points always converges to the set of original points, up to similarity transformations; the rate of convergence depends on the Hausdorff distance between the ground-truth points. Later (?) show a finite sampling result of consistency. Learning an embedding which predicts nearly as well as the true embedding needs samples, where is the embedding dimension. There is a strong condition that the triple-wise comparisons are generated from the classical Bradley-Terry-Luce (BTL) model (?; ?) and this assumption could not be verified in the actual applications. The theoretical results suggest that only the adequateness of similarity comparisons can promise the prediction result. However, the cost of eliciting relative similarity comparisons from human beings would be prohibitive. The amenable applications for collecting the relative similarity comparisons, e.g., crowdsourcing and human computation, need passively waiting for participants and stimulate them with money to get the desired information. Without prior knowledge, the relative comparisons always involve all objects, and the number of possible comparisons could be . The spending of data collection presents ordinal embedding methods with a dilemma: the insufficient samples would limit the potential performance; the adequate samples with prospective results would be cumbersome. Unfortunately, most of the traditional methods ignore that the generalization is the main concern in ordinal embedding task with insufficient samples.
In this paper, we propose a new method, named Distributional Margin based Ordinal Embedding (DMOE), which tries to achieve strong generalization performance by optimizing the margin distribution in ordinal embedding problem. Inspired by the recent results in classification (?; ?), we define the margin of ordinal embedding and characterize the margin distribution by the first- and second-order statistics, and try to maximize the margin mean and minimize the margin variance simultaneously. For optimization, we propose an alternating direction method of multipliers (ADMM) for DMOE with semi-definite and low-rank constraints. Comprehensive experiments on the synthetic and real-world datasets show the superiority of our method to other ordinal embedding algorithms, verifying that the margin distribution is more crucial for generalization than minimum margin.
Problem Definition
Throughout the paper, scalars, vectors, matrices and sets are denoted as lowercase letters (), bold lower case letters (), bold capital letters () and calligraphy uppercase letters (). denotes the entry of . is the set of . represents the expectation.
Suppose is a set of objects, we assume that a certain but unknown similarity function assigns similarity value for a pair of objects . With similarity function , a quadruplet defines the corresponding ordinal constraint, and these constraints lead to the ordinal embedding problem.
Definition 1** (Ordinal Constraints).**
Given a set of quadruplets
[TABLE]
which is a subset of , the ordinal constraints , implies the similarity partial order of object pairs in as
[TABLE]
Our goal here is to obtain a set of embedded points which satisfy the ordinal constraints . Without prior knowledge, embedding into a Euclidean space is the most common situation which assumes that the squared Euclidean distances among embedded points are inversely proportional to the unknown similarity values. Specifically, a large distance of two embedded points means the corresponding objects and would have small similarity value . This assumption connects the squared Euclidean distances of and ordinal constraints . We further give the formal definition of ordinal embedding.
Definition 2** (Ordinal Embedding).**
Suppose is a collection of quadruplets which are drawn independently and uniformly at random and is the correspondence ordinal constraints of object set . Let is the desired embedding in the Euclidean space where and is the squared Euclidean distance matrix of embedding . Ordinal embedding is the problem of obtaining with ordinal constraints on such that
[TABLE]
where
[TABLE]
Note that one cannot consistently estimate the underlying embedding with only ordinal supervision and without direct observations. In the case when no direct measurements are available, say the metric information of as the input, the underlying embedding is only identifiable up to certain monotonic transformations, e.g., rotation, reflection, translation, and scaling. Therefore, the sign consistency is adopted as the goal of ordinal embedding.
By the above definition, the margin of instance can be naturally defined as
[TABLE]
where .
Despite the close relationship between and , is a nonlinear function of and it always leads to a non-convex optimization problem. Here we introduce the Gram matrix of and construct a margin function as a linear function of Gram matrix. Firstly, a map is established to connect the distance matrix and the Gram matrix :
[TABLE]
where is the column vector composed of the diagonal entries of and is the -dimension vector with all entries being . With a little abuse of , the margin of instance can be written as
[TABLE]
By the definition of (4), ordinal embedding can be formulated as the following convex optimization problem.
Definition 3** (The Margin-based Ordinal Embedding).**
Let be a loss function which satisfies
[TABLE]
Given the ordinal constraints , the ordinal embedding problem can be formulated as a semi-definite programming of Gram matrix :
[TABLE]
where .
We note that (LABEL:eq:sdp_oe) is a semi-definite programming (SDP) and comes from the fact that is a positive semi-definite matrix. Furthermore, the desired embedding dimension is a parameter of the ordinal embedding. It is well known that there exists a perfect embedding estimated by any label set on the Euclidean distances in , even for the noisy constraints (?). However, the low-dimensional setting where is the main task of this work. The smallest for noisy ordinal constraints is a future direction which worths pursuing. The choice of in the experiment section depends on the potential applications.
For example, the Generalized Non-metric Multidimensional Scaling (GNMDS) follows the SVM formulation to obtain the by solving
[TABLE]
where is a relaxed minimum margin and is the slack variable.
Distributional Margin based Embedding
The relaxed minimum margin in GNMDS indeed characterizes the top minimum margins of all instance . In margin theory of classification, it is known that maximizing the minimum margin of training examples is not sufficient to achieve fulfilling generalization performance (?). The margin distribution of training examples, rather than the minimum margin, is more crucial to generalization performance in classification (?; ?).
Formulation
The two most usual statistics for characterizing the margin distribution are the first- and second-order statistics, that is, the mean and the variance of margin. According to (4), the margin mean of training samples is
[TABLE]
and the margin variance is
[TABLE]
Intuitively, we attempt to maximize the margin mean and minimize the margin variance simultaneously in ordinal embedding problem (LABEL:eq:sdp_oe).
First, there is a straightforward idea to achieve our goal as considering the margin mean (8) and the margin variance (9) in (LABEL:eq:sdp_oe) explicitly. Although (LABEL:eq:sdp_oe) can adopt different loss functions, we will focus on SVM formulation (7) because the hinge loss is a natural form of margin. Considering the margin distribution, the optimization problem (7) can be formulated as
[TABLE]
where and are the trade-off parameters for balancing the impacts of and . It is apparent that GNMDS (7) is a degenerate case of (10) when and equal to [math]. However, there exists an obvious drawback of (10) with directly optimizing the margin distribution: tuning the parameters, and , is an obstacles of solving (10) efficiently. Therefore, a new lightweight formulation is proposed to optimize margin distribution implicitly.
Recall that SVM fixes the minimum margin as by scaling the margin with the norm of linear predictor. Following the similar way, we can scale the margin of in ordinal embedding (4) and set the margin mean as a constant. This would not result in a sub-optimal solution because the ordinal constraints can only determine an embedding up to the monotonic transformations. Without loss of generality, the mean of can be set as a constant and an equality constraint is conducted
[TABLE]
On the other hand, we want to minimize the variance of . By (11), the deviation of to the margin mean is , and we force the deviation to be smaller than as
[TABLE]
Thus, minimizing is equivalent to minimize the margin variance (9). Meanwhile, (12) implies the margin mean constraint (11).
Constraints like (12) in optimization problems always involve two inequality, ,
[TABLE]
Note that the soft-margin constraint
[TABLE]
plays the same role as (13b). Replacing (13b) with (14) and adding (13) into (10), we arrive at the following formulation
[TABLE]
This optimization problem corresponds to dealing with such a loss function,
[TABLE]
The trading-off parameter in (15) can capture the asymmetry between the sign correctness and the dispersion of . When and ignoring the semi-definite and rank constraints, (15) is similar to the support vector regression (SVR) (?). In SVR, the -insensitive loss
[TABLE]
produces two slack variables and for each training example to guard against outliers. As , the loss function (16) is explicitly the same loss function adopted by SVR as . In our formulation (15), and conduct the similar constraints like SVR but have totally different meanings. All the training examples are used to learn the margin distribution in (15), but the optimal solution of SVR is only spanned by the support vectors which is sparse in the training data. Figure 1 depicts the situations of the learned margin distribution graphically. Some theoretical results are provided at the end of this section.
Theorem 1**.**
Suppose that
[TABLE]
and the true Gram matrix . Let be a solution of (15). With probability at least , it holds that
[TABLE]
where is the risk, as for any
[TABLE]
. Here the expectation respects to both the uniformly random selection of the quadruplet and its label .
Theorem 1 says that must scale like which leads to the bounded error . This result is consistent with known finite sample bounds (?). The details are provided in the supplementary materials. The generalization bound of margin distribution is a future direction.
Optimization
Consider that is a linear operator on , has its symmetric matrix form in , the positive semi-definite cone of symmetric matrix. Given a ordinal constraints , is the matrix form of where
[TABLE]
and
[TABLE]
With the trick that
[TABLE]
we note
[TABLE]
where , is a -dimension vector with all entries are and is the Hadamard product. Furthermore, we introduce the redundant variables to make the objective separable which can be solved by the ALM framework efficiently. The optimization is converted into:
[TABLE]
where is the set of all the parameters to be solved and is the nuclear norm which is the convex surrogate of matrix rank constraints. It is worth mentioning that (23) is a convex optimization problem as the feasible set of each constraint is a convex set and the objective function is convex. The Lagrange function of (23) can be written in the following form:
[TABLE]
with
[TABLE]
is norm for vector and the Frobenius norm for matrix. In addition, and are Lagrange multipliers. is the Dirac delta function whose function value would be infinity if the condition is not satisfied. Below are the solutions to each sub-problem.
** sub-problem. **With the variables unrelated to fixed, we have the sub-problem of :
[TABLE]
where
[TABLE]
It’s worth noting that is a piece-wise linear function. Thus, to seek the minimum of each element in , we just need to pick the smaller value between and [math]. The solution of (25) is
[TABLE]
where is an indicator function as and is the complementary support of . The definition of shrinkage operator on scalars is and it is an element-wise operator for vector and matrix.
** sub-problem. **Similarly, picking out the terms related to gives the following sub-problem:
[TABLE]
where
[TABLE]
and the solution of sub-problem is just replaced the with in (26).
** sub-problem. **Dropping the terms independent on leads to the following problem:
[TABLE]
We have
[TABLE]
and note the right-hand side as , we have
[TABLE]
and is the matrix form of .
** sub-problem. ** There are two terms in (24) involving . The associated optimization problem of is
[TABLE]
and solving this problem yields
[TABLE]
where
[TABLE]
and is the shrinkage operator.
** sub-problem. ** Considering the potential asymmetric of , we claim that is the nearest symmetric positive semi-definite matrix of in Frobenius norm (?). By the following theorem, we show the explicit solution of .
Theorem 2**.**
Suppose that , and let , be the symmetric and skew-symmetric parts of respectively. If we do polar decomposition of as where is orthogonal matrix and is positive semi-definite matrix, is the unique approximation of in the Frobenius norm with positive semi-definite constraint, and the distance in the Frobenius norm from to is
[TABLE]
where is the eigenvalue of .
Consequently, the explicit solution of is
[TABLE]
where is the square root of , and , is a diagonal matrix and is element-wise square root of .
For clarity, the procedure of solving (23) is outlined in Algorithm 1. The algorithm would not be terminated until the change of objective value in two successive iterations is smaller than a threshold (in the experiments, is the default setting).
Empirical Study
In this section, we show the results of simulations and real-world data experiments to demonstrate the effectiveness of the proposed algorithms. As the existed margin-based ordinal embedding methods, such as GNMDS, STE and TSTE, just use triple-wise comparisons as the ordinal constraints, we treat triple-wise comparisons as the input of the proposed algorithm for fair competition. The triple-wise comparisons is a special case of quadruplets which means in . The of triplet is also a symmetric matrix indicated by as
[TABLE]
Replacing with in those sub-problems in Algorithm 1, the proposed DMOE method could handle the triplets set as the ordinal constraints. The reproducible code can be found here111https://github.com/alphaprime/DMOE.
Simulation
**Settings. **The synthesized dataset consists of points , where , is the identity matrix. The possible similarity triple comparisons are generated based on the Euclidean distances between . We randomly sample triplets as the training set and the test set is the rest of all triplets. The embedding dimension is fixed to .
**Evaluation Metrics. **We employ the generalization error to evaluate generalization ability of various algorithms. As the learned Gram matrix from partial triple comparisons set may be generalized to unknown triplets, the percentage of held-out triplets which is not satisfied in the is the generalization error of the learned embedding.
Competitors. We compare the proposed algorithm with three well-known ordinal embedding methods: GNMDS (?), STE and TSTE (?). Note that we adopt the optimization strategy proposed by (?), which performs gradient descent with line search, and projects the Gram matrix onto the subspace spanned by the top eigenvalues at each step (i.e. setting the smallest eigenvalues to [math]). We call the three competitors: GNMDS-, STE- and TSTE-, correspondingly. The optimization problem of GNMDS is (7). STE replaces the hinge loss by logistic loss in (7) and adopts Gaussian kernel to predict the label:
[TABLE]
where . TSTE employs the heavy-tailed Student-t kernel:
[TABLE]
The regularization parameters of the competitors are tuned for the best performance under the different settings.
**Results. **From Figure 2 and Table 1, the following phenomena can be observed. First of all, the generalization ability of all methods would be improved when the number of training samples increases. The decrease of standard derivation also improves the stability. Moreover, the proposed algorithm shows better generalization performance than the traditional methods in all four settings. Compared with GNMDS-/STE-/TSTE- which need more training samples, our method can achieve better results with fewer training samples. This is our main motivation to optimize the margin distribution instead of maximizing the minimum margin like the classic methods. Third, the results of GNMDS- verifies that only maximizing the minimum margin would not necessarily lead to better generalization performances as the STE- is better than GNMDS when train samples are few.
Music Artist Data
Settings. The music artist data is collected by (?) via a web-based survey in which users provided triplets on the similarity of music artists. We use the data pre-processed by (?) which includes only triplets for artists. The size of training samples is variant from to and the rest of triplets are treated as test set. The desired dimension of embedding is as these music artists can be classified by genre into categories.
**Results. **According to the experimental results, Figure 3 and Table 2, we have the following observations. DMOE still shows better prediction result than GNMDS-/STE-/TSTE- with the same number of noisy training samples. To achieve the same generalization error, DMOE needs the smallest number of training samples and STE-/TSTE- need five times more than DMOE. This real-world data experiment verifies the proposed method, DMOE, has strong generalization for ordinal embedding with small training samples. Although this dataset contains noise triplets and it is well-known that the calculation of mean and the variance is sensitive, the proposed method show the same magnitude of standard deviation and its results are not damaged by the potential wrong training samples. The robustness is still an open problem in ordinal embedding, and this is our future work.
Conclusion
The classical ordinal embedding algorithms always need a large number of labeled data to predict unknown similarity relationship among items from learned embedded points. As collecting high-quality, large-scale labeled data from human is a hard task, generalization ability is the main challenge when we could only access small numbers of relative comparisons. Incorporating margin distribution learning paradigm gives birth to a novel algorithm for ordinal embedding, namely DMOE. Comprehensive experiments on synthetic dataset and real-world dataset validate the superiority of our method to traditional methods which need more training data to achieve the same generalization.
Acknowledgment
The research of Ke Ma and Xiaochun Cao is supported by the National Key R&D Program of China (Grant No. 2016YFB0800603), the Key Program of the Chinese Academy of Sciences (No. QYZDB-SSW-JSC003) and the National Natural Science Foundation of China (No.U1636214, U1605252, 61733007). The research of Qianqian Xu is supported in part by the National Natural Science Foundation of China (No.61672514, 61390514, 61572042), the Beijing Natural Science Foundation (4182079), the Youth Innovation Promotion Association CAS, and the CCF-Tencent Open Research Fund.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[Agarwal et al . 2007] Agarwal, S.; Wills, J.; Cayton, L.; Lanckriet, G. R.; Kriegman, D. J.; and Belongie, S. 2007. Generalized non-metric multidimensional scaling. International Conference on Artificial Intelligence and Statistics 11–18.
- 2[Amid and Ukkonen 2015] Amid, E., and Ukkonen, A. 2015. Multiview triplet embedding: Learning attributes in multiple maps. International Conference on Machine Learning 1472–1480.
- 3[Arias-Castro 2017] Arias-Castro, E. 2017. Some theory for ordinal embedding. Bernoulli 23(3):1663–1693.
- 4[Borg and Groenen 2003] Borg, I., and Groenen, P. 2003. Modern multidimensional scaling: theory and applications. Journal of Educational Measurement 40(3):277–280.
- 5[Bradley and Terry 1952] Bradley, R. A., and Terry, M. E. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39(3/4):324–345.
- 6[Drucker et al . 1997] Drucker, H.; Burges, C. J.; Kaufman, L.; Smola, A. J.; and Vapnik, V. 1997. Support vector regression machines. In Advances in Neural Information Processing Systems , 155–161.
- 7[Ellis et al . 2002] Ellis, D. P.; Whitman, B.; Berenzweig, A.; and Lawrence, S. 2002. The quest for ground truth in musical artist similarity. International Society for Music Information Retrieval Conference .
- 8[Gao and Zhou 2013] Gao, W., and Zhou, Z. 2013. On the doubt about margin explanation of boosting. Artificial Intelligence 203:1–18.
