Informative extended Mallows priors in the Bayesian Mallows model
Marta Crispino, Isadora Antoniano-Villalobos

TL;DR
This paper introduces a new method for eliciting informative priors in the Bayesian Mallows model with Spearman's distance, moving beyond the traditional uniform prior to incorporate subjective beliefs, thereby enhancing inference quality.
Contribution
It proposes a novel strategy for prior elicitation in the Bayesian Mallows model, clarifying hyper-parameter interpretation and impact on posterior analysis.
Findings
New prior elicitation method developed
Hyper-parameter interpretation clarified
Implications for posterior inference discussed
Abstract
The aim of this work is to study the problem of prior elicitation for the Mallows model with Spearman's distance, a popular distance-based model for rankings or permutation data. Previous Bayesian inference for such model has been limited to the use of the uniform prior over the space of permutations. We present a novel strategy to elicit subjective prior beliefs on the location parameter of the model, discussing the interpretation of hyper-parameters and the implication of prior choices for the posterior analysis.
| (1,2,3,4) | 260 | 2 | 0.029 | 0.038 | 0.050 | 0.053 | 0.053 | 0.050 |
| (1,2,4,3) | 230 | 4 | 0.172 | 0.125 | 0.080 | 0.052 | 0.050 | 0.036 |
| (1,3,2,4) | 310 | 6 | 0.007 | 0.003 | 0.003 | 0.004 | 0.004 | 0.004 |
| (1,3,4,2) | 250 | 10 | 0.049 | 0.010 | 0.005 | 0.004 | 0.004 | 0.003 |
| (1,4,2,3) | 330 | 12 | 0.004 | 0.001 | 0.001 | 0.002 | 0.002 | 0.001 |
| (1,4,3,2) | 300 | 14 | 0.007 | 0.002 | 0.001 | 0.002 | 0.001 | 0.001 |
| (2,1,3,4) | 250 | 0 | 0.048 | 0.129 | 0.257 | 0.417 | 0.436 | 0.546 |
| (2,1,4,3) | 220 | 2 | 0.367 | 0.579 | 0.527 | 0.410 | 0.386 | 0.303 |
| (2,3,1,4) | 350 | 8 | 0.003 | 0.001 | 0.001 | 0.002 | 0.002 | 0.002 |
| (2,3,4,1) | 260 | 14 | 0.029 | 0.005 | 0.003 | 0.002 | 0.002 | 0.002 |
| (2,4,1,3) | 370 | 14 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| (2,4,3,1) | 310 | 18 | 0.006 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| (3,1,2,4) | 290 | 2 | 0.009 | 0.010 | 0.015 | 0.017 | 0.023 | 0.022 |
| (3,1,4,2) | 230 | 6 | 0.169 | 0.065 | 0.032 | 0.016 | 0.017 | 0.012 |
| (3,2,1,4) | 340 | 6 | 0.003 | 0.002 | 0.002 | 0.003 | 0.003 | 0.003 |
| (3,2,4,1) | 250 | 12 | 0.049 | 0.007 | 0.004 | 0.002 | 0.002 | 0.002 |
| (3,4,1,2) | 380 | 18 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| (3,4,2,1) | 350 | 20 | 0.003 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| (4,1,2,3) | 300 | 6 | 0.007 | 0.005 | 0.004 | 0.004 | 0.004 | 0.004 |
| (4,1,3,2) | 270 | 8 | 0.019 | 0.007 | 0.006 | 0.004 | 0.004 | 0.004 |
| (4,2,1,3) | 350 | 10 | 0.003 | 0.002 | 0.001 | 0.001 | 0.002 | 0.001 |
| (4,2,3,1) | 290 | 14 | 0.009 | 0.002 | 0.002 | 0.001 | 0.001 | 0.001 |
| (4,3,1,2) | 370 | 16 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| (4,3,2,1) | 340 | 18 | 0.003 | 0.001 | 0.001 | 0.001 | 0.001 | 0.000 |
| Prop. | ||||||||
|---|---|---|---|---|---|---|---|---|
| ACDEB | 0.337 | 0.047 | 0.055 | 0.062 | 0.076 | 0.105 | 0.078 | 4 |
| ADCEB | 0.184 | 0.037 | 0.044 | 0.050 | 0.060 | 0.129 | 0.129 | 2 |
| ACDBE | 0.122 | 0.031 | 0.035 | 0.041 | 0.048 | 0.095 | 0.103 | 2 |
| ADCBE | 0.082 | 0.025 | 0.030 | 0.034 | 0.041 | 0.114 | 0.176 | 0 |
| ACEDB | 0.061 | 0.022 | 0.028 | 0.024 | 0.025 | 0.018 | 0.015 | 10 |
| CADEB | 0.051 | 0.025 | 0.028 | 0.030 | 0.026 | 0.023 | 0.019 | 8 |
| ADECB | 0.051 | 0.015 | 0.019 | 0.021 | 0.020 | 0.020 | 0.020 | 6 |
| Sushi item | oil | eat | price | sell |
|---|---|---|---|---|
| shrimp | 2.73 | 2.14 | 1.84 | 0.84 |
| sea eel | 0.93 | 1.99 | 1.99 | 0.88 |
| tuna | 1.77 | 2.35 | 1.87 | 0.88 |
| squid | 2.69 | 2.04 | 1.52 | 0.92 |
| sea urchin | 0.81 | 1.64 | 3.29 | 0.88 |
| salmon roe | 1.26 | 1.98 | 2.70 | 0.88 |
| egg | 2.37 | 1.87 | 1.03 | 0.84 |
| fatty tuna | 0.55 | 2.06 | 4.49 | 0.80 |
| tuna roll | 2.25 | 1.88 | 1.58 | 0.44 |
| cucumber roll | 3.73 | 1.46 | 1.02 | 0.40 |
| Sushi item | oil | eat | price | sell |
|---|---|---|---|---|
| shrimp | 9 | 2 | 6 | 6.5 |
| sea eel | 3 | 5 | 4 | 3.5 |
| tuna | 5 | 1 | 5 | 3.5 |
| squid | 8 | 4 | 8 | 1 |
| sea urchin | 2 | 9 | 2 | 3.5 |
| salmon roe | 4 | 6 | 3 | 3.5 |
| egg | 7 | 8 | 9 | 6.5 |
| fatty tuna | 1 | 3 | 1 | 8 |
| tuna roll | 6 | 7 | 7 | 9 |
| cucumber roll | 10 | 10 | 10 | 10 |
Marta Crispino
Mistis team, Inria Grenoble Rhône-Alpes, Inovallée 655 avenue de l’Europe, Montbonnot France.
Isadora Antoniano-Villalobos
Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy and Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy.
Keywords — Bayesian subjective inference, conjugate priors, Mallows model for rankings, ranking data, permutations, permutohedron
1 Motivation
In recent years, interest in preference data has increased, in part due to internet-related activities. The study of rankings, in particular, has received special attention, since this type of data arises in many fields. Notable examples are electoral systems in which voters are required to rank candidates, as is the case of the Irish general elections (Gormley and Murphy, 2008); automatic recommender systems seeking to aggregate preferences in order to suggest products to the customers (Sun et al., 2012); market research based on surveys in which competing services, or items, are compared or ranked by customers (Dabic and Hatzinger, 2009); medical applications, especially in genomics, in which genes are sometimes ranked according to their expression levels under various experimental conditions (Vitelli et al., 2018), and other data are often transformed into rankings in order to minimize the effect of miscalibration error from the measuring devices (Mollica and Tardella, 2014). In a coherent analysis of ranking data, the quantification of uncertainty regarding the estimated quantities is a fundamental aspect of decision making. It could, for instance, allow for actions grounded on unreliable estimates to be deferred until more data are available.
The Mallows model (MM) (Mallows, 1957; Diaconis, 1988) is a popular two-parameter distance-based family of models for ranking data, based on the assumption that a modal ranking, which can be interpreted as the consensus ranking of the population, exists. The probability of observing a given ranking is then assumed to decay exponentially fast as its distance from the consensus grows. Individual models with different properties can be obtained depending on the choice of distance on the space of permutations. The scale or precision parameter, controlling the concentration of the distribution, determines the rate of decay of the probability of individual ranks.
We focus on the Mallows model with Spearman’s distance (MMS), introduced by Mallows (1957) with the name of rho-model, since Spearman’s distance, when re-scaled to lie between $-1$ and $1$, arises naturally as the correlation between the ranks of two samples. Fligner and Verducci (1990) and Vitelli et al. (2018) have studied Bayesian inference for the MMS, limiting the analysis to the use of a uniform prior on the consensus ranking. As we discuss in Section 3.1, this can be interpreted as a non-informative prior.
Within the Bayesian literature, non-informative and objective priors have attracted much attention in the search for standard go-to procedures when prior information is unavailable. They can also be used to provide a sense of neutrality to the analysis by allowing the data to be the only source of information in the estimation procedure. However, when information is available from experts or external sources, it may be argued that a fully Bayesian analysis should include this subjective prior belief. Dawid (1997) clearly stated that “no theory which incorporates non-subjective priors can truly be called Bayesian, and no amount of wishful thinking can alter this reality”. While admitting that both approaches may be valid in different situations, in this paper we explore the possibility of including genuine prior information, which might come from a literature review, from an expert or from an earlier data analysis, into the Bayesian Mallows model for ranking data (Mallows, 1957; Vitelli et al., 2018).
Previous proposals to include prior information on the consensus ranking of a MM include Gupta and Damien (2002), who suggest eliciting a prior on the consensus which is constant on conjugacy classes. In other words, they propose a prior that assigns a priori equal probability to all permutations with the same cyclic structure. However, the conjugacy classes defined by cyclic structures do not coincide with the sets of permutations lying at the same distance (e.g. Spearman’s) from the consensus ranking, making this approach impractical for the MMS, as it is difficult to assess how prior information enters the model. Meilǎ and Bao (2010) and Meilǎ and Chen (2010) consider the MM with Kendall’s distance within the Bayesian paradigm and provide a conjugate prior for the model parameters which is known up to a normalization constant. However, their analysis does not extend to the MMS. Xu et al. (2018) propose an alternative family of models for rankings, based on a mapping of the data to the unit sphere (see also McCullagh, 1993). The location parameter of their model has an interpretation analogous to that of the consensus ranking but it is not limited to be itself a ranking, thus allowing a more general form of consensus to be expressed. The MMS is a particular case of this model, and the authors propose a conjugate Bayesian prior for the consensus parameter. However, the emphasis of the paper is on efficient inference via an approximation of the model’s normalizing constant and the use of variational methods; prior elicitation and the inclusion of prior information are not discussed.
In the present work, which stems from Chapter 6 of Crispino (2017), we aim to provide experts using the MMS with a tool to express their beliefs, knowing the effect of prior choices in their analysis, should they wish to do so. With this in mind, by exploiting the notion of permutohedron, also known as permutation polytope, (Thompson, 1993; McCullagh, 1993; Marden, 1995), we find an explicit form for a conjugate prior on the consensus parameter for the MMS. We then study its properties, presenting some theoretical insights on the prior elicitation problem. Subjective prior information on the consensus ranking can therefore be elicited by choosing proper hyper-parameters. In doing this, we initially assume the scale parameter of the MMS to be known, given that in most applications it is considered a nuisance, the interest being focused on the estimation of the consensus ranking (see Vitelli et al., 2018, Section 3). The proposed prior density can handle a situation when only partial information is available, which is particularly relevant when the set of items to be ranked is very large. In such cases it is unlikely that a full ranking is a priori available, while it could be possible to express some prior belief regarding which are the most (or least) preferred items. An additional advantage of our prior is given by the interpretability of the hyper-parameters in terms of the amount and type of information included.
The paper is organized as follows. In Section 2 we give an overview of the MMS. In Section 3 we discuss the novel results regarding the conjugate prior for the consensus parameter of the MMS, initially assuming the dispersion parameter to be known (Section 3.1), then (Section 3.2) working with both parameters unknown. In Section 4 we sketch the MCMC algorithm used to perform inference on our model, and in Section 5 we illustrate the inference on simple examples, exploiting both simulations and benchmark datasets. We conclude with some final remarks in Section 6.
2 Preliminaries
A (full) ranking of $n$ items, or $n$-ranking, is defined as a map from a finite set, $\mathcal{A} = \{A_1, \dots, A_n\}$, of labeled items to the space $\mathcal{P}_n$ of $n$-dimensional permutations. A ranking can, therefore, be represented by a vector $\mathbf{r} = (r_1, \dots, r_n)$, where $r_i$ is the rank assigned to item $A_i$ according to some criterion. Formally, individual ranks are ordinal numbers, so that $r_i < r_j$ when item $A_i$ is preferred to (ranked lower than) item $A_j$. Alternatively, rank data may be represented through orderings, which are ordered vectors of labels. Clearly, there is a one-to-one relationship between the two representations, e.g. a possible ranking of the set $\{A_1, A_2, A_3\}$ is $\mathbf{r} = (2, 3, 1)$, corresponding to the ordering $(A_3, A_1, A_2)$. Since the ranking vector representation has many advantages in terms of modelling, we will stick to it throughout the paper, and only use the orderings when necessary for illustrative purposes. Given the trivial one-to-one relation between ordinal and cardinal numbers, with a slight abuse of notation, one may consider $n$-rankings as $n$-dimensional vectors obtained by permuting the first $n$ natural numbers, $\mathcal{P}_n \subset \mathbb{R}^n$. It is then easy to see that $\mathcal{P}_n$ is contained in an $(n-1)$-dimensional affine subspace of $\mathbb{R}^n$. In fact, it is composed of the points on the intersection between the hyper-plane with coordinate sums equal to $n(n+1)/2$ and the surface of an $n$-dimensional sphere of squared radius $n(n+1)(2n+1)/6$ centered at the origin. Thus, all the points of $\mathcal{P}_n$ lie on an $(n-1)$-dimensional sphere of squared radius $n(n^2-1)/12$ centered at $\frac{n+1}{2}\mathbf{1}_n$, where $\mathbf{1}_n$ denotes the vector with all entries equal to $1$ (McCullagh, 1993).
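The geometric description of the set of rankings can be checked by enumeration for small $n$. The following Python sketch (the function name `ranking_geometry` is ours and purely illustrative) collects, over all permutations of $1, \dots, n$, the coordinate sums, the squared norms and the squared distances from the point with all coordinates equal to $(n+1)/2$. Each of the three sets should contain a single value, namely $n(n+1)/2$, $n(n+1)(2n+1)/6$ and $n(n^2-1)/12$.

```python
from itertools import permutations

def ranking_geometry(n):
    """Collect sums, squared norms and squared distances to the barycenter
    over all n! rankings (feasible only for small n)."""
    center = (n + 1) / 2
    sums, norms, dists = set(), set(), set()
    for r in permutations(range(1, n + 1)):
        sums.add(sum(r))
        norms.add(sum(x * x for x in r))
        dists.add(round(sum((x - center) ** 2 for x in r), 9))
    return sums, norms, dists

sums, norms, dists = ranking_geometry(4)
print(sums, norms, dists)
```

Since every set has one element, all rankings lie on the same hyper-plane and on the same two spheres, which is the property exploited later when the squared norm of a ranking is treated as a constant.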
The Mallows model for ranking data (Mallows, 1957) defines the probability that a random $n$-ranking takes a value $\mathbf{r} \in \mathcal{P}_n$ as

$$
P(\mathbf{r} \mid \boldsymbol{\rho}, \theta) = \frac{\exp\{-\theta\, d(\mathbf{r}, \boldsymbol{\rho})\}}{Z(\theta)}, \qquad Z(\theta) = \sum_{\mathbf{r} \in \mathcal{P}_n} \exp\{-\theta\, d(\mathbf{r}, \mathbf{e})\}, \tag{1}
$$

where $\boldsymbol{\rho} \in \mathcal{P}_n$ is a location parameter representing the shared consensus ranking and $\theta \geq 0$ is a scale parameter describing the concentration of the mass around the shared consensus. Different families of models are obtained through different choices of the right-invariant (Diaconis, 1988) distance $d(\cdot, \cdot)$ on $\mathcal{P}_n$. Right-invariance (see also Definition 3 in the Appendix), which ensures that distances are independent of any relabeling of the items, is an important property in this context, as it ensures that the partition function $Z(\theta)$ of the MM does not depend on $\boldsymbol{\rho}$ (Mukherjee, 2016; Vitelli et al., 2018). In the above expression $\mathbf{e} = (1, \dots, n)$ denotes the identity permutation. Nevertheless, the number of terms in the sum, $n!$, makes direct calculation of this partition function unfeasible for all but very small values of $n$. Therefore, the MM is considered known up to a proportionality constant only, except for some particular choices of the distance, for which $Z(\theta)$ may have a closed form (Fligner and Verducci, 1986). Different approximation strategies have been proposed (see e.g. McCullagh, 1993; Mukherjee, 2016; Vitelli et al., 2018), allowing inference even with a large number, $n$, of items. Notice that the distance function induces a partition of $\mathcal{P}_n$ formed by sets of rankings which are equidistant from $\boldsymbol{\rho}$. Within each partition set, the MM assigns equal probability to all rankings. As a consequence, exact computation of the partition function is possible for moderate $n$, for some choices of $d$ for which the cardinalities of the partition sets are known (see e.g. Irurozki et al., 2016; Vitelli et al., 2018). The partitions of $\mathcal{P}_n$ associated to Spearman’s distance play a crucial role in understanding the behavior of the prior proposed here for the MMS.
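In this work we focus on the Mallows model with Spearman’s distance, given by $d(\mathbf{r}, \boldsymbol{\rho}) = \sum_{i=1}^{n} (r_i - \rho_i)^2$ for $\mathbf{r}, \boldsymbol{\rho} \in \mathcal{P}_n$, which was first introduced with the name rho-model by Mallows (1957). Notice that Spearman’s distance is an unnormalized version of Spearman’s rank correlation, used to measure the statistical correlation between the ranks of two variables, but, when rankings are considered as vectors in $\mathbb{R}^n$, it is simply the squared Euclidean distance, or squared $\ell_2$-norm, $\|\mathbf{r} - \boldsymbol{\rho}\|^2$. Therefore, we say that a random ranking $\mathbf{r}$ follows an MMS distribution, denoted by $\mathbf{r} \sim \text{MMS}(\boldsymbol{\rho}, \theta)$, if its probability mass function is given by

$$
P(\mathbf{r} \mid \boldsymbol{\rho}, \theta) = \frac{\exp\{-\theta\, \|\mathbf{r} - \boldsymbol{\rho}\|^2\}}{Z(\theta)}, \qquad \mathbf{r} \in \mathcal{P}_n, \tag{2}
$$

where $Z(\theta)$ does not have a closed form. Notice, however, that when $\theta = 0$, the MMS reduces to the uniform distribution on $\mathcal{P}_n$.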
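For a handful of items the MMS can be evaluated exactly by enumeration. The sketch below assumes a probability mass function proportional to $\exp\{-\theta\, d(\mathbf{r}, \boldsymbol{\rho})\}$, with $d$ the Spearman distance, $\boldsymbol{\rho}$ the consensus ranking and $\theta$ the precision; the names `spearman`, `partition` and `mms_pmf` are hypothetical helpers introduced only for this illustration.

```python
from itertools import permutations
from math import exp

def spearman(r, s):
    """Spearman distance: squared Euclidean distance between two rank vectors."""
    return sum((a - b) ** 2 for a, b in zip(r, s))

def partition(rho, theta):
    """Normalizing constant obtained by summing over all n! rankings."""
    n = len(rho)
    return sum(exp(-theta * spearman(r, rho))
               for r in permutations(range(1, n + 1)))

def mms_pmf(rho, theta):
    """Map every ranking to its probability under the MMS centered at rho."""
    n = len(rho)
    z = partition(rho, theta)
    return {r: exp(-theta * spearman(r, rho)) / z
            for r in permutations(range(1, n + 1))}

pmf = mms_pmf((2, 1, 3, 4), theta=0.5)
print(max(pmf, key=pmf.get))  # modal ranking
print(partition((1, 2, 3, 4), 0.5), partition((3, 1, 4, 2), 0.5))
```

The two printed partition values agree up to rounding, which illustrates that the normalizing constant does not depend on the consensus ranking, and setting `theta=0.0` returns the uniform distribution.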
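Given a sample $\mathbf{r}_1, \dots, \mathbf{r}_N$ from the MMS, the likelihood function takes the form

$$
P(\mathbf{r}_1, \dots, \mathbf{r}_N \mid \boldsymbol{\rho}, \theta) = \frac{1}{Z(\theta)^N} \exp\Big\{-\theta \sum_{j=1}^{N} \|\mathbf{r}_j - \boldsymbol{\rho}\|^2\Big\}. \tag{3}
$$

Therefore, for $\theta > 0$, the maximum likelihood estimator (MLE) of $\boldsymbol{\rho}$ is given by

$$
\hat{\boldsymbol{\rho}} = \arg\min_{\boldsymbol{\rho} \in \mathcal{P}_n} \sum_{j=1}^{N} \|\mathbf{r}_j - \boldsymbol{\rho}\|^2 = \arg\max_{\boldsymbol{\rho} \in \mathcal{P}_n} \boldsymbol{\rho} \cdot \bar{\mathbf{r}},
$$

where the dot denotes the scalar product on $\mathbb{R}^n$, and $\bar{\mathbf{r}} = \frac{1}{N} \sum_{j=1}^{N} \mathbf{r}_j$ is the sample mean vector of $\mathbf{r}_1, \dots, \mathbf{r}_N$. The second equality holds because $\|\boldsymbol{\rho}\|^2$ takes the same value for every $\boldsymbol{\rho} \in \mathcal{P}_n$. This is not surprising as the kernel of the MMS distribution coincides with that of an $n$-dimensional Gaussian distribution, except it has a finite support. In other words, the MMS is the restriction of the $n$-dimensional Gaussian to $\mathcal{P}_n$. Clearly, if $\bar{\mathbf{r}} \in \mathcal{P}_n$, then the MLE simply coincides with the sample mean. In general, however, $\bar{\mathbf{r}} \notin \mathcal{P}_n$, so a further consideration is required in order to solve the optimization problem.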
Definition 1.
The permutohedron of order $n$, $\mathcal{PH}_n$, is an $(n-1)$-dimensional polytope embedded in an $n$-dimensional space, the vertices of which are formed by permuting the coordinates of the vector $(1, 2, \dots, n)$. Equivalently, it is the convex hull of the points $\mathbf{r} \in \mathcal{P}_n$.
The set $\mathcal{PH}_n$ is sometimes called the permutation polytope (see e.g. Thompson, 1993; Marden, 1995). This term, however, also refers to a similar polytope whose vertices follow a different order. Here we use the term permutohedron to avoid ambiguity.
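Definition 2.
Let $\mathbf{x} \in \mathbb{R}^n$, such that $x_i \neq x_j$ for all $i \neq j$. The vector $R(\mathbf{x}) = (R_1(\mathbf{x}), \dots, R_n(\mathbf{x}))$ defined by $R_i(\mathbf{x}) = \sum_{j=1}^{n} \mathbb{1}(x_j \leq x_i)$, is called the rank vector of $\mathbf{x}$, where $\mathbb{1}(\cdot)$ denotes the indicator function taking the value 1 if the event is true, and 0 otherwise.
By the definition of convex hull, for any set of rankings $\mathbf{r}_1, \dots, \mathbf{r}_N \in \mathcal{P}_n$, $\bar{\mathbf{r}} \in \mathcal{PH}_n$. The following proposition shows that $\hat{\boldsymbol{\rho}} = R(\bar{\mathbf{r}})$ whenever $\bar{r}_i \neq \bar{r}_j$ for all $i \neq j$.
Proposition 1.
Let $\mathbf{x} \in \mathcal{PH}_n$, and assume that $\mathbf{x}$ is such that $x_i \neq x_j$, for each $i \neq j$. Then $\arg\min_{\mathbf{r} \in \mathcal{P}_n} \|\mathbf{r} - \mathbf{x}\|^2 = R(\mathbf{x})$.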
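The link between the rank vector of the sample mean and the MLE of the consensus can be checked numerically on a toy sample. The sketch below is illustrative only: `rank_vector` ranks each coordinate by counting the coordinates not larger than it (1 for the smallest value, no ties assumed), and `brute_force_mle` maximizes the scalar product between a candidate ranking and the sample mean over all permutations. Both names and the toy data are assumptions made for this example.

```python
from itertools import permutations

def rank_vector(x):
    """Rank vector of x (1 = smallest coordinate); assumes no ties."""
    return tuple(sum(1 for xj in x if xj <= xi) for xi in x)

def sample_mean(sample):
    """Coordinate-wise mean of a list of rankings."""
    n = len(sample[0])
    return [sum(r[i] for r in sample) / len(sample) for i in range(n)]

def brute_force_mle(sample):
    """Maximize the scalar product rho . mean over all permutations."""
    mean = sample_mean(sample)
    n = len(mean)
    return max(permutations(range(1, n + 1)),
               key=lambda rho: sum(a * b for a, b in zip(rho, mean)))

sample = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 2, 4, 3)]
print(rank_vector(sample_mean(sample)), brute_force_mle(sample))
```

Sorting the mean vector costs far less than the exhaustive search, which is only included here as an independent check.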
In order to clarify ideas, consider three samples of $n$-dimensional rankings, of the same size, with sample means $\bar{\mathbf{r}}^{(1)}$, $\bar{\mathbf{r}}^{(2)}$ and $\bar{\mathbf{r}}^{(3)}$, respectively. Suppose that all three sample mean vectors have the same rank vector and, consequently, lead to the same MLE of the consensus ranking, $\hat{\boldsymbol{\rho}}$. Notice, however, that while the rank vector transformation is formally correct, ensuring that the MLE is a proper ranking, it entails a loss of information. Intuitively, looking at the three sample means, one would attach greater uncertainty to the MLE obtained from the third sample, even if this information is lost when looking at the corresponding rank vector, as it is known that a point estimate alone does not provide an uncertainty assessment. Definition 2 can be generalized to the case of a vector $\mathbf{x}$ with ties, i.e., $x_i = x_j$ for some $i \neq j$, by letting $R(\mathbf{x})$ be any ranking whose elements satisfy the same ordering relation as those of $\mathbf{x}$. However, the rank vector in this case would not be unique and so a unique $\hat{\boldsymbol{\rho}}$ would not exist. It would be possible, for instance, to obtain a fourth sample with sample mean $\bar{\mathbf{r}} = \frac{n+1}{2}\mathbf{1}_n$. Then, any ranking in $\mathcal{P}_n$ would be a rank vector for $\bar{\mathbf{r}}$. This corresponds to a flat likelihood function, for which there is no unique MLE. Intuitively, any permutation has the same likelihood of being the consensus ranking. This idea is related to the spread or variability of the sample, which in turn is associated with the concentration of the MMS distribution around $\boldsymbol{\rho}$, in other words, with the precision parameter $\theta$. This is, again, not surprising, considering the relation of the MMS with the Gaussian distribution, highlighted above.
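It follows that, even if in most applications $\theta$ is considered a nuisance parameter and the main interest is in the estimation of $\boldsymbol{\rho}$, it is nevertheless necessary to estimate $\theta$ in order to get an idea of the reliability of the estimate $\hat{\boldsymbol{\rho}}$. An MLE for $\theta$ can be found as the solution to the equation

$$
\mathbb{E}_{\theta}\big[d(\mathbf{r}, \mathbf{e})\big] = \frac{1}{N} \sum_{j=1}^{N} d(\mathbf{r}_j, \hat{\boldsymbol{\rho}}),
$$

where the expectation is taken under the MMS with precision $\theta$, assuming a unique $\hat{\boldsymbol{\rho}}$ exists. This can be done numerically, for instance, via a Newton-Raphson algorithm (see e.g. Marden, 1995), but the calculations may be cumbersome for all but small values of $n$.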
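For very small $n$ the likelihood equation for the precision can be solved by brute force. The sketch below matches the expected Spearman distance under the model, computed by enumeration, to the average observed distance from the estimated consensus. It deliberately swaps in plain bisection for Newton-Raphson, since the expected distance decreases monotonically in the precision; the function names, the search interval `[0, 20]` and the toy sample are assumptions for illustration.

```python
from itertools import permutations

from math import exp

def expected_distance(n, theta):
    """Expected Spearman distance from the consensus, by enumeration."""
    ident = tuple(range(1, n + 1))
    num = den = 0.0
    for r in permutations(ident):
        d = sum((a - b) ** 2 for a, b in zip(r, ident))
        w = exp(-theta * d)
        num += d * w
        den += w
    return num / den

def theta_mle(sample, rho_hat, lo=0.0, hi=20.0, tol=1e-10):
    """Bisection on the likelihood equation; assumes the root lies in [lo, hi].
    If every ranking equals rho_hat, the target is 0 and no finite MLE exists."""
    n = len(rho_hat)
    target = sum(sum((a - b) ** 2 for a, b in zip(r, rho_hat))
                 for r in sample) / len(sample)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_distance(n, mid) > target:
            lo = mid  # model still too diffuse: increase the precision
        else:
            hi = mid
    return (lo + hi) / 2

sample = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 2, 4, 3), (1, 2, 3, 4)]
print(theta_mle(sample, rho_hat=(1, 2, 3, 4)))
```

The enumeration over all $n!$ rankings inside `expected_distance` is what makes this approach impractical beyond small $n$.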
The Bayesian paradigm is then a natural solution for making inference on the MMS, not only for quantifying uncertainty, but also for including prior information into the statistical analysis. In the remainder, we propose and study an informative prior density, specifically tailored to the MMS, building on the Bayesian Mallows model for ranking data of Vitelli et al. (2018).
3 An informative prior
This section is devoted to the proposal of a prior distribution for the parameters of the MMS. In Section 3.1 we analyze the simpler case in which the precision parameter $\theta$ is assumed known. Then, in Section 3.2, we give an intuition on how to deal with the more general and realistic case of unknown $\theta$.
3.1 Known precision parameter
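For fixed $\theta$, the likelihood (3) can be simplified as

$$
P(\mathbf{r}_1, \dots, \mathbf{r}_N \mid \boldsymbol{\rho}, \theta) \propto \exp\{-\theta N\, \|\boldsymbol{\rho} - \bar{\mathbf{r}}\|^2\}, \qquad \boldsymbol{\rho} \in \mathcal{P}_n.
$$

Therefore, a conjugate prior for $\boldsymbol{\rho}$ is given by

$$
\pi(\boldsymbol{\rho} \mid \boldsymbol{\rho}_0, \theta_0) = \frac{\exp\{-\theta_0\, \|\boldsymbol{\rho} - \boldsymbol{\rho}_0\|^2\}}{Z(\boldsymbol{\rho}_0, \theta_0)}, \qquad \boldsymbol{\rho} \in \mathcal{P}_n, \tag{5}
$$

with $\boldsymbol{\rho}_0 \in \mathcal{PH}_n$ and $\theta_0 \geq 0$.
We call this the Extended Mallows Model with Spearman distance (EMMS) and write $\boldsymbol{\rho} \sim \text{EMMS}(\boldsymbol{\rho}_0, \theta_0)$. The two parameters $\theta_0$ and $\boldsymbol{\rho}_0$ can be interpreted as precision and location parameters, respectively, analogous to those of the MMS. In particular, $\theta_0$ determines the concentration of the distribution around $\boldsymbol{\rho}_0$, with $\theta_0 = 0$ corresponding to a uniform prior on $\boldsymbol{\rho}$, while larger values reflect stronger prior belief on $\boldsymbol{\rho}_0$. Notice, however, that the modal parameter $\boldsymbol{\rho}_0$ cannot be interpreted, in general, as a consensus ranking, except when $\boldsymbol{\rho}_0 \in \mathcal{P}_n$, in which case the EMMS simply reduces to a MMS. Recall that Mallows models have the limitation that all rankings which are equidistant (in terms of the distance in (1)) from the consensus ranking have the same probability. For the MMS, this implies in particular that it is not possible to freely assign different masses to different rankings at the same Spearman’s distance to the consensus ranking. By allowing the modal parameter $\boldsymbol{\rho}_0$ to take any value in the permutohedron $\mathcal{PH}_n$, that is, to be any convex combination of the elements of $\mathcal{P}_n$, such structure can be broken, allowing for a more flexible distribution of the mass. In fact, the prior (5) assigns equal mass to all permutations that lie at the same $\ell_2$-norm from $\boldsymbol{\rho}_0$, and greater mass is given to permutations closest to $\boldsymbol{\rho}_0$. For instance, consider the EMMS centered at the barycenter of the permutohedron, that is, with $\boldsymbol{\rho}_0 = \frac{n+1}{2}\mathbf{1}_n$. This results in a uniform distribution on rankings for any value of the precision parameter $\theta_0$. Small deviations from uniformity can be achieved by letting $\boldsymbol{\rho}_0 = \frac{n+1}{2}\mathbf{1}_n + \boldsymbol{\varepsilon}$ and $\|\boldsymbol{\varepsilon}\|$ be small. The direction of the vector $\boldsymbol{\varepsilon}$ in $\mathbb{R}^n$ determines the rankings for which the mass increases and those for which it decreases.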
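The behavior of the EMMS prior for different locations can be inspected by enumeration when the number of items is small. The sketch below assumes a prior mass proportional to $\exp\{-\theta_0 \|\boldsymbol{\rho} - \boldsymbol{\rho}_0\|^2\}$ over all rankings $\boldsymbol{\rho}$, with $\boldsymbol{\rho}_0$ a point of the permutohedron and $\theta_0$ the prior precision; `emms_pmf` and the two example locations are hypothetical choices made for illustration.

```python
from itertools import permutations
from math import exp

def emms_pmf(rho0, theta0):
    """Prior mass of every ranking under an EMMS located at rho0."""
    n = len(rho0)
    perms = list(permutations(range(1, n + 1)))
    weights = [exp(-theta0 * sum((a - b) ** 2 for a, b in zip(r, rho0)))
               for r in perms]
    z = sum(weights)
    return {r: w / z for r, w in zip(perms, weights)}

# Barycenter of the permutohedron: every ranking is equally far from it.
flat = emms_pmf((2.5, 2.5, 2.5, 2.5), theta0=3.0)
# Location halfway between (1,2,3,4) and (2,1,3,4): two modal rankings.
tied = emms_pmf((1.5, 1.5, 3.0, 4.0), theta0=1.0)
print(set(round(p, 12) for p in flat.values()))
print(sorted(tied, key=tied.get, reverse=True)[:2])
```

The second example shows how a non-integer location lets two rankings share the largest prior mass, something a Mallows prior centered at a single permutation cannot express.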
The case described above, where $\boldsymbol{\rho}_0 = \frac{n+1}{2}\mathbf{1}_n$, is therefore equivalent to assigning to $\boldsymbol{\rho}$ the uniform prior on $\mathcal{P}_n$, $\pi(\boldsymbol{\rho}) = 1/n!$, as in Fligner and Verducci (1990) and Vitelli et al. (2018). As Berger et al. (2012) discuss, the natural reference prior for a discrete parameter taking values on a finite support is usually the uniform prior on the parameter space. However, the authors show that the uniform prior, in some cases, may not be objective, which may be a property sought by the analyst. The following result, which holds for the MM with any right-invariant distance, shows that the uniform prior on all rankings, which corresponds to prior (5) with (i) $\theta_0 = 0$ and $\boldsymbol{\rho}_0 \in \mathcal{PH}_n$, or with (ii) $\theta_0 > 0$ and $\boldsymbol{\rho}_0 = \frac{n+1}{2}\mathbf{1}_n$, is formally the objective prior in the sense of Villa and Walker (2015). The authors propose that an objective prior for a discrete parameter, in this case the mode of the MMS, should assign to each possible value a mass proportional to the minimum Kullback-Leibler divergence between the model with parameter value $\boldsymbol{\rho}$ and the model with any other parameter value, say $\boldsymbol{\rho}'$. In this way, prior mass is associated to the “worth” of each possible parameter value, defined as the loss in information that would derive from assigning prior probability zero to such value, if it were true.
Proposition 2.
For any right-invariant distance $d$, the objective prior in the sense of Villa and Walker (2015) for the MM is the uniform prior on the space of permutations, $\mathcal{P}_n$.
Note that $\boldsymbol{\rho}_0$ may not be a permutation, so that the partition function in (5),

$$
Z(\boldsymbol{\rho}_0, \theta_0) = \sum_{\boldsymbol{\rho} \in \mathcal{P}_n} \exp\{-\theta_0\, \|\boldsymbol{\rho} - \boldsymbol{\rho}_0\|^2\},
$$

in general depends on $\boldsymbol{\rho}_0$. This implies that (5) is known up to a normalization constant. However, in the following sections we show that this drawback can be overcome in practice. Moreover, the fact that $\boldsymbol{\rho}_0$ does not need to be a permutation is very convenient for the elicitation problem. The cases where (i) we are interested in including partial information about the consensus ranking, or when (ii) multiple experts’ opinions are available, are naturally handled by this prior. For instance, imagine a simple example of case (i), where we have some information only on the top-$k$ ranked items. We can define $\boldsymbol{\rho}_0$ by fixing the top-$k$ items’ ranks to $1, \dots, k$, and giving the same uniform value to the remaining bottom-$(n-k)$ items’ ranks. An example of case (ii) is to assume that two (or more) experts believe, a priori, in different modal rankings, say $\boldsymbol{\rho}^{(1)}$ and $\boldsymbol{\rho}^{(2)}$. An analyst wishing to express an equally strong prior on such two rankings may simply use the prior (5) with $\boldsymbol{\rho}_0 = (\boldsymbol{\rho}^{(1)} + \boldsymbol{\rho}^{(2)})/2$.
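The two elicitation scenarios can be turned into a prior location with a few lines of code. The sketch below is an illustration under stated assumptions: in the partial-information case the items outside the top group all receive the average of the unassigned ranks (one natural choice that keeps the location inside the permutohedron, assumed here rather than prescribed by the text), and in the multiple-expert case the location is a convex combination of the experts' modal rankings. The helper names `top_k_location` and `pool_experts` are hypothetical.

```python
def top_k_location(n, top_items):
    """Prior location from partial information. `top_items` lists 0-based item
    indices from most to least preferred; every other item gets the average
    of the unassigned ranks k+1, ..., n (an assumed choice)."""
    k = len(top_items)
    shared = (k + 1 + n) / 2
    rho0 = [shared] * n
    for rank, item in enumerate(top_items, start=1):
        rho0[item] = float(rank)
    return rho0

def pool_experts(rankings, weights=None):
    """Convex combination of the experts' modal rankings
    (equal weights unless `weights`, summing to one, is given)."""
    m = len(rankings)
    weights = weights or [1.0 / m] * m
    n = len(rankings[0])
    return [sum(w * r[i] for w, r in zip(weights, rankings)) for i in range(n)]

# Five items, only the two most preferred are known: item 2 first, item 0 second.
print(top_k_location(5, [2, 0]))
# Two experts with different modal rankings, equally weighted.
print(pool_experts([(1, 2, 3, 4), (2, 1, 3, 4)]))
```

Unequal `weights` would express that the analyst trusts one expert more than another, while the overall strength of the pooled belief is still governed by the prior precision.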
A possible reparametrization of (5) is obtained by letting , which allows to further understand the role of the precision parameter for prior elicitation. One may imagine eliciting prior information from an expert who believes that the consensus ranking is given by some and expresses a degree of uncertainty in this belief through some precision, say . Within scenario (ii), one may imagine that the analyst, having encountered a number of experts who coincide with this view, wishes to summarize this aggregated information by increasing the prior precision. This is achieved by letting , thus expressing that individual prior belief regarding is reinforced by various experts. In the limit, infinite prior precision may correspond to a single expert with extremely strong prior belief () or to an extremely large number of experts with some prior belief (). Intuitively, one may imagine a situation in which the analyst aggregates prior opinions from many experts by calculating and as convex and linear combinations, respectively, of the individual parameters elicited from each expert , and considering the number of experts who agree on both. In order to understand how this could be done, one may consider how the information passes on from the prior to the posterior.
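The posterior density for $\boldsymbol{\rho}$ is given by

$$
\pi(\boldsymbol{\rho} \mid \mathbf{r}_1, \dots, \mathbf{r}_N) \propto \exp\{-\theta_0\, \|\boldsymbol{\rho} - \boldsymbol{\rho}_0\|^2 - \theta N\, \|\boldsymbol{\rho} - \bar{\mathbf{r}}\|^2\}, \qquad \boldsymbol{\rho} \in \mathcal{P}_n. \tag{7}
$$

The first thing we observe is that the proposed prior is indeed conjugate. In other words, if $\mathbf{r}_1, \dots, \mathbf{r}_N \mid \boldsymbol{\rho} \sim \text{MMS}(\boldsymbol{\rho}, \theta)$ and $\boldsymbol{\rho} \sim \text{EMMS}(\boldsymbol{\rho}_0, \theta_0)$, then it holds that $\boldsymbol{\rho} \mid \mathbf{r}_1, \dots, \mathbf{r}_N \sim \text{EMMS}(\boldsymbol{\rho}_N, \theta_N)$, with updated parameters:

$$
\theta_N = \theta_0 + N\theta, \qquad \boldsymbol{\rho}_N = \frac{\theta_0\, \boldsymbol{\rho}_0 + N\theta\, \bar{\mathbf{r}}}{\theta_0 + N\theta}.
$$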
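The conjugate update can be verified numerically for a small number of items. The sketch below assumes the update rule $\theta_N = \theta_0 + N\theta$ and $\boldsymbol{\rho}_N = (\theta_0 \boldsymbol{\rho}_0 + N\theta\, \bar{\mathbf{r}})/\theta_N$, where $\bar{\mathbf{r}}$ is the sample mean, and compares the EMMS with these parameters to the posterior obtained by multiplying the prior and likelihood kernels ranking by ranking. All function names and the toy data are illustrative assumptions.

```python
from itertools import permutations
from math import exp

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def posterior_update(rho0, theta0, sample, theta):
    """Return (rho_N, theta_N); requires theta0 + N * theta > 0."""
    n, N = len(rho0), len(sample)
    mean = [sum(r[i] for r in sample) / N for i in range(n)]
    theta_N = theta0 + N * theta
    rho_N = [(theta0 * rho0[i] + N * theta * mean[i]) / theta_N
             for i in range(n)]
    return rho_N, theta_N

def normalize(log_weights):
    """Exponentiate and normalize a dict of log-weights."""
    m = max(log_weights.values())
    w = {k: exp(v - m) for k, v in log_weights.items()}
    z = sum(w.values())
    return {k: v / z for k, v in w.items()}

def brute_force_posterior(rho0, theta0, sample, theta):
    """Prior kernel times likelihood kernel, normalized over all rankings."""
    n = len(rho0)
    return normalize({rho: -theta0 * sq_dist(rho, rho0)
                           - theta * sum(sq_dist(r, rho) for r in sample)
                      for rho in permutations(range(1, n + 1))})

def emms(rho_N, theta_N):
    n = len(rho_N)
    return normalize({rho: -theta_N * sq_dist(rho, rho_N)
                      for rho in permutations(range(1, n + 1))})

sample = [(1, 2, 3, 4), (2, 1, 3, 4), (1, 2, 4, 3)]
rho_N, theta_N = posterior_update((1.5, 1.5, 3.0, 4.0), 1.0, sample, 0.5)
exact = brute_force_posterior((1.5, 1.5, 3.0, 4.0), 1.0, sample, 0.5)
print(rho_N, theta_N, max(abs(exact[k] - p) for k, p in emms(rho_N, theta_N).items()))
```

The agreement relies on the squared norm of a ranking being constant over all permutations, so that only the cross terms with the prior location and the sample mean matter.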
In particular, the prior hyper-parameters elicited under scenario (ii) can be interpreted as the posterior parameters obtained by an expert who observes a finite number of instances of $\mathbf{r} \sim \text{MMS}(\boldsymbol{\rho}, \theta)$, where $\theta$ is the known precision of the true data distribution according to the MMS. Clearly, this interpretation excludes any value of $\theta_0$ that is not a multiple of $\theta$. Nevertheless, such an exercise helps to provide an intuition of the role of the prior hyperparameters. Notice that, since any $\boldsymbol{\rho}_0 \in \mathcal{PH}_n$ can be expressed as a convex combination of rankings in $\mathcal{P}_n$, the prior mode elicited in scenario (ii) can always be interpreted as arising from multiple (possibly infinite) experts, the calculation of the individually elicited parameters being an exercise in linear algebra. The prior precision parameters for which this interpretation is valid, however, are limited. Therefore, in the case in which $\theta$ is assumed known, an interesting case arises by setting $\theta_0 = N_0\theta$, i.e. $N_0 = \theta_0/\theta$. Here $N_0$ can be interpreted as an a priori sample size, representing the amount of information on which an expert bases the prior belief about the central tendency of $\mathbf{r}$. In this sense, the posterior consensus parameter $\boldsymbol{\rho}_N$ can be viewed as a weighted average of the prior hyper-parameter $\boldsymbol{\rho}_0$ and the observed mean value $\bar{\mathbf{r}}$, with weights proportional to the corresponding sample sizes, $N_0$ and $N$. For any finite prior precision $\theta_0$, as the sample size $N$ increases, the posterior accumulates mass around $\boldsymbol{\rho}_N$, which approaches the sample mean, $\bar{\mathbf{r}}$. Some insights into the role of the prior hyper-parameters can be obtained by considering limiting situations. An infinite prior precision would express a priori certainty, by accumulating all the prior mass on $\boldsymbol{\rho}_0$, a choice that would make sense only for $\boldsymbol{\rho}_0 \in \mathcal{P}_n$. The posterior would maintain the infinite precision thus accumulating mass on $\boldsymbol{\rho}_0$. In such hypothetical case, learning would be possible only for infinite sample sizes, with
[TABLE]
Notice that, by Proposition 1, the maximum a posteriori (MAP) of $\boldsymbol{\rho}$ is unique and given by $R(\boldsymbol{\rho}_N)$ provided that all the coordinates of the vector $\boldsymbol{\rho}_N$ take different values. Furthermore, $\theta_N \to \infty$ as the sample size $N$ grows, thus increasing posterior precision.
The prior (5) has a shape which is analogous to the one discussed earlier by Gupta and Damien (2002). In their paper, however, the authors propose the use of the Hausdorff distance among subsets (conjugacy classes) of $\mathcal{P}_n$, in place of the squared $\ell_2$-norm between a ranking and the location parameter of the prior (5), which is an element of the permutation polytope. This difference implies that the proposal of Gupta and Damien (2002) assigns equal probability to all permutations within a conjugacy class. In particular, all rankings in the modal conjugacy class of the prior are assigned the same mass, even if information may not be available on all such rankings. Furthermore, two permutations in the same class are not necessarily close with respect to the distance used in the MM, which is a crucial element of the model specification. Our proposal, instead, is specifically tailored to the MMS, and gives the possibility to choose whether to give maximum prior weight to a unique permutation, or to more than one. At the same time, we note in the following Theorem that the results in Gupta and Damien (2002, Section 3.3) can be extended to our prior (5).
Theorem 1.
Let , and . Then:
a) for each , and given , the ranking will have higher posterior probability than if and only if
[TABLE]
where
b) for each , if , will have lower posterior probability than .
c) for each , if , will have higher posterior probability than if and only if .
d) for each , if and , then will have higher posterior probability than .
The theorem, analogous to Gupta and Damien’s Theorem 2 and corollaries, gives an intuition on the behavior of the posterior density, by providing a relationship between and , that determines which rankings receive the highest posterior probabilities. In Section 5 we illustrate, through simulated data, some of the consequences of this theorem on the inference.
3.2 Unknown precision parameter
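When $\theta$ is unknown, the Bayesian paradigm requires a prior on the pair of parameters $(\boldsymbol{\rho}, \theta)$. We here suggest choosing a joint prior of the form $\pi(\boldsymbol{\rho}, \theta) = \pi(\boldsymbol{\rho} \mid \theta)\, \pi(\theta)$, where $\pi(\boldsymbol{\rho} \mid \theta)$ is the EMMS of eq. (5). Notice that the particular case of prior independence, $\pi(\boldsymbol{\rho}, \theta) = \pi(\boldsymbol{\rho})\, \pi(\theta)$, is achieved in practice by choosing the parameter $\theta_0$ independent of $\theta$. Regarding the choice of $\pi(\theta)$, some proposals are present in the literature, for instance an exponential density (Vitelli et al., 2018), or the conjugate prior of Fligner and Verducci (1990).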
Alternatively, we suggest the use of the Jeffreys prior for $\theta$, which, in some specific cases, has a closed form and may be an interesting alternative when no information on $\theta$ is available a priori. The following proposition holds for any MM with a right-invariant distance, and in particular for the MMS.
Proposition 3.
The Jeffreys prior for $\theta$ in a MM with right-invariant distance $d$ takes the form

$$
\pi_J(\theta) \propto \sqrt{\mathbb{V}_{\theta}\big[d(\mathbf{r}, \mathbf{e})\big]},
$$

where $\mathbb{V}_{\theta}$ denotes the variance with respect to the MM in (1), which depends on $\theta$.
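For a small number of items the Jeffreys prior kernel can be evaluated by enumeration. The sketch below assumes that the kernel is the square root of the variance of the distance from the consensus under the model with precision $\theta$, specialized here to Spearman's distance and, by right-invariance, to the identity permutation as consensus. The function name `jeffreys_kernel` is a hypothetical helper, and the result is unnormalized.

```python
from itertools import permutations
from math import exp, sqrt

def jeffreys_kernel(n, theta):
    """Unnormalized Jeffreys prior at theta: standard deviation of the
    Spearman distance from the consensus, computed over all n! rankings."""
    ident = tuple(range(1, n + 1))
    z = m1 = m2 = 0.0
    for r in permutations(ident):
        d = sum((a - b) ** 2 for a, b in zip(r, ident))
        w = exp(-theta * d)
        z += w
        m1 += d * w
        m2 += d * d * w
    mean = m1 / z
    # Guard against tiny negative values caused by floating point rounding.
    return sqrt(max(m2 / z - mean * mean, 0.0))

for theta in (0.0, 0.5, 1.0, 2.0):
    print(theta, jeffreys_kernel(3, theta))
```

Evaluating the kernel on a grid, as in the loop above, is enough to plot its shape or to use it inside a Metropolis-Hastings step where only ratios are needed.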
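The posterior density of the model parameters, with the conjugate prior given in eq. (5) with $\theta_0$ possibly independent of $\theta$, and with one of the three prior distributions for $\theta$ mentioned above is

$$
\pi(\boldsymbol{\rho}, \theta \mid \mathbf{r}_1, \dots, \mathbf{r}_N) \propto \frac{\pi(\theta)}{Z(\theta)^N\, Z(\boldsymbol{\rho}_0, \theta_0)} \exp\Big\{-\theta_0\, \|\boldsymbol{\rho} - \boldsymbol{\rho}_0\|^2 - \theta \sum_{j=1}^{N} \|\mathbf{r}_j - \boldsymbol{\rho}\|^2\Big\}. \tag{12}
$$

Eq. (12) can be easily evaluated in two cases: when (a) $Z(\boldsymbol{\rho}_0, \theta_0)$ does not depend on $\theta$, that is, when $\theta_0$ is independent of $\theta$, or when (b) $\theta_0 = N_0\theta$, and $n$ is small enough, so that $Z(\boldsymbol{\rho}_0, \theta_0)$ can be calculated exactly, for given prior hyperparameters $\boldsymbol{\rho}_0$ and $N_0$.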
The more problematic case (c) when and is too large for computing exactly, can be handled by using as prior density for , , so that the posterior density (12) can be written as
[TABLE]
In the next section we sketch the algorithms developed for inference on the MMS in both cases of known and unknown , within the situations (a), (b) and (c) described above.
4 Posterior simulation
Notice that, when is known, the posterior (7) is known up to a normalization constant. Posterior simulation is straightforward in this case and it basically reduces to a visualization problem because of the complexity of the space of permutations. In this simple case, we employ a Metropolis-Hastings (MH) Markov Chain Monte Carlo (MCMC) scheme for the update of . We propose according to the Leap and Shift distribution of Vitelli et al. (2018), which is an asymmetric proposal centered around the current value of . We then accept with probability , where
[TABLE]
where, , and denotes the transition probability of the Leap and Shift distribution. Notice that, for the sake of simplicity, we are considering the case , but the results follow trivially for other parametrizations.
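The update of the consensus ranking for known precision can be sketched as follows. This is an illustration under stated assumptions, not the scheme used in the paper: the Leap and Shift proposal is replaced by a simpler symmetric swap of two randomly chosen positions (so the proposal ratio equals one), and the unnormalized posterior is assumed to be proportional to exp(-(alpha / n) [sum_j d(r_j, rho) + eta0 ||rho - rho0||^2]). The names `rho0`, `eta0` and `mh_rho` are hypothetical.

```python
import math
import random


def log_post(rho, data, rho0, alpha, eta0):
    """Unnormalized log posterior of rho for known alpha (assumed prior form)."""
    n = len(rho)
    lik = sum(sum((a - b) ** 2 for a, b in zip(r, rho)) for r in data)
    prior = sum((a - b) ** 2 for a, b in zip(rho, rho0))
    return -(alpha / n) * (lik + eta0 * prior)


def mh_rho(data, rho0, alpha, eta0, n_iter, seed=0):
    """Metropolis sampler over rankings with a symmetric swap proposal."""
    rng = random.Random(seed)
    n = len(rho0)
    rho = list(range(1, n + 1))  # start from the identity ranking
    current = log_post(rho, data, rho0, alpha, eta0)
    draws = []
    for _ in range(n_iter):
        i, j = rng.sample(range(n), 2)  # two distinct positions
        prop = rho[:]
        prop[i], prop[j] = prop[j], prop[i]
        cand = log_post(prop, data, rho0, alpha, eta0)
        # Symmetric proposal: acceptance depends only on the posterior ratio
        if rng.random() < math.exp(min(0.0, cand - current)):
            rho, current = prop, cand
        draws.append(tuple(rho))
    return draws
```

The normalizing constant of the likelihood does not appear because, for a right-invariant distance, it does not depend on the consensus ranking and cancels in the ratio. With an asymmetric proposal such as Leap and Shift, the ratio of transition probabilities would have to be included in the acceptance step.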
When is not known, we implement a Metropolis within Gibbs scheme for posterior simulation. However, further considerations must be made for the different cases outlined in Section 3.2. First, we consider case (a), where is assumed a priori independent of , which amounts to eliciting of eq. (5) independently of ; in cases (b) and (c) the precision parameter of the EMMS takes the form .
In (a) is simply constant, so it creates no additional difficulty. Exact posterior inference can be performed when , that is, when we can compute exactly (see Vitelli et al., 2018). When posterior inference cannot be performed exactly, but we can exploit the efficient scheme of Vitelli et al. (2018, Algorithm 1), which targets an approximation of the posterior density. Only the acceptance probabilities of the two M-H steps are different here, due to the introduction of the non-uniform prior density on .
In cases (b) and (c) we have the additional issue of dealing with , for which different solutions are possible. In (b), that is for small , we can compute on a grid of values; whenever its evaluation is required within the M-H step for the update of , an approximate value can be obtained via interpolation for values of not in the grid. In this case we therefore have two steps. First, we update conditional on from the posterior full conditional (see eq. (12)),
[TABLE]
This is done as described above, that is, we propose according to the Leap and Shift distribution and accept it with probability , where is given in eq. (14), with equal to the current value of . Second, we update conditional on . Note that the posterior full conditional for is
[TABLE]
where . The proposal is sampled from a log-normal density centered on the current value of with a variance tuned in order to obtain a desired acceptance rate.
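A single update of a positive parameter with a log-normal random-walk proposal can be sketched as below. This is a generic illustration rather than the paper's code: "centered on the current value" is interpreted here as a log-normal whose median is the current value, and the full conditional is passed in as a user-supplied function `log_target` (a hypothetical name), since its exact form depends on the chosen prior. Because the proposal is asymmetric, the acceptance ratio includes the factor proposal / current.

```python
import math
import random


def update_alpha(alpha, log_target, sigma, rng):
    """One M-H step with a log-normal proposal whose median is the current value.

    log_target: function returning the unnormalized log full conditional.
    sigma: proposal standard deviation on the log scale (tuning parameter).
    Returns the new value and a flag indicating acceptance.
    """
    proposal = alpha * math.exp(sigma * rng.gauss(0.0, 1.0))
    # Hastings correction for the log-normal proposal: q(alpha | proposal) / q(proposal | alpha)
    # equals proposal / alpha, added here on the log scale
    log_ratio = (
        log_target(proposal) - log_target(alpha)
        + math.log(proposal) - math.log(alpha)
    )
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return alpha, False
```

In practice `sigma` would be adjusted during pilot runs until the fraction of accepted proposals reaches the desired rate.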
In (c), that is, for large values of , only the proposed prior for , and therefore its posterior full conditional, changes and it is given by
[TABLE]
Posterior simulation is therefore identical to that of case (b), with the obvious difference in the acceptance probability for .
5 Illustrative analyses
5.1 Simulation study
In this section we illustrate the effect of the prior on the posterior via a small simulated dataset. A small is used so that all possible permutations can be listed.
We generate a sample of rankings from from the MMS with given true parameters and . We then set the prior consensus to , and perform inference on the model in different settings corresponding to increasing prior sample size for the prior parametrization , and the Jeffreys prior for . The observed sample mean vector is , which leads to . We report in Table 1 the estimated posterior probability (EPP) of each of the rankings in . Notice that is the ranking with smallest value of (row highlighted in light-gray and with bold characters). Studying this table, we can verify that Theorem 1 holds. For instance, solving eq. (10) with and , we obtain that has a higher posterior probability than if and only if , which the empirical results confirm. Also, all rankings with have lower posterior probabilities than . Furthermore, if , then has a higher posterior probability than if .
We can also notice the following sensitivity behavior of the posterior probabilities: with increasing , the rankings that are closer to (in terms of Spearman’s distance, or equivalently a smaller ) have increasing posterior probabilities, while those that are farthest have decreasing posterior probabilities, even if their distance to the data is not large. An example of this can be seen in the row corresponding to , which has and and for which increasing from 0 to 20 has the effect of decreasing the posterior probability from 0.169 to 0.012. The posterior means of in the six settings were 0.068, 0.074, 0.065, 0.06, 0.057, 0.055, while
5.2 idea dataset
For illustrative purposes, in this section we use the benchmark dataset idea (see e.g. Fligner and Verducci, 1990; Gupta and Damien, 2002). The data, collected by the Graduate Record Examination (GRE) Board, consist of a sample of rankings, each of them generated by a college student who was asked to rank words according to their strength of association with the target word ‘idea’. The five words are ‘thought’ (A), ‘play’ (B), ‘theory’ (C), ‘dream’ (D), and ‘attention’ (E). Our aim is to show the effect of our informative prior for on inference. Since is very small in this example, we can use the exact framework for posterior simulation outlined in Section 4, and choose the Jeffreys prior for the parameter , thus reflecting our lack of prior knowledge. In this example, we assume there is reason to believe that =(A,D,C,B,E) is the true ordering of association of the five words. We therefore choose the corresponding ranking vector as the prior mode. The choice of , interpreted as an equivalent sample size, reflects our confidence in , so we consider different settings, corresponding to increasing values of . Inference is carried out via MCMC posterior simulation, using a sample size of iterations, after a burn-in of , and the results are shown in Table 2. The orderings corresponding to the most frequently observed rankings in the dataset and their empirical frequencies or sample proportions are shown in columns 1 and 2 respectively, along with their estimated posterior probabilities (EPP) in the different settings (columns 3 to 8). In column 9 we report the Spearman distance between each of the top observed rankings and the prior mode (that is, ).
Recall that our prior (5) assigns equal mass to all rankings at the same Spearman distance from . This behavior has some analogies with the prior of Gupta and Damien (2002). However, while there is always a unique ranking at Spearman’s distance 0 from , each conjugacy class contains more than one ranking, all of which are assigned the same mass by the prior of Gupta and Damien (2002), henceforth GD. As we show below, this difference has a relevant effect on the posterior inferences based on our prior (5), when compared to the results by GD.
From this table we can notice the following:
- the EPP of (A,D,C,B,E), which corresponds to the prior mode (row 4), increases consistently with ; when , it becomes the posterior modal ranking;
- the ordering (A,C,D,E,B), corresponding to (row 1), remains the ranking with largest EPP provided that the equivalent sample size is not too large, in other words, provided that the prior does not assign too much mass to ;
- the relative ordering of the seven rankings in terms of posterior probability depends on , changing for large values which imply strong prior information.
Comparing our results with the findings of GD (Table 3), we notice that:
1. the posterior distribution of GD places most of the mass (about 0.93) on the top 6 rankings, thus penalizing all other rankings in ;
2. the EPP of the prior modal ranking with ordering (A,D,C,B,E), obtained by GD does not increase with the concentration parameter (in their paper denoted by ), but rather decreases (from 0.019 when , to 0.0067 when ). This is not in line with the expected behavior of an informative prior.
Our posterior distributions, instead, are generally flatter and, importantly, do not show the contradictory behavior with respect to the concentration parameter exhibited by the results of GD and which is probably a consequence of the complex structure of the conjugacy classes of .
5.3 The prior elicitation problem in practice
In this section we exploit covariates to provide an intuition of how to introduce available information in the prior elicitation problem. For the illustration we use the sushi benchmark data of Kamishima (2003), which consists of full rankings of different kinds of sushi items given by respondents according to their personal preference. The data are available at http://www.kamishima.net/sushi/. This dataset is particularly interesting because it includes covariates of the sushi items. We can therefore use this additional information to build an informative prior over the consensus ranking.
We begin from the elicitation of the consensus ranking hyper-parameter of eq. (5). We believe that the following covariates of the sushi items (see Table 3) may have an impact on the personal preference of the respondents:
1. oil: the oiliness in taste (measured on a 0-4 continuous scale, where the smaller the value, the oilier the sushi item);
2. eat: how frequently the sushi item is eaten in sushi shops (measured on a 0-3 continuous scale, where high values correspond to frequently eaten items);
3. price: the normalized price of the item;
4. sell: the frequency with which the sushi item is sold (measured on a 0-1 continuous scale, where high values correspond to frequently sold items).
Therefore, we may include the information contained in these four covariates into the analysis of the ranking data, through the following subjective reasoning: the more oily the sushi item is, the more it is preferred; a sushi item which is frequently eaten is more likely to be preferred than one eaten less frequently; expensive items are preferred above cheaper ones; finally, items which are sold more are better known and hence preferred. Clearly, the above assumptions are subjective, and someone else may decide to include these covariates differently (for instance, the price may play the opposite role). Table 4 shows the rank vectors obtained from these criteria by applying the rank transformation of Definition 2 to the covariate vectors of Table 3. Notice that the transformation does not result in a proper ranking for the sell variable (column 5): sea eel, tuna, sea urchin and salmon roe have the same covariate value (0.88 in Table 3), which results in a tied rank (3.5 in Table 4). Analogously, shrimp and egg have the same value (0.84), resulting in a tied rank (6.5). Nonetheless, the transformed vector for the covariate eat is indeed an element of the permutation polytope , and could therefore be a valid choice for the hyper-parameter . Another interesting feature of Table 4 is that the rankings induced by the different covariates (columns) are not equal but partially agree. A possible choice for the prior consensus hyper-parameter, which takes into account these four different rankings, is to set it equal to the average of the rankings induced by the four covariates, that is, . Alternatively, one may consider the corresponding rank vector, .
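The construction of a prior consensus from covariates can be sketched in a few lines. The code below is an illustration under assumptions, not a reproduction of Definition 2: it assumes that rank 1 denotes the most preferred item and that tied covariate values share the average of the ranks they span, which is consistent with the tied ranks 3.5 and 6.5 reported above. The function names are hypothetical.

```python
def average_ranks(values, higher_is_better=True):
    """Rank transformation: rank 1 = most preferred, ties share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next sorted value equals the current one
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1, ..., end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mean_rank_vector(rank_vectors):
    """Component-wise average of several rank vectors of equal length."""
    m = len(rank_vectors)
    return [sum(col) / m for col in zip(*rank_vectors)]
```

For a covariate such as oil, where small values are preferred under the reasoning above, one would call `average_ranks(values, higher_is_better=False)`. Applying `average_ranks(..., higher_is_better=False)` to the output of `mean_rank_vector` gives the rank vector corresponding to the averaged consensus.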
The elicitation of the precision parameter, , requires a more qualitative reasoning. Considering the parametrization , we may decide to fix , since the consensus hyper-parameter comes from the average of four rankings, which may be interpreted as the opinions of four experts. At the same time, we may choose a relatively large value of , for instance (which is considered large, given the scale of the problem), thus reflecting confidence in , given the partial agreement of the four rankings used to construct the consensus hyper-parameter.
6 Conclusion
In this paper we have proposed an informative prior distribution for the consensus ranking of the Mallows model with Spearman’s distance. The peculiarity of the proposed prior is that it is a location-scale family for which the location parameter does not need to be a ranking. This is convenient for the elicitation problem, since the prior can naturally handle the case when it is difficult to indicate a full ranking which is a priori the most likely. For instance, when the total number of items in the application considered is very large, it may be unlikely that an expert is able to elicit a prior ranking over all the items. Instead, it may be possible to put some prior information only over the top-ranked items. This is often the case in genomics applications, where thousands of genes are considered in the statistical analysis, but only a few of them are known to be related to some disease. Another case which is naturally handled by our prior is when multiple competing rankings are available prior to the analysis, and we are interested in expressing equally strong prior beliefs on them.
A limitation, discussed in Section 4, arises from the intractability of the normalizing constant of (5) when the location parameter is not itself a ranking. Possible directions for future work include exploring tractable approximations for this quantity, perhaps in the spirit of Mukherjee (2016). In general, more efficient methods for posterior simulation might be developed, but these developments fall outside the scope of the present work. We do hope, however, that some of the ideas presented here can shed light on the potential and limitations of the Mallows model with Spearman’s distance, and encourage further developments in constructing more flexible priors.
All the simulation algorithms are implemented in R with the Rcpp package, and will soon be integrated into the BayesMallows R package (Sørensen et al., 2018).
Appendix
Before stating the proof of Proposition 1, let us introduce the formal notion of right-invariance which will prove useful in the proof.
Definition 3.
Right-invariant distance (Diaconis, 1988). A distance function is right-invariant, if for all . With we denote the composition function of two permutations , which is defined as .
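Right-invariance can be checked numerically for small numbers of items. The sketch below assumes the standard formulation, in which a distance d is right-invariant if d(rho1, rho2) = d(rho1 ∘ eta, rho2 ∘ eta) for all permutations, with the composition taken as (rho ∘ eta)(i) = rho(eta(i)). Rankings are stored as 1-based tuples, and all function names are hypothetical.

```python
from itertools import permutations


def compose(rho, sigma):
    """Composition (rho o sigma)(i) = rho(sigma(i)) for 1-based rankings."""
    return tuple(rho[s - 1] for s in sigma)


def spearman(r1, r2):
    """Spearman distance: sum of squared rank differences."""
    return sum((a - b) ** 2 for a, b in zip(r1, r2))


def is_right_invariant(dist, n):
    """Exhaustively check d(r1, r2) == d(r1 o s, r2 o s) over all triples of permutations."""
    perms = list(permutations(range(1, n + 1)))
    return all(
        dist(r1, r2) == dist(compose(r1, s), compose(r2, s))
        for r1 in perms
        for r2 in perms
        for s in perms
    )
```

Composing both arguments with the same permutation only reorders the terms of the sum in the Spearman distance, which is why the check succeeds; a distance with position-dependent weights would generally fail it.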
Proof of Proposition 1.
The following two identities hold by right-invariance (see Definition 3):
[TABLE]
Eq. (18) implies that is such that (by Lemma 2 in Hüllermeier et al. (2008)).
By (19), it follows that , is such that , for each .
Now, notice that if and only if , for each . This proves that .
∎
Proof of Proposition 2.
It is sufficient to prove that
[TABLE]
is independent of , for any fixed value of , where
[TABLE]
where the last equality follows from the right-invariance of the Spearman distance. By the same argument, it follows that
[TABLE]
does not depend on , thus ending the proof. ∎
Proof of Theorem 1.
The proof is analogous to that of Gupta and Damien (2002, Theorem 2).
∎
Proof of Proposition 3.
The Jeffreys prior for a parameter is defined as , where is the Fisher information function of the statistical model
[TABLE]
Recall that for the MM it holds
[TABLE]
where . Let us simplify the notation here and set . Then,
[TABLE]
is independent of . Notice also that
[TABLE]
and
[TABLE]
Then,
[TABLE]
The result then follows from the previous equations:
[TABLE]
∎
Acknowledgement
The authors would like to thank Sonia Petrone, Elja Arjas and Arnoldo Frigessi for their insightful comments.
References
- Berger, J. O., Bernardo, J. M., and Sun, D. (2012). “Objective priors for discrete parameter spaces.” Journal of the American Statistical Association, 107(498): 636–648.
- Crispino, M. (2017). “Bayesian Learning of Ranking data.” Unpublished doctoral dissertation, Bocconi University, Milan, Italy. URL https://drive.google.com/file/d/1Sb2UlJBIcjAlEYcdQ5WiShNKVn5QYgXt/view
- Dabic, M. and Hatzinger, R. (2009). “Zielgruppenadaequate Ablaeufe in Konfigurationssystemen - Eine empirische Studie im Automobilmarkt - Partial Rankings.” In Hatzinger, R., Dittrich, R., and Salzberger, T., editors, Praeferenzanalyse mit R: Anwendungen aus Marketing, Behavioural Finance und Human Resource Management.
- Dawid, A. (1997). “Comments on ‘non-informative priors do not exist’.” Journal of Statistical Planning and Inference, 65(1): 178–180.
- Diaconis, P. (1988). Group representations in probability and statistics, volume 11 of Lecture Notes - Monograph Series. Hayward, CA, USA: Institute of Mathematical Statistics.
- Fligner, M. A. and Verducci, J. S. (1986). “Distance based Ranking Models.” Journal of the Royal Statistical Society B, 48(3): 359–369.
- Fligner, M. A. and Verducci, J. S. (1990). “Posterior probabilities for a consensus ordering.” Psychometrika, 55(1): 53–63.
- Gormley, I. C. and Murphy, T. B. (2008). “A mixture of experts model for rank data with applications in election studies.” The Annals of Applied Statistics, 2(4): 1452–1477.
