A Bayesian Mallows approach to non-transitive pair comparison data: how human are sounds?
Marta Crispino, Elja Arjas, Valeria Vitelli, Natasha Barrett, and Arnoldo Frigessi

TL;DR
This paper introduces a Bayesian Mallows model to analyze non-transitive pairwise comparison data, revealing insights into human perception of sounds as human or non-human, with applications in sound design and audio industry.
Contribution
It develops a Bayesian Mallows approach with a mixture extension to handle non-transitive preferences and heterogeneity in listener data.
Findings
Model effectively captures preference inconsistencies.
Identifies factors influencing perception of human sounds.
Provides a framework for designing more human-like computer sounds.
| S1: | Pitch, volume, grain duration and spatial variations at their most dynamic ranges. |
| S2: | Spatial motion occurring in front. |
| S3: | Played in mono over one speaker direct-front. |
| S4: | Partial flattening of 3-D spatial variation leaving the main direction changes. |
| S5: | Total flattening of 3-D spatial variation leaving the main direction changes. |
| S6: | Removal of volume variation. |
| S7: | Removal of pitch variation. |
| S8: | Removal of pitch and volume variation. |
| S9: | Partial flattening of 3-D spatial variation; removal of pitch and volume variation. |
| S10: | Total flattening of 3-D spatial variation; removal of pitch and volume variation. |
| S11: | S1 played 30% slower. |
| S12: | S1 played 50% slower. |
| | % | | % | | % | | % | | % | | % |
| 20 | 88 | 0.05 | 92.5 | 2 | 82.5 | 15 | 85 | 50 | 65 | 100 | 44 |
| 30 | 83 | 0.1 | 87.5 | 4 | 95 | 25 | 97.5 | 100 | 58 | 150 | 46 |
| 60 | 83 | 0.15 | 75 | 6 | 92.5 | 35 | 100 | 150 | 60 | 300 | 45 |
| 120 | 75 | 0.2 | 72.5 | | | | | | | | |
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| G1 | S8 | S10 | S5 | S9 | S6 | S4 | S7 | S11 | S12 | S2 | S3 | S1 |
| G2 | S5 | S4 | S12 | S2 | S11 | S3 | S6 | S1 | S7 | S9 | S8 | S10 |
| G3 | S1 | S7 | S11 | S2 | S4 | S12 | S6 | S3 | S5 | S9 | S8 | S10 |
Marta Crispino (1), Elja Arjas (2,3), Valeria Vitelli (3), Natasha Barrett (4) and Arnoldo Frigessi (3,5)
(1) Inria Grenoble, (2) University of Helsinki, (3) University of Oslo, (4) Norwegian State Academy for Music in Oslo, (5) Oslo University Hospital
The first author was a PhD student at Bocconi University, Milan, Italy, and visiting OCBE, University of Oslo, during the project.
Abstract
We are interested in learning how listeners perceive sounds as having human origins. An experiment was performed with a series of electronically synthesized sounds, and listeners were asked to compare them in pairs. We propose a Bayesian probabilistic method to learn individual preferences from non-transitive pairwise comparison data, as happens when one (or more) individual preferences in the data contradicts what is implied by the others. We build a Bayesian Mallows model in order to handle non-transitive data, with a latent layer of uncertainty which captures the generation of preference misreporting. We then develop a mixture extension of the Mallows model, able to learn individual preferences in a heterogeneous population. The results of our analysis of the musicology experiment are of interest to electroacoustic composers and sound designers, and to the audio industry in general, whose aim is to understand how computer generated sounds can be produced in order to sound more human.
Keywords: Non-transitive pairwise comparisons, ranking, Mallows model, Bayesian preference learning, recommender systems, musicology, acousmatic experiment.
1 Introduction
We consider experiments involving a set of assessors (experts, judges, users) who express preferences about a set of items. Each assessor is shown a predetermined sequence of pairs of items, one pair at a time, and chooses from every pair the item that she prefers. Preference is here interpreted in a broad sense as an order relation. The assessors act independently, and typically different sets of pairs are presented to different assessors, varying also their order. An assessor does not have the possibility to go back and check the answers she gave previously, let alone change any answer later. Under such circumstances, often some answers given by an individual assessor, when considered afterwards jointly, do not satisfy logical transitivity of preferences (Tversky, 1969), that is, they may contain a pattern of the form $A \succ B$ and $B \succ C$ but $C \succ A$. On the other hand, neither are the answers given by an individual assessor independent, because conscientious assessors will generally try to follow some logic in their expressed preferences.
Pair comparisons are preferred to ratings or full rankings of a set of items when there are many items to be compared, or when the relative differences between them are small: in both these cases assessors are unlikely to be able to inspect and compare all items jointly in order to perform a full ranking. A pairwise comparison test is then often preferred, and sometimes it is the only possible experimental procedure (Agresti, 1996).
In this paper we consider pairwise comparison data coming from an experiment where each assessor was asked to hear a series of two different abstract sounds, and to tell which one was perceived as more human. Each subject only performed a limited number of comparisons, leading to sparse data, where not all pairs of sounds were compared by each assessor. The results of this test are relevant for musicologists, composers and sound designers, whose aim is to understand how human performance expression can be communicated through spatial audio, leading to computer generated sounds appearing more life-like. Although every sound can be regarded as ‘spatial’ in that sound waves propagate through space, the term ‘spatial audio’ is here used to describe the way sound captures the physical movement in 3-D needed to produce it. The cohort of listeners who took part in the experiment had varying backgrounds, ranging from musicologists to non-specialized university students. Therefore we expected listeners to cluster into groups, sharing different opinions about the degree of human causation behind sounds. In addition to the grouping of the listeners around a shared consensus ranking of the “humanness of sounds”, we were interested in studying the association between individual listeners’ rankings and their own musical experience or musical background. This application is described in detail in Section 2.
Non-transitivity can arise for many reasons, for example assessors’ inattentiveness, uncertainty in their preferences, and actual confusion, even when one specific criterion for ranking is used. These situations are so common that most pairwise comparison data are in fact non-transitive at the individual level, thus creating a need for methods able to predict individual preferences from pairwise choices that lack logical transitivity, and only involve a very limited number of pair comparisons. Notice that the kind of non-transitivity that we consider in this paper regards only the individual level preferences. A different type of non-transitivity arises when aggregating preferences across assessors, as under Condorcet (Marquis of Condorcet, 1785) or Borda (de Borda, 1781) voting rules.
We propose a new method for the analysis of pairwise comparison data that may contain non-transitive individual pairwise comparisons. The method is based on the classical Mallows rank model (Mallows, 1957) and builds on its recent extension introduced by Vitelli et al. (2018). Given pairwise data provided by a collection of individual assessors, the method outputs Monte Carlo samples from the joint posterior distribution for the individual full rankings of all items and an assumed shared consensus ranking between them. In Section 3.3, this hierarchical structure is further relaxed by introducing a mixture model allowing for clustering of the assessors.
The key ingredient, compared to Vitelli et al. (2018), is to add to the model hierarchy one more layer of latent variables, accounting for the possibility that the assessors can make mistakes. By a mistake we mean that the order from an assigned pairwise comparison is reported in a way which is not consistent with the assessor’s own ‘true’ full ranking, whose existence is assumed in the model. The rationale behind our model can be explained as follows. In an ideal situation an assessor would be fully conscious of her preference ordering of all items, and then simply report the consequent ordering each time a pairwise comparison is requested. More realistically, however, she becomes aware of her potential ranking of the items only progressively in time as more pairs are presented to her for comparison. Then it becomes increasingly more difficult to remember exactly what items had been shown earlier and how they had been ordered, with the consequence that reporting results from pairwise comparisons that do not respect transitivity becomes more and more likely. Under such circumstances, particularly when the number of items is larger, the pair comparison data will almost inevitably contain some answers which do not satisfy the requirement of logical transitivity with the rest. Technical errors, such as mistakes in typing, or concentration errors may also occur.
To describe such imperfections in the assessments, we introduce two alternative variants (described in Sections 3.1 and 3.2) of the probabilistic model for mistakes:
1. The probability of making a mistake is constant, independent of the pairs being assessed, and independent of all other comparisons made by the same assessor.
2. The probability of making a mistake depends on the items being compared, and is higher for pairs which are more similar to each other.
The literature on inferential models for non-transitive pair data arising at the individual level is limited, and is discussed in Section 5. As far as we can see, the present paper is the only approach to non-transitive pair data suited to the setting where the individual hidden rankings are of interest, each assessor evaluates only a few pairs and never the same pair repeatedly, and a Bayesian approach is desired. One important feature of our Mallows model is the possibility to choose, for the considered specific application, an appropriate distance function. Some problems require a distance able to measure only the disorder in the given domain, while in others a distance more suited for learning preferences in a population would be preferred. In the former case, the Cayley distance (Cayley, 1849) would be a natural choice, while Kendall (Kendall, 1938) and footrule (Spearman, 1904) would have advantages in the latter. For instance, consider the two rankings $\bm{r}_1 = (1,2,3,4,5)$ and $\bm{r}_2 = (5,2,3,4,1)$, where the top and bottom elements of $\bm{r}_1$ are reversed in $\bm{r}_2$. The normalized Cayley distance between $\bm{r}_1$ and $\bm{r}_2$ is 0.25, while the normalized Kendall distance is 0.7. If $\bm{r}_1$ and $\bm{r}_2$ represent the rankings of two assessors of five movies, Kendall's distance may be more appropriate, as these rankings represent very different profiles: one of the two assessors likes most the movie that the other assessor likes least, and vice versa. However, $\bm{r}_1$ and $\bm{r}_2$ differ by a single translocation: if they represent genomes, we could consider these rankings as very similar and be more inclined to use the Cayley distance as metric in the Mallows model. For a detailed description of the distances mentioned and of their properties we refer to Diaconis (1988, Chapter 6).
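The two normalized distances quoted in this example can be checked numerically. The sketch below is only illustrative: it assumes the two rankings are (1,2,3,4,5) and (5,2,3,4,1), which is one instance consistent with the values 0.25 and 0.7, and the function names are hypothetical.

```python
from itertools import combinations


def kendall_distance(r1, r2):
    """Number of item pairs that the two rank vectors order differently."""
    return sum(
        (r1[i] - r1[k]) * (r2[i] - r2[k]) < 0
        for i, k in combinations(range(len(r1)), 2)
    )


def cayley_distance(r1, r2):
    """Minimum number of transpositions turning r1 into r2 (n minus number of cycles)."""
    n = len(r1)
    # Permutation mapping the rank of each item in r1 to its rank in r2.
    perm = {r1[i]: r2[i] for i in range(n)}
    seen, cycles = set(), 0
    for start in perm:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:
            seen.add(x)
            x = perm[x]
    return n - cycles


r1 = (1, 2, 3, 4, 5)
r2 = (5, 2, 3, 4, 1)  # top and bottom items of r1 reversed
n = len(r1)
# Normalize by the maximum attainable value of each distance.
norm_cayley = cayley_distance(r1, r2) / (n - 1)
norm_kendall = kendall_distance(r1, r2) / (n * (n - 1) // 2)
print(norm_cayley, norm_kendall)
```

The Cayley distance counts a single transposition out of at most n - 1 = 4, whereas the Kendall distance counts 7 discordant pairs out of 10, which is why the two metrics disagree so strongly on this pair of rankings.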
Our method provides the posterior distribution of the consensus ranking, as well as the posterior distribution of the latent individual rankings for each assessor. The consensus ranking can be seen as a model-based Bayesian aggregation of individual preferences of a group of assessors. It is analogous to the quantities which are usually of interest in the rank aggregation literature (Negahban, Oh and Shah, 2012; Dwork et al., 2001; Kenyon-Mathieu and Schudy, 2007; Rajkumar et al., 2015). The estimated posterior distributions of the individual rankings can be of great interest, for example, when performing personalized recommendations, or in studying how individual preferences change with assessor related characteristics.
This paper is organized as follows. In Section 2 we describe the application which motivated this study, and then in Section 3 we present our model for the statistical analysis of the consequent data. Numerical inference is based on a Markov Chain Monte Carlo algorithm, outlined in Section 4. Section 5 gives a short overview on other methods for the description and analysis of pairwise comparisons. Section 6 is devoted to simulations, while in Section 7 we apply our method to the sound data, showing that the model identifies meaningful clusters of listeners, with similar perception of electroacoustic sounds. Finally, in Section 8 we summarize the contributions of this paper.
2 Acousmatic music experiment
Acousmatic music is a type of electronic music composed for presentation using loudspeakers, as opposed to live or video recorded performance. The composer manipulates digitally recorded sounds, so that the cause of the sound, being a musical instrument or any other sound making system, remains hidden. Indeed, when sounds are played over loudspeakers there are no visual cues to help listeners understand how the sounds were made. On the other hand, when we hear the sound of musical instruments or sounds from our everyday environment, we are able to recognize their cause, since in visual music we obtain the information that indicates the sounding object, i.e. its causation. Since the advent of recording technology, abstract sounds (that is, sounds transformed with computer tools) have been used in much of the sound-world we experience over the Internet, TV and film.
The question of interest is related to the ability of listeners to identify the presence of human causation through the spatial behavior of abstract sounds. Spatial in this context describes the fact that the causation of sound happens as an action in 3-D space. The starting point for the experiment was a high-speed motion tracking recording of the physical movement used to produce one selected sound: a cellist bowing a down-bow chord. Features of this 3-D movement were successively subtracted, resulting in a series of 12 motion data-sets of varying proximity to the original. The motion data were then made audible by a process called parameter-mapping sonification (Grond and Berger, 2011), where parameters in the data are mapped to parameters controlling computer generated sound. The mapping rules are chosen to draw on our everyday perception of spatial motion, which involves not only absolute 3-D spatial location but in addition changes in volume, intensity and pitch, correlated with changes in proximity and speed. In other words, listeners heard the physical spatial motion through sonification, rather than hearing the sound that the motion created, which, in this instance, was the sound of the cello. Testing how listeners perceive a sound for which we lack a clear and commonly understood descriptive vocabulary is problematic. Therefore a pair comparison experiment is the most appropriate design.
2.1 Pair comparison experiment
The total number of stimuli was $n = 12$. Test stimulus 1 (S1) was designed to most clearly sonify all features of the data. Each of the other 11 test stimuli was sonified by modifying one or more features of the data. This involved removing pitch and volume variation, flattening directional changes in the motion, or slowing the overall motion speed (as summarized in Table 1).
Each of the 46 listeners involved in the experiment was exposed to 30 pairs of these sounds, which is ca. 45% of the 66 possible pairs of 12 stimuli. The pairs were chosen randomly, without repetitions, and independently for each assessor. The items in each pair were played in randomized order.
Listeners were then asked to indicate, for each pair, which of the two stimuli most evoked a sensation of human physical movement of any kind, and to follow their feelings rather than imagine watching a performance. The listeners were not told that the source motion stemmed from a cellist, nor were they asked to identify a specific human spatial movement. Each listener carried out the test sitting centrally to the loudspeaker array. Prior to the experiment, listeners were presented with a short training session of three sounds not used in the test sequence. When the experiment began, the pairs of sounds were played sequentially; listeners noted their answers on a chart, selecting the first or the second from each pair of unlabeled sounds, and were requested to always make a choice even if they found it difficult to decide. If needed, they could ask to hear a test pair for a second time. At the end, they were asked to complete two questionnaires, the aim of which was to assign a Musical Sophistication Index score (MSI) and a rating of Spatial Audio awareness (SAA) to all the listeners. The MSI used was the Ollen musical sophistication index (Ollen, 2006), which is an online survey that tests the validity of 29 indicators of musical sophistication. The SAA index consisted of five questions as indicators of how aware listeners were of spatial audio regardless of musical background. Such a test did not exist in the literature, and was custom designed for the experiment.
The choice to rely on a pairwise comparison experiment is crucially based on the listeners' lack of experience with abstract sounds. It is easier for the participants to compare two sounds, rather than to be exposed to several, which could create confusion. The experiment was indeed as difficult as expected: 37 listeners (80%) reported non-transitivities in their pair comparisons, and only 9 out of 46 listeners were able to stay consistent with themselves.
A complete description of the background, hypotheses, experimental setup, and discussion of results is given in Barrett and Crispino (2018).
3 Bayesian Mallows models for non-transitive pairwise comparisons
We consider the situation where assessors independently express their preferences between pairs of the items in . In many situations of practical interest the assessors do not decide on the set of pairs to be considered, which are instead assigned to them by an external authority. In this paper we decided not to model the way in which the pairs are chosen, and simply assume that each assessor receives a different subset of random pairs. Let be the set of pairwise preferences given by assessor , where is the order that assessor assigned to the pair . For example, if , it could be that , , meaning that item is preferred to item . Such data can be incomplete since not all items, nor pairs, are always handled by each assessor. We assume no ties in the data, that is, assessors are forced to express their preference for all pairs in the list assigned to them, and indifference is not permitted.
We denote a generic ranking by , where is the rank of item (the most preferred item has rank ), and is the space of -dimensional permutations. A widely used distance-based family of distributions for ranks is the Mallows model (Mallows, 1957; Diaconis, 1988). According to the Mallows model, the probability density of a given ranking , here denoted by is given by
[TABLE]
In (1), is the location parameter representing the shared consensus ranking, is the scale parameter measuring the concentration of the data around , and is a distance function between two dimensional permutations that satisfies right-invariance (Diaconis, 1988), i.e., , , where . Right-invariance is crucial since from this property it follows that the partition function of (1) does not depend on the location parameter, and can then be written as , where (see for example Mukherjee (2016)). When the distance function in (1) is chosen to be the Kendall, the Cayley, or the Hamming distance, the partition function of the Mallows model is available in closed form (Fligner and Verducci, 1986). For this reason, most of the work on the Mallows was limited to these distances (see, for example, Fligner and Verducci (1986), Lu and Boutilier (2014), Irurozki, Calvo and Lozano (2016, 2014)). The Mallows with other distance functions was less treated because of its computational complexity. Recently, Vitelli et al. (2018) gave a procedure to compute when the footrule and Spearman distances are used, either exactly (up to some moderate values of ), or approximated through an Importance Sampling technique. The authors set the original Mallows model in a Bayesian framework, also allowing for data in the form of transitive pairwise comparisons. We generalize their model (described in Section 4.2 of Vitelli et al. (2018)) to handle non-transitive pairwise comparisons.
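The consequence of right-invariance stated above, namely that the partition function does not depend on the consensus ranking, can be verified by brute-force enumeration for a small number of items. The following is a minimal sketch under explicit assumptions: the exponent of the Mallows density is taken to be -(alpha/n) d(r, rho), as in Vitelli et al. (2018), the footrule distance is used, and all function and variable names are hypothetical.

```python
from itertools import permutations
from math import exp


def footrule(r1, r2):
    """Footrule (L1) distance between two rank vectors."""
    return sum(abs(a - b) for a, b in zip(r1, r2))


def partition_function(alpha, rho, distance=footrule):
    """Sum of exp(-alpha/n * d(r, rho)) over all n! rankings (assumed scaling)."""
    n = len(rho)
    return sum(
        exp(-alpha / n * distance(r, rho))
        for r in permutations(range(1, n + 1))
    )


def mallows_pmf(r, alpha, rho, distance=footrule):
    """Probability of ranking r under the (assumed) Mallows parametrization."""
    n = len(rho)
    return exp(-alpha / n * distance(r, rho)) / partition_function(alpha, rho, distance)


alpha = 2.0
z_identity = partition_function(alpha, (1, 2, 3, 4))
z_other = partition_function(alpha, (3, 1, 4, 2))
# Both values agree up to floating-point rounding.
print(z_identity, z_other)
```

Enumeration costs n! evaluations, so it is only feasible for very small n; this is the reason why closed forms, exact recursions or importance sampling are needed in practice.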
The main assumption is that each assessor has a personal latent ranking, , distributed according to the Mallows density (1). We model the situation where each assessor , when announcing her preferences, matches the items under comparison with her latent ranking . Then, if the assessor is consistent with , the pairwise orderings in are induced by according to:
[TABLE]
where denotes the rank of item in . In this case the set of pairwise orderings contains only mutually compatible (a.k.a. transitive) preferences, since the preferences are induced from a complete ranking in that, by definition, is transitive. The transitive closure of a set of pairwise preferences, denoted by , is the smallest set that consistently extends the original preference set: it is defined as the set union of and all pairwise preferences that are not explicitly given but are induced from by transitivity. In this case it is possible to first compute , and second, to make inference on the posterior distribution of the Mallows parameters by integrating out all the rankings that are compatible with the transitive closure of the preference sets, denoted by ,
[TABLE]
This setting was described in Vitelli et al. (2018), Section 4.2.
If the assessor is not fully consistent with her latent ranking, the pairwise orderings in may not be mutually compatible. In such a case the transitive closure may not exist and the previous procedure cannot be followed. Therefore a model able to account for non-transitive patterns in the data is needed in this setting.
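To make the distinction between transitive and non-transitive preference sets concrete, the sketch below computes the closure of a set of pairwise preferences and flags the set as non-transitive when the closure would force an item to be preferred to itself. It is only an illustration: preferences are encoded as hypothetical tuples `(preferred, less_preferred)`, and the function names are not taken from the paper.

```python
def transitive_closure(prefs):
    """Return prefs together with every preference implied by transitivity."""
    closure = set(prefs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # (a preferred to b) and (b preferred to d) imply (a preferred to d).
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def is_transitive(prefs):
    """True if no item ends up preferred to itself, i.e. there is no cycle."""
    return all(a != b for a, b in transitive_closure(prefs))


consistent = {("A1", "A2"), ("A2", "A3")}
cyclic = {("A1", "A2"), ("A2", "A3"), ("A3", "A1")}
print(is_transitive(consistent), is_transitive(cyclic))
```

For the cyclic set the relational closure still exists as a set of pairs, but it contains pairs such as `("A1", "A1")` and is therefore not compatible with any full ranking, which is the sense in which the procedure for transitive data breaks down.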
We propose a probabilistic strategy based on the assumption that non-transitivities are due to mistakes in deriving the pair order from the latent ranking . The likelihood assumed for a set of preferences (analogous to the summation of eq. (3)) is
[TABLE]
where is the probability of ordering the pairs in as in (possibly generating non-transitivities), when the latent ranking for assessor is . It can therefore be seen as forming the error model in this context, which will be specified below. The joint posterior of the model parameters is then:
[TABLE]
In this paper we have assumed a gamma prior, , for , and the uniform prior on , , for .
This strategy is able to recover possible linear orderings close (in terms of some given distance) to the non-transitive sets of preferences. We developed two basic models for the probability of making a mistake: the Bernoulli model (BM) and the Logistic model (LM). BM assumes that non-transitivities arise from random mistakes while LM assumes that non-transitivities arise from mistakes due to difficulty in ordering similar items.
3.1 Bernoulli model (BM)
Assume that the pairwise comparisons given by an assessor are conditionally independent given her latent ranking ,
[TABLE]
We define here a function of a given comparison , and of a given ranking , , where is the index of the preferred item in the -th comparison of assessor , and is the index of the less preferred item. Thus if the preference order of contradicts with that implied by the ranking (in the sense of eq. (2)).
We then assume the following Bernoulli type model for modeling the probability that an assessor makes a mistake in a given pairwise comparison , that is the probability that she reverses the true latent preference implied by her latent ranking :
[TABLE]
Eq. (6) is then given by
[TABLE]
We assign to the truncated Beta distribution on the interval as prior, with given hyperparameters and : , conjugate to the Bernoulli model. We choose the truncated Beta mainly for identification purposes, but this choice is also motivated by the fact that we want to force the probability of making a mistake to be less than 0.5.
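Under conditional independence and a constant mistake probability, the probability of an assessor's reported comparisons given her latent ranking depends only on how many comparisons contradict that ranking. The sketch below illustrates this; it assumes the likelihood is the product of Bernoulli terms, that a ranking is stored as a hypothetical dictionary from item to rank (rank 1 is most preferred), and that each comparison is a tuple `(preferred, less_preferred)`. The name `theta` for the mistake probability is also an assumption.

```python
from math import log


def is_mistake(pref, ranking):
    """True if the reported preference contradicts the order implied by the ranking."""
    preferred, other = pref
    # The preferred item should have the smaller rank.
    return ranking[preferred] > ranking[other]


def count_mistakes(prefs, ranking):
    return sum(is_mistake(p, ranking) for p in prefs)


def bernoulli_log_likelihood(prefs, ranking, theta):
    """log of theta^(#mistakes) * (1 - theta)^(#consistent comparisons)."""
    mistakes = count_mistakes(prefs, ranking)
    return mistakes * log(theta) + (len(prefs) - mistakes) * log(1 - theta)


ranking = {"A1": 1, "A2": 2, "A3": 3}
prefs = [("A1", "A2"), ("A2", "A3"), ("A3", "A1")]  # last one contradicts the ranking
print(count_mistakes(prefs, ranking), bernoulli_log_likelihood(prefs, ranking, 0.1))
```

Because a non-transitive set contradicts every full ranking in at least one comparison, the likelihood stays positive for all rankings as long as the mistake probability lies strictly between 0 and 1.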
Let be a shorthand for , and for .
The posterior density of the model parameters, defined on the support , has the following form,
[TABLE]
We sample from the density of eq. (7) through an augmented sampling scheme, by first updating and given and , and then updating given and . The former step is performed by using the conditional density
[TABLE]
The second step is performed by using the density
[TABLE]
3.2 Logistic model (LM)
The idea behind the logistic model for mistakes is that an assessor is more likely to be confused (and consequently to make a mistake) if two items in a pair are more similar according to her latent rank vector . We assume the following logistic type model for the probability of making a mistake in a given pairwise comparison
[TABLE]
where is the distance of the ranks of the two items under comparison in , according to : if , then . We assume that and are a priori independent and distributed according to a gamma prior, , and . These choices are motivated by the fact that we want to model a negative dependence between the distance of the items and the probability of making a mistake (), and second, we want to force the probability of making a mistake when the items have ranks differing by 1 to be less than 0.5 (). The posterior density of the model, defined on the support , is then
[TABLE]
Analogously to eq. (7), we sample from the posterior of eq. (10) by first updating and , given and , i.e. from
[TABLE]
Secondly, we update , given and , from
[TABLE]
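The qualitative behavior of the logistic mistake model can be illustrated with a short sketch. The exact parametrization is not visible in this text, so the code assumes that the logit of the mistake probability equals -(beta0 + beta1 * d), with d the absolute rank difference of the two items minus one, so that d = 0 for adjacent ranks. This assumption is chosen to match the two stated requirements: positive `beta1` gives a mistake probability decreasing in the rank distance, and positive `beta0` keeps the probability below 0.5 when the ranks differ by 1. All names are hypothetical.

```python
from math import exp


def logistic_mistake_prob(rank_a, rank_b, beta0, beta1):
    """Mistake probability under the assumed form logit(p) = -(beta0 + beta1 * d)."""
    d = abs(rank_a - rank_b) - 1  # assumed distance: 0 for adjacent ranks
    return 1.0 / (1.0 + exp(beta0 + beta1 * d))


beta0, beta1 = 0.5, 0.8
# Mistake probabilities for items whose ranks differ by 1, 2, ..., 5.
probs = [logistic_mistake_prob(1, 1 + gap, beta0, beta1) for gap in range(1, 6)]
print(probs)
```

With these illustrative values the probability of reversing two adjacent items is about 0.38 and decays quickly as the items move apart in the latent ranking.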
3.3 Clustering non-transitive assessors
So far we assumed that a unique consensus ranking was shared by all assessors. Since in many situations this assumption is unrealistic, we allow for clustering the assessors into separate subsets, each sharing a consensus ranking of the items. We propose a mixture model generalization of the Bernoulli model of Section 3.1 to deal with heterogeneous assessors expressing pairwise preferences with mistakes.
Let be the class labels indicating how individual assessors are assigned to one of the clusters. Each cluster is described by a different pair of Mallows parameters , , so that the likelihood has the following form:
[TABLE]
where
[TABLE]
We assume that the cluster labels are a priori conditionally independent given the mixing parameters of the clusters, , and distributed according to a categorical distribution
[TABLE]
where , and . Finally we assign to the Dirichlet density with parameter . These choices lead to the following posterior density,
[TABLE]
Similarly to the homogeneous case, we then sample from the posterior of eq. (13) by first updating and , given and , and then updating , given and . The former step is done by using the conditional density
[TABLE]
The second step is performed by using the density
[TABLE]
Since label switching is not handled inside our MCMC, MCMC iterations are re-ordered after convergence has been achieved, by applying the algorithm of Stephens (2000).
4 MCMC for non-transitive pairwise preferences
We develop a Markov Chain Monte Carlo (MCMC) algorithm which, at convergence, samples from the posterior density of eq. (7). As explained in Section 3.1, the MCMC iterates between two main steps:
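1. Update the consensus ranking, the scale parameter and the mistake probability, given the individual rankings and the data (using eq. (8)):
   (a) Metropolis update of the consensus ranking;
   (b) Metropolis update of the scale parameter;
   (c) Gibbs update of the mistake probability.
2. Update the individual rankings, given the remaining parameters and the data (using eq. (9)).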
In step 1(a), we propose a new consensus ranking according to a symmetric proposal which is centered around the current consensus ranking .
Definition 1.
Swap proposal. At step , denote the current version of the consensus ordering vector by , which is the vector whose components are the items in ordered from best to worst according to , i.e., . Let . Sample uniformly an integer from and draw a random number uniformly in . The proposal has components
[TABLE]
and the proposed ranking is .
The parameter $L^{*}$ is the maximum allowed distance between the ranks of the swapped items, and is used for tuning the acceptance probability in the Metropolis-Hastings step. The transition probability of the Swap proposal is symmetric, and given by $q(\bm{\rho}^{p}\rightarrow\bm{\rho}^{t})=\frac{1}{L^{*}}\sum_{l=1}^{L^{*}}\frac{1}{n-l}\,\mathbb{1}(|\bm{\rho}^{p}-\bm{\rho}^{t}|=2l)$. The proposed ranking $\bm{\rho}^{p}$ is then accepted with probability , where
[TABLE]
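The Swap proposal can be sketched in a few lines. The code is a reading of Definition 1 that is consistent with the transition probability quoted above, not a transcription of the authors' implementation: it assumes that the gap `l` is drawn uniformly from 1 to `l_star`, that the position `u` is drawn uniformly from 1 to n - l, and that the items at positions `u` and `u + l` of the ordering vector are exchanged. Items are assumed to be labeled 1 to n, and all names are hypothetical.

```python
import random


def swap_proposal(ordering, l_star, rng=random):
    """Swap two items whose positions in the ordering differ by at most l_star."""
    n = len(ordering)
    l = rng.randint(1, l_star)  # gap between the two swapped positions
    u = rng.randint(1, n - l)   # 1-based position of the first swapped item
    proposal = list(ordering)
    proposal[u - 1], proposal[u + l - 1] = proposal[u + l - 1], proposal[u - 1]
    return proposal, l


def ordering_to_ranking(ordering):
    """Rank of item k is its 1-based position in the ordering vector."""
    ranking = [0] * len(ordering)
    for position, item in enumerate(ordering, start=1):
        ranking[item - 1] = position
    return ranking


rng = random.Random(0)
current = [3, 1, 4, 2, 5]  # items listed from best to worst
proposed, gap = swap_proposal(current, l_star=2, rng=rng)
r_cur, r_prop = ordering_to_ranking(current), ordering_to_ranking(proposed)
# The two rank vectors differ in exactly two entries, each by `gap`.
l1_change = sum(abs(a - b) for a, b in zip(r_cur, r_prop))
print(proposed, gap, l1_change)
```

Since exactly two items exchange ranks that differ by `l`, the L1 difference between the current and proposed rank vectors equals 2l, which matches the indicator in the transition probability. Small values of `l_star` give local moves with higher acceptance rates; larger values explore faster at the price of more rejections.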
In step 1(b) we propose from a log-normal density and accept it with probability , where
[TABLE]
This acceptance probability takes into account the asymmetric transition probability of the chain, that results from the log-normal proposal. The partition function can be computed exactly or approximated by the importance sampling scheme proposed by Vitelli et al. (2018), depending on the distance function chosen and on the number of items considered.
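The effect of the asymmetric log-normal proposal on the acceptance probability can be sketched generically. For a proposal that is log-normal around the logarithm of the current value, the ratio of reverse to forward proposal densities reduces to proposed value over current value, which is the Hastings correction used below. The code is a generic Metropolis-Hastings step for a positive parameter, with a hypothetical `log_target` standing in for the conditional log-posterior of the scale parameter; it is not the authors' implementation.

```python
import math
import random


def log_accept_ratio(log_target, current, proposal):
    """Log Metropolis-Hastings ratio for a log-normal random-walk proposal."""
    hastings = math.log(proposal) - math.log(current)  # q(cur | prop) / q(prop | cur)
    return log_target(proposal) - log_target(current) + hastings


def lognormal_mh_step(current, log_target, sigma, rng=random):
    """One update of a positive parameter; sigma is the proposal scale on the log axis."""
    proposal = math.exp(rng.gauss(math.log(current), sigma))
    log_ratio = log_accept_ratio(log_target, current, proposal)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return current


# Illustrative target: an unnormalized Gamma(shape=2, rate=1) log-density.
def log_gamma_target(a):
    return math.log(a) - a


rng = random.Random(42)
alpha, samples = 1.0, []
for _ in range(20000):
    alpha = lognormal_mh_step(alpha, log_gamma_target, sigma=0.5, rng=rng)
    samples.append(alpha)
print(sum(samples) / len(samples))  # should be close to the target mean of 2
```

Proposing on the log scale keeps the parameter positive automatically, so no proposals are wasted on invalid values.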
In step 1(c) we sample from the beta distribution, truncated to the interval , with updated hyper-parameters,
[TABLE]
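The Gibbs step for the mistake probability can be sketched as a draw from a truncated beta distribution. The update assumed here is the standard conjugate one: the first hyperparameter is increased by the total number of mistakes and the second by the number of comparisons consistent with the latent rankings. The hyperparameter names `kappa1` and `kappa2`, the truncation point 0.5, and the simple rejection sampler are assumptions made for illustration.

```python
import random


def sample_truncated_beta(a, b, upper=0.5, rng=random):
    """Rejection sampler for a Beta(a, b) distribution restricted to [0, upper)."""
    while True:
        x = rng.betavariate(a, b)
        if x < upper:
            return x


def gibbs_update_theta(n_mistakes, n_comparisons, kappa1, kappa2, rng=random):
    """Draw the mistake probability from its (assumed) conjugate full conditional."""
    return sample_truncated_beta(
        kappa1 + n_mistakes,
        kappa2 + n_comparisons - n_mistakes,
        rng=rng,
    )


rng = random.Random(1)
# Illustrative counts: 3 mistakes among 30 comparisons, uniform Beta(1, 1) prior.
draws = [gibbs_update_theta(3, 30, 1.0, 1.0, rng) for _ in range(200)]
print(min(draws), max(draws))
```

Rejection sampling is adequate when most of the beta mass lies below the truncation point, as is typical when mistakes are rare; otherwise an inverse-CDF sampler would be more efficient.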
Step 2 is a Metropolis-Hastings for the individual rankings. Here we exploit the fact that, when fixing all other parameters and the data , are conditionally independent, and that each only depends on the corresponding data . We thus sample a proposed individual ranking from the Swap proposal, separately for each . The Swap proposal is here advantageous because it perturbs locally not only the current individual ranking , but also the function .
Remark. The Swap proposal always gives a proposed individual ranking . However, it may happen that , .
This is important for what concerns the acceptance probability of $\bm{r}^{p}_{j}$. If , , the acceptance probability depends only on the ratio of the Mallows likelihoods of and , and is equal to , where
[TABLE]
If for some , the acceptance probability depends also on the mistake model, and is equal to where
[TABLE]
Example. To illustrate this step of the algorithm, suppose that an assessor expresses the following set of preferences,
[TABLE]
This set contains the non-transitive pattern . For the illustration, suppose that the current value of the individual ranking vector is , which corresponds to the ordering vector , and for which . If we sample the proposal , this gives , , and . However, if we sample , then and also since, according to the sampled , the preference is reversed.
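The bookkeeping in this example, namely how many reported preferences are contradicted by a given individual ranking, can be sketched as follows. The preference set and rankings used here are hypothetical (they are not the ones of the example above), and `count_mistakes` is an illustrative helper name. A preference is stored as a pair (preferred item, other item), and a ranking maps each item to its rank, with rank 1 the most preferred.

```python
def count_mistakes(preferences, ranking):
    """Number of reported preferences that are reversed under `ranking`.

    preferences : list of (preferred, other) pairs
    ranking     : dict item -> rank, where rank 1 is the most preferred
    """
    return sum(1 for preferred, other in preferences
               if ranking[preferred] > ranking[other])

# Hypothetical non-transitive set: A over B, B over C, but C over A.
prefs = [("A", "B"), ("B", "C"), ("C", "A")]

current = {"A": 1, "B": 2, "C": 3}      # contradicts only (C, A)
swapped = {"A": 1, "C": 2, "B": 3}      # B and C swapped: contradicts (B, C) and (C, A)

mistakes_current = count_mistakes(prefs, current)
mistakes_swapped = count_mistakes(prefs, swapped)
```

With a cyclic pattern every ranking contradicts at least one reported preference, so a swap can change the number of implied mistakes, and this is when the mistake model enters the acceptance probability.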
Appropriate convergence of the MCMC must in practice be checked by inspecting the trace plots of the parameters, and by monitoring for example the integrated autocorrelation. In Supplement A we explain in detail how the algorithm is adapted to the case of the logistic mistake model, and to the mixture extension.
5 Other approaches to pairwise preference data
In the classical Bradley-Terry model (BT) for pair comparisons (Bradley and Terry, 1952) the probability that item is preferred to item is expressed as the ratio
[TABLE]
where , is a vector of item-specific consensus ratings shared by all assessors, forming a linear scale of score parameters. It follows that the odds for against are given by . In addition, it is assumed that all pairwise comparisons are conditionally independent given . Therefore, the likelihood expression of the BT model corresponding to data consisting of several pair comparisons is the product, across all considered pairs, of terms of the form (17). For this reason, all pairwise data, even when they may have come from a number of individual assessors, are effectively merged when performing inference on .
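To make the product structure of the BT likelihood concrete, here is a minimal sketch. The score symbol `mu` and the function names are assumptions introduced for the illustration; the scores are taken to be positive, and each comparison is stored as a pair (preferred item, other item).

```python
import math

def bt_prob(mu_i, mu_j):
    """BT probability that the item with score mu_i is preferred to the one with mu_j."""
    return mu_i / (mu_i + mu_j)

def bt_log_likelihood(mu, comparisons):
    """Log-likelihood of independent pair comparisons under the BT model.

    mu          : dict item -> positive score
    comparisons : list of (preferred, other) pairs, pooled over all assessors
    """
    return sum(math.log(bt_prob(mu[w], mu[l])) for w, l in comparisons)
```

Note that the function takes no assessor labels: as remarked above, comparisons from different assessors enter the likelihood in exactly the same way.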
The work of Bradley and Terry was preceded by two important earlier papers, by Thurstone (1927) and Zermelo (1929). Thurstone considered a preference data context similar to that of BT, but his work was based on a Gaussian error model. Zermelo (1929), in contrast, proposed exactly the same model as Bradley and Terry, but it was presented as a statistical model for the results from a chess tournament, without the presence of individual assessors. After these pioneering works, several extensions of the basic BT model have been presented, mostly in the econometric and psychometric literature. Often these papers apply the logarithmic transformation of the parameters, with the effect that the probabilities (17) get the familiar logistic form. The logit of the odds for against is then equal to the contrast between the corresponding logarithmic scores. Extensions to regression models that account for the influence of item specific covariates on the comparison results are then readily available; for more comments on this, see below.
Data generated from the BT model are often not transitive, and this is the case particularly when some contrasts are close to 0. In situations in which the actual data come from a number of individual assessors, as was the case in our musicology experiment, it is a natural idea to try to account in the modeling separately for the two sources that may have created non-transitivity in the combined data: on the one hand, the differences in the assessment profiles of the assessors, and on the other, possible lack of transitivity in the pairwise comparisons coming from each individual assessor. This distinction was made fully explicit in the structure of our BM and BL models of Sections 3.1 and 3.2.
As an alternative to our approach, an anonymous referee suggested a hierarchical two-layer structure based on the BT model. In that suggestion, data coming from an individual assessor would be described by a BT model, but with score parameters specific to each assessor . On the lower level of model hierarchy, the referee suggested that, for each item , the score parameters for different assessors would be sampled independently from a Gaussian distribution centered at a common value .
We developed such a model, which we call HBT (with H for hierarchical), in work currently in progress (Crispino and Frigessi, 2018). There, we discuss (i) the suitability of the HBT model for data in the form of repeated pairwise comparisons performed by each assessor, and (ii) the poor performance of HBT compared to our Bayesian Mallows approach when data are such that each assessor only performs a limited number of comparisons without repetitions, so that not all pairs of items are compared by every assessor. Such a small incomplete example, with no intransitivities, is considered in Liu et al. (2018), where further differences between the HBT and the Bayesian Mallows model are discussed. One important reason for the difference in the case of incomplete and sparse data is that often they do not satisfy the strong connection condition (Ford, 1957). This condition is fulfilled if, for any partition of all items into two non-empty sets, both subsets contain at least one item that was preferred to some item in the other set by at least one assessor, see Yan (2016). If this condition is not satisfied, the maximum likelihood estimator does not exist and the posterior inferences based on the HBT model will be highly sensitive to the specification of the prior and will require corresponding sensitivity analyses.
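The strong connection condition can be checked directly on the pooled data. The sketch below, with hypothetical function names, builds the directed graph that has an edge from one item to another whenever the first was preferred to the second at least once; the partition condition stated above is then equivalent to this graph being strongly connected, which is tested here by one forward and one backward graph search.

```python
def is_strongly_connected(items, comparisons):
    """Check the strong connection condition on pooled pair comparisons.

    items       : list of item labels
    comparisons : list of (preferred, other) pairs from all assessors
    """
    beats = {i: set() for i in items}
    loses_to = {i: set() for i in items}
    for winner, loser in comparisons:
        beats[winner].add(loser)
        loses_to[loser].add(winner)

    def reachable(start, adjacency):
        # Iterative depth-first search from `start`.
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    start = items[0]
    return (len(reachable(start, beats)) == len(items)
            and len(reachable(start, loses_to)) == len(items))
```

With sparse designs, in which each assessor sees only a few pairs, a single item that never loses (or never wins) is enough to make the check fail.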
The BT model was represented and fitted as a log-linear model (Dittrich, Hatzinger and Katzenbeisser, 1998, 2002). In these works, the authors introduced assessor specific covariates into their framework, and extended it to the case of dependent pair comparisons. Building on Dittrich, Hatzinger and Katzenbeisser (1998), Francis, Dittrich and Hatzinger (2010) further introduced random effects for each assessor in order to account for residual heterogeneity that is not included in individual-specific covariates. However, their method is applied to pair preferences derived from full rankings. As such, the pair preferences are complete, that is, pairs are assessed by each assessor, and transitive. Their method cannot be used on our data where each assessor provides a limited number of pairwise preferences, typically smaller than the maximum , and is allowed to contradict herself, thus leading to non-transitive patterns in the data.
An interesting literature that builds on Thurstone’s model is the psychometric one (Bockenholt, 1988; Böckenholt, 2001; Böckenholt and Tsai, 2001; Böckenholt, 2006). In these works, the authors develop different generalizations of the Thurstone model, accounting for instance for multidimensional parameters, in case the items are evaluated with respect to multiple aspects, or introducing dependency among the observed pairs, by the inclusion of random effects in the model. However, inference is performed when the data include repeated comparisons for each assessor, and all items are compared by each assessor.
Pair comparison data were also recently handled within the Mallows ranking models by Lu and Boutilier (2014) and Vitelli et al. (2018). However, both papers deal only with transitive pairs, explicitly ruling out the non-transitive patterns in the data.
Volkovs and Zemel (2014) propose a score-based method, called the Multinomial Preference model (MPM), that generalizes the Plackett-Luce model (Luce, 1959; Plackett, 1975). The main difference between their MPM and our model is in the data generating mechanism, which Volkovs and Zemel (2014) assume to be a multinomial score-based process, while our method builds on distances between ranking vectors. In addition, their goal is to learn a single consensus ranking of the items, or multiple consensus rankings in case of clustering. Our method instead has the ability to further learn the individual latent rankings for each assessor.
Ding, Ishwar and Saligrama (2015) proposed a model for noisy pairwise ranking data, based on a mixed membership of Mallows models (M4), which generalizes the mixture model of Lu and Boutilier (2014). Their proposal is close to ours, in that both postulate the existence of latent linear orderings. However, Ding, Ishwar and Saligrama (2015) assume a basic separability property, which would be difficult to justify in contexts similar to our data application. Furthermore, they model the presence of non-transitive patterns in the data as arising because each assessor has multiple latent linear orderings, while we propose a mistake model. Moreover, they consider only the Kendall distance, while our model handles every right-invariant distance.
There is a large body of literature on mixture models for ranking data (e.g. Murphy and Martin, 2003; Gormley and Murphy, 2006; Caron, Teh and Murphy, 2014; Meilǎ and Chen, 2010; Jacques and Biernacki, 2014). Although related to our mixture model extension, all these papers are based on data in the form of rankings, and they do not directly apply, or extend, to non-transitive pairwise comparison data. Apart from this difference, the work of Jacques and Biernacki (2014), which presents a mixture extension of the model developed in Biernacki and Jacques (2013), has some similarities with ours. These authors assume the existence of a consensus ranking, and of individual rankings, and they model stochastic errors between these permutations of the items, to explain the variability of the individual rankings around the consensus. In this way, the pairwise comparisons are always complete and transitive, in contrast to our setting.
6 Simulation study
The aim of the experiments was to validate the method and to evaluate its performance in some test situations. The data were simulated from the Mallows model with the Bernoulli mistake model, varying parameters , , , , and , , while always using the footrule distance. The number of items was always kept below 50, thus enabling us to use the exact partition function (Vitelli et al., 2018). For a detailed description of the data generation, see Supplement B.
Various point estimates can be deduced from the posterior distribution of the consensus ranking, one being the maximum a posteriori (MAP). We prefer the following sequential construction, called the cumulative probability (CP) consensus ordering in Vitelli et al. (2018): first we select the item which has the largest marginal posterior probability of being ranked first; then, excluding this first choice, we select the item which has the largest marginal posterior probability of being ranked first or second among the remaining ones, and so on.
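The sequential CP construction can be sketched as follows, assuming the marginal posterior rank probabilities have already been estimated from the MCMC output. The data layout (a dictionary mapping each item to a list of rank probabilities) and the function name `cp_consensus` are assumptions made for the illustration; ties are broken alphabetically here, which is an arbitrary choice.

```python
def cp_consensus(marginals):
    """Cumulative probability (CP) consensus ordering.

    marginals[item][k] is the posterior probability that `item` has rank k + 1.
    At step k the remaining item with the largest probability of being
    ranked among the first k + 1 positions is selected.
    """
    remaining = set(marginals)
    ordering = []
    for k in range(len(marginals)):
        best = max(sorted(remaining),
                   key=lambda item: sum(marginals[item][:k + 1]))
        ordering.append(best)
        remaining.remove(best)
    return ordering

# Hypothetical marginal posterior rank probabilities for three items.
example = {
    "A": [0.6, 0.3, 0.1],
    "B": [0.3, 0.3, 0.4],
    "C": [0.1, 0.4, 0.5],
}
cp_order = cp_consensus(example)
```

In this toy example A is selected first (probability 0.6 of rank 1), then B, whose probability of being in the first two positions (0.6) exceeds that of C (0.5).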
In order to assess the performance of our methods, in Figure 1 we plot the posterior distribution of the normalized footrule distance between the estimated consensus and the true consensus, , for varying parameters , , (the average number of pairs given to each assessor) and , while keeping fixed .
As expected, the performance of the method improves as the number of assessors increases (Figure 1a), as the probability of making mistakes decreases (Figure 1b), as the dispersion of the individual latent rankings around decreases, that is when increases (Figure 1c), and when the average number of pairwise comparisons becomes larger (Figure 1d). Interestingly, in the last case, the method performs generally well also when the average number of pairs is , being only 1/3 of the maximal number of pairs possible.
In Figure 2 we plot the posterior distribution of corresponding to simulation experiments with , when increasing the number of assessors . Note that the number of pairs assessed by each assessor in the case is around 50, which is 1/6 of all the possible pairs.
Next, we studied the performance of the method in terms of the precision of the individual ranking estimation. We quantified the results by the probability of getting at least 3 items right, among the top-5, defined as follows. For each assessor , we found the triplet of items that had maximum posterior probability of being ranked jointly among the top 3 items, i.e. the triplet that maximized $\sum_{\sigma\in\mathcal{P}_{3}}\mathbb{P}(\{R_{ji_{1}},R_{ji_{2}},R_{ji_{3}}\}=\sigma\,|\,\text{data})$, where $\sigma$ denotes a permutation of the set $\{1,2,3\}$. This posterior quantity was estimated along the MCMC trajectory. We defined to be the set of 5 highest ranked items in , for each assessor . We then checked whether (that is, if the top-3 estimated items were all among the top-5 of each assessor). The percentages of assessors for which this is true are reported in Table 2. We notice that the results are overall very good: in the cases where is set to 10 (first 4 sub-tables from the left in Table 2), we consistently learn 3 out of the top-5 items in more than 70% of the assessors (with a peak of 100%). Also in the more difficult cases of and (first 2 sub-tables from the right in Table 2) the results are very good, especially considering that this percentage does not include the cases where only 2 (or 1) items were correctly estimated in the top positions.
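The triplet criterion can be estimated from posterior draws by counting how often each set of three items occupies the first three ranks; summing over the permutations of the three ranks is the same as ignoring the order within the set. The sketch below assumes that the draws of one assessor's ranking are stored as dictionaries mapping items to ranks; both function names are hypothetical.

```python
from collections import Counter

def best_top_k_set(rank_samples, k=3):
    """Most frequent set of items occupying ranks 1..k across posterior draws.

    rank_samples : list of dicts item -> rank, one dict per MCMC draw
    Returns the set and its estimated posterior probability.
    """
    counts = Counter(
        frozenset(item for item, rank in draw.items() if rank <= k)
        for draw in rank_samples
    )
    best_set, freq = counts.most_common(1)[0]
    return best_set, freq / len(rank_samples)

def top3_within_true_top5(rank_samples, true_ranking):
    """True if the estimated top-3 set is contained in the true top-5 set."""
    estimated, _ = best_top_k_set(rank_samples, k=3)
    true_top5 = {item for item, rank in true_ranking.items() if rank <= 5}
    return estimated <= true_top5
```

Averaging the output of `top3_within_true_top5` over assessors gives a percentage of the kind reported in the table, here only as an illustration of the definition.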
We then chose randomly one of the simulated data cases and computed the posterior probabilities of correctly predicting the preference order of all pairs not assessed by the assessors, i.e. $\mathbb{P}\Big[g(\mathcal{B}_{j,\text{new}},\bm{R}_{j})=g(\mathcal{B}_{j,\text{new}},\bm{R}_{j}^{\text{true}})\,\Big|\,\text{data}\Big]$. Figure 3 shows the boxplots for these predictive probabilities, (left) stratified according to the number of pairs each assessor assessed in the data, and (right) stratified according to the footrule distance between the individual ranking and the consensus , .
In the case considered, the model had a very good predictive power, especially considering that the simulated data had many mistakes (around 10%). We also notice a slight increase of the predictive probabilities as increases (left panel) and as decreases (right panel). These results are not surprising: it is easier to predict correct orderings of new pairs when (i) the assessor assesses more pairs, and (ii) the assessor’s own ranking more closely resembles the shared consensus.
In Supplement C we report an analysis of data generated by the logistic model LM. The results were very similar to those obtained above. In fact, the posterior distribution of was highly concentrated around 0, which is when LM collapses to BM.
7 Human causation in sounds
We analyzed the data using the mixture model explained in Section 3.3 with footrule distance. With sounds we can use the exact expression of the partition function (Vitelli et al., 2018). In the Dirichlet prior for , we set , which favors high-entropy distributions, thus reflecting our inability to express precise prior knowledge. In the Beta prior for , we set the hyperparameters at , that is, the uniform distribution on the interval , and the hyperparameters of the prior for at and , as discussed in Vitelli et al. (2018). We ran the MCMC sampler for iterations, after a burn-in of . Separate analyses were performed for .
In order to choose an appropriate number of clusters, we plot in Figure 4 two quantities: on the left, the within-cluster sum of footrule distances between the individual rankings and the consensus ranking of that cluster, ; on the right, the within-cluster indicator of mis-fit to the data, . Both these measures are defined in Vitelli et al. (2018), and tested as good measures to select .
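For a single posterior draw, the within-cluster sum of footrule distances can be computed as in the following sketch; evaluating it over all draws, for each candidate number of clusters, gives the kind of distribution inspected for an elbow. The data layout and function names are assumptions made for the illustration, and the mis-fit indicator is not reproduced here.

```python
def footrule(r1, r2):
    """Footrule distance: sum of absolute rank differences."""
    return sum(abs(a - b) for a, b in zip(r1, r2))

def within_cluster_footrule(individual, consensus, assignment):
    """Sum over assessors of the distance to their own cluster consensus.

    individual : list of rank vectors, one per assessor
    consensus  : dict cluster label -> consensus rank vector
    assignment : list of cluster labels, one per assessor
    """
    return sum(footrule(individual[j], consensus[assignment[j]])
               for j in range(len(individual)))
```

The quantity can only decrease, on average, as clusters are added, which is why a flattening of the curve rather than its minimum is used to guide the choice.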
More traditional information criteria, such as the deviance information criterion (Spiegelhalter et al., 2002), were considered; however, their performance was quite unstable, possibly because of the sparsity of the data.
Inference on the number of clusters could have been alternatively performed via a reversible jump MCMC.
There appears to be an elbow in the figures at , to guide us in the choice of the number of clusters. We decided on , also motivated by the relatively small sample size of the experiment ().
Table 3 shows the results for : the maximum a posteriori (MAP) estimates for and , together with their 95% highest posterior density (HPD) intervals, are shown at the top of the table. The table also shows the estimated cluster-specific consensus lists of sounds, estimated by the CP procedure. We observe the differences in the three consensus lists. S1, the stimulus with the most dynamic spatial motion, is on top in cluster 3, but at the bottom in cluster 1; S8, the test stimulus that has maximum spatial details but no volume nor pitch change, is on top in cluster 1, but second to the last in clusters 2 and 3. Finally, S5, the stimulus that contains the least movement variation but has pitch and volume suppressed, is ranked third and first in clusters 1 and 2, but towards the bottom of the list in cluster 3.
Listeners in cluster 1 found variation in volume or pitch as a negative or distracting feature. They rated S8 at the top, a test stimulus that has maximum spatial details but no volume nor pitch change. Also, S10, S5 and S9, which were ranked next, lack volume and pitch details. The bottom 4 stimuli contain maximum pitch and volume variation. Among them was S3 (mono sound, no space at all), forming a strong contrast to the top ranked S8 (maximum spatial movement). Evidently, space was important for these listeners, while pitch and volume variation was a negative or distracting feature.
In cluster 2 listeners did not like fast movements as a sign of human feature, but they did like correlated pitch and volume (the top 4 sounds feature a low amount of spatial variation, but also correlated pitch and volume, while the bottom 3 sounds are the same as the top 3 but lack correlated pitch and volume variation). Listeners in this cluster prioritized pitch and volume variations above spatial variation, and preferred low spatial variation (slower, or more relaxed movements).
Cluster 3 consists of subjects who, in their evaluation of the test stimuli, appear to include all spatial cues that adhere to our everyday perception of spatial motion. The stimuli with most dynamic spatial motion, enhanced by spatially correlated pitch and volume variations, are in the top-3, while stimuli with the least of these features are in the bottom-3. These listeners prioritize high levels of spatial detail above all other features, and their perception of these details is enhanced by correlated pitch and volume variations. This is indicated in (i) S1 being at the top; (ii) S7, which is the same as S1 but lacks pitch variation, being second; (iii) S11, which is the same as S1 but played 30% slower, being third (i.e. space, volume and pitch variations are just a bit slower); (iv) S8, S9, S10 are in the bottom, and all lack pitch, volume variation, and spatial movement details.
We investigate the stability of the clustering in Figure 5, which shows the heatplot of the posterior probabilities, for all the listeners (shown on the x-axis), of being assigned to each of the clusters identified in Table 3. Most of the probabilities are concentrated on some particular value of among the three possibilities, indicating a reasonably precise behavior in the cluster assignments.
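Posterior assignment probabilities of this kind are simple relative frequencies of the sampled cluster labels. The sketch below assumes the labels are stored as one list per MCMC draw, with 0-based integer labels, and that label switching has already been dealt with; both are assumptions of the illustration, and the function name is hypothetical.

```python
from collections import Counter

def assignment_probabilities(z_samples, n_clusters):
    """Estimated posterior probability of each cluster for each listener.

    z_samples[t][j] is the cluster label (0-based) of listener j at draw t.
    Returns a list with one probability vector per listener.
    """
    n_draws = len(z_samples)
    n_listeners = len(z_samples[0])
    probs = []
    for j in range(n_listeners):
        counts = Counter(draw[j] for draw in z_samples)
        probs.append([counts[c] / n_draws for c in range(n_clusters)])
    return probs
```

Rows concentrated on a single cluster correspond to the sharply assigned listeners visible in the heatplot.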
We then computed, fixing these cluster assignments, the marginal posterior probability that each sound is among the top-4 in and in , , respectively. The results are shown in Figure 6. Each heatplot refers to a cluster (G1 (left), G2 (center) and G3 (right)) and represents the marginal posterior probabilities for each sound (y-axis) being ranked among the top-4 in the consensus of that cluster (first column), and in the individual rankings of listeners in that cluster (remaining columns, assessors on the x-axis). As Figure 6 shows, there is considerable variation in the estimated rankings of the sounds between individual listeners even when they are included in the same cluster. For example, looking at Figure 6 left, we see that S8, S10, and S5 have high () posterior probability of being ranked among the top-4 stimuli in the consensus ranking (column 1). However, looking at the estimates for the listeners in cluster 1, we see that the variation is very high: For example, listener 30 (column with label 30) has a very high posterior probability of ranking S3 and S6 among the top-4 stimuli. This aspect is important for what concerns individual estimates.
Here we consider the relationship between the probability of placing some given stimuli in the top (bottom) ranks and the musical sophistication index (MSI), or the spatial audio awareness index (SAA). Figure 7 shows the relationship between listeners’ SAA and the probability of sounds S1 and S7 being ranked in the top-4 (both marginally and jointly). Recall that S1 was the original sound, while S7 was identical to S1, but without pitch variation. The plot suggests that spatial listening is a skill that is enhanced through training.
Figure 8 shows the relationship between listeners’ MSI and the probability of sounds S8 and S10 being ranked among the bottom-4 (both marginally and jointly). Respondents with a score greater than 500 were classified as musically more sophisticated, and those with a score less than 500 as less sophisticated, as suggested in http://marcs-survey.uws.edu.au/OMSI/omsi.php. Both S8 and S10 suppress pitch and volume variations, which are expected to enhance the implication of human causation. These two stimuli are more likely to be ranked in the last 4 positions by listeners with high MSI. Interestingly, this suggests that musically sophisticated listeners find pitch and volume variations to be qualities for a stimulus to sound human.
8 Conclusions and discussion
The main contribution of this paper is to introduce a new Bayesian method for non-transitive pairwise preference data. The principal advantage of the Bayesian approach comes from its ability to combine different types of uncertainty in the reported data, coming from different sources, and from being able to convert such data into the form of meaningful probabilistic inferences. Our method provides the posterior distribution of the consensus ranking, based on pairwise assessment data from a pool of assessors who may have individually violated logical transitivity in their reporting. The method is also able to produce the posterior distributions of the latent individual rankings of the assessors. Such rankings can be used in the construction of personalized recommendations, or in studying how individual preferences change with assessor related covariates. We also developed a mixture model generalization of the main model, able to handle heterogeneity in pairwise and non-transitive preference data. The model was then used to investigate how individual listeners perceive human spatial causation in acousmatic sounds. The data came from a difficult experiment that involved human perceptions. For this reason, pair comparison of sounds was the only feasible design. The data were noisy, and in particular often logically non-transitive at the individual level. We used our approach to estimate individual rankings, and sub-groups of assessors. The results revealed how differently people listen to and interpret abstract sounds. We related individual musicological scores to individual rankings, leading to an interesting correspondence between spatial sound feelings and sound expertise.
Sometimes pairwise comparison data contain draws, or ties. A tie occurs when a pairwise comparison between two items does not result in a defined preference for one item over the other. This situation has been much considered in the literature on pairwise comparisons (e.g. Rao and Kupper, 1967; Davidson, 1970). Our method does not model probabilistically the presence of ties, but it is possible to handle them directly in the MCMC procedure: apply the proposed model, and simply break each tie by tossing a symmetric coin inside the MCMC.
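The coin-tossing device for ties can be sketched as a small preprocessing step executed at each MCMC iteration, before the update of the individual rankings. The function name and data layout are hypothetical; preferences are stored as (preferred, other) pairs and ties as unordered pairs.

```python
import random

def break_ties(preferences, ties, rng=random):
    """Return the strict preferences plus one random orientation per tie.

    preferences : list of (preferred, other) pairs
    ties        : list of (item_a, item_b) pairs reported as draws
    Intended to be called afresh at every MCMC iteration.
    """
    resolved = list(preferences)
    for a, b in ties:
        resolved.append((a, b) if rng.random() < 0.5 else (b, a))
    return resolved
```

Redrawing the orientations at every iteration, rather than once before the run, lets the uncertainty about tied pairs propagate to the posterior.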
Another extension of the model would be to allow for the possibility of including covariates of subjects and/or items in the analysis. For instance, the probability of making a mistake could depend on some characteristics of the items, so that, the more similar two items are in terms of such characteristics, the more likely it is to make a mistake in reporting the pairwise preference. In our application relevant covariates could be the variation in pitch and volume, or the overall motion speed that characterizes each sound.
The time complexity of our algorithm is linear in terms of the number of assessors . The increase of the number of items does not affect computing time of a single MCMC step. However, the larger is, the longer the chain must be in order to reach convergence.
Acknowledgments
MC thanks Sonia Petrone and Isadora Antoniano-Villalobos for useful discussions. MC was partially funded by Cariplo during the project. The authors thank the Editor and the anonymous reviewers for their useful comments and suggestions, which helped improve the paper. We thank Øystein Sørensen for his contributions to Bayesian Mallows modeling.
Supplement A: Adaptations of the algorithm (doi: 10.1214/00-AOASXXXXSUPP).
Supplement B: Some remarks on the simulated data (doi: 10.1214/00-AOASXXXXSUPP).
Supplement C: Simulations with the Logistic DGP (doi: 10.1214/00-AOASXXXXSUPP).
Supplement D: MCMC diagnostics (doi: 10.1214/00-AOASXXXXSUPP).
References
- Agresti, A. (1996). Categorical data analysis. New York: John Wiley & Sons.
- Barrett, N. and Crispino, M. (2018). The Impact of 3-D Sound Spatialisation on Listeners’ Understanding of Human Agency in Acousmatic Music. Journal of New Music Research, 1–17.
- Biernacki, C. and Jacques, J. (2013). A generative model for rank data based on insertion sort algorithm. Computational Statistics & Data Analysis 58, 162–176.
- Bockenholt, U. (1988). A logistic representation of multivariate paired-comparison models. Journal of mathematical psychology 32, 44–63.
- Böckenholt, U. (2001). Hierarchical modeling of paired comparison data. Psychological Methods 6, 49.
- Böckenholt, U. (2006). Thurstonian-based analyses: Past, present, and future utilities. Psychometrika 71, 615–629.
- Böckenholt, U. and Tsai, R.-C. (2001). Individual differences in paired comparison data. British Journal of Mathematical and Statistical Psychology 54, 265–277.
- Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 324–345.
