Learning partially ranked data based on graph regularization

Kento Nakamura; Keisuke Yano; Fumiyasu Komaki

arXiv:1902.10963·stat.ME·March 1, 2019

TL;DR

This paper introduces a graph-regularized EM algorithm for estimating parameters from partially ranked data, effectively handling non-ignorable missing mechanisms with theoretical guarantees and improved accuracy.

Contribution

It proposes a novel estimation method combining graph regularization with EM to address non-ignorable missing data in partial rankings, reducing modeling bias.

Findings

01. Estimates perform well under non-ignorable missing mechanisms.

02. The method has guaranteed convergence properties.

03. Experimental results validate the effectiveness of the approach.

Equations

$$[\tau]=\{\pi\in S_{r}\mid\pi^{-1}(i)=\tau^{-1}(i)\ (i=1,\ldots,t(\tau))\}.$$

$$P(t,\pi)=P(t\mid\pi)P(\pi),$$

$$P(\tau)=\sum_{\pi\in[\tau]}P(t(\tau)\mid\pi)P(\pi).$$

$$L(\theta,\phi;\tau_{(n)}):=-\sum_{i=1}^{n}\log\sum_{\pi\in[\tau_{i}]}P(t(\tau_{i})\mid\pi;\phi)P(\pi;\theta).$$

$$d(\pi_{1},\pi_{2}):=\min\{d:\pi_{2}=a_{d}\circ\cdots\circ a_{1}\circ\pi_{1},\ a_{1},\ldots,a_{d}\in\mathcal{A}\},$$

$$\left\{P(\pi;\sigma,c)=\frac{\exp\{-cd(\pi,\sigma)\}}{Z(c)}:c>0,\ \sigma\in S_{r}\right\},$$

$$\left\{P(\pi;\bm{c},\bm{\sigma},\bm{w})=\sum_{k=1}^{K}w_{k}\frac{\exp\{-c_{k}d(\pi,\sigma_{k})\}}{Z(c_{k})}:c_{k}>0,\ \sigma_{k}\in S_{r},\ w_{k}>0,\ \sum_{k=1}^{K}w_{k}=1\right\}$$

$P(t\mid\pi;\phi)$

$\Phi$

$$(\hat{\theta}(\tau_{(n)}),\hat{\phi}(\tau_{(n)}))=\operatorname*{argmin}_{\theta\in\Theta,\,\phi\in\Phi}L_{\lambda}(\theta,\phi;\tau_{(n)}).$$

$$L_{\lambda}(\theta,\phi;\tau_{(n)})=L(\theta,\phi;\tau_{(n)})+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2},$$

$\theta^{m+1}$

$\phi^{m+1}$

$q_{i,\pi}^{m+1}$

$q_{(n)}^{m+1}$

$L(\theta;\tau_{(n)},q_{(n)}^{m+1})$

$L_{\lambda}(\phi;\tau_{(n)},q_{(n)}^{m+1})$

$\sigma^{m+1}$

$c^{m+1}$

$\phi^{l+1}$

$\varphi^{l+1}$

$u^{l+1}$

$L_{\rho}(\phi,\varphi,u;q_{(n)}^{m+1})$

$$+\sum_{\{\pi,\pi^{\prime}\}\in E}\Big\{\lambda\|\varphi_{\pi,\pi^{\prime}}-\varphi_{\pi^{\prime},\pi}\|_{2}^{2}-\frac{\rho}{2}\big(\|u_{\pi,\pi^{\prime}}\|_{2}^{2}+\|u_{\pi^{\prime},\pi}\|_{2}^{2}\big)+\frac{\rho}{2}\big(\|\phi_{\pi}-\varphi_{\pi,\pi^{\prime}}+u_{\pi,\pi^{\prime}}\|_{2}^{2}+\|\phi_{\pi^{\prime}}-\varphi_{\pi^{\prime},\pi}+u_{\pi^{\prime},\pi}\|_{2}^{2}\big)\Big\},$$

$$q_{\pi,t}^{m+1}:=\sum_{i:t(\tau_{i})=t}q_{i,\pi}^{m+1},$$

$$\sum_{\pi,\pi^{\prime}\in S_{r},\ d(\pi,\pi^{\prime})=1}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}=\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}$$

$$P(t(\tau)\mid\pi;\phi)=P(t(\tau)\mid\pi^{\prime};\phi),\quad(\pi,\pi^{\prime}\in[\tau]).$$

$$L(\theta,\phi;\tau_{(n)})=L(\phi;\tau_{(n)})+L(\theta;\tau_{(n)}),$$

$$L_{\lambda}(\theta^{m},\phi^{m};\tau_{(n)})\geq L_{\lambda}(\theta^{m+1},\phi^{m+1};\tau_{(n)}),\quad m=1,2,\ldots,$$

$$\begin{aligned}L_{\lambda}(\theta,\phi;\tau_{(n)},z_{(n)})&=-\sum_{i=1}^{n}\log\prod_{\pi\in[\tau_{i}]}\{\phi_{\pi,t(\tau_{i})}P(\pi;\theta)\}^{z_{i,\pi}}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}\\&=\Big\{-\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}z_{i,\pi}\log P(\pi;\theta)\Big\}+\Big\{-\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}z_{i,\pi}\log\phi_{\pi,t(\tau_{i})}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}\Big\}\\&=L(\theta;\tau_{(n)},z_{(n)})+L_{\lambda}(\phi;\tau_{(n)},z_{(n)}).\end{aligned}$$

Taxonomy

Topics: Face and Expression Recognition · Rough Sets and Fuzzy Logic · Multi-Criteria Decision Making

Full text

Learning partially ranked data based on graph regularization

Kento Nakamura, Keisuke Yano, and Fumiyasu Komaki

Department of Mathematical Informatics,

Graduate School of Information Science and Technology,

The University of Tokyo

{kento_nakamura,yano,komaki}@mist.i.u-tokyo.ac.jp

Abstract.

Ranked data appear in many different applications, including voting and consumer surveys. In practice, data are often only partially ranked. Partially ranked data can be regarded as missing data. This paper addresses parameter estimation for partially ranked data under a (possibly) non-ignorable missing mechanism. We propose estimators for both complete rankings and missing mechanisms together with a simple estimation procedure. Our estimation procedure combines graph regularization with the Expectation-Maximization algorithm and comes with theoretical convergence guarantees. We reduce modeling bias by allowing a non-ignorable missing mechanism. In addition, we avoid the inherent complexity of a non-ignorable missing mechanism by introducing graph regularization. The experimental results demonstrate that the proposed estimators work well under non-ignorable missing mechanisms.

Key words and phrases:

Alternating Direction Method of Multipliers; Expectation-Maximization algorithms; Kendall distances; Mallows models; Missing data

1. Introduction

Data commonly come in the form of rankings in preference surveys such as voting and consumer surveys. By asking people to order items according to their preferences, we obtain a collection of rankings. Several methods for ranked data have been proposed. Mallows (1957) proposed a parametric model, now called the Mallows model; Diaconis (1989) developed a spectral analysis for ranked data; more recently, the analysis of ranked data has attracted much attention in the machine learning community (see Liu (2011); Fürnkranz and Hüllermeier (2011)). See Section 1.2 for more details.

Partially ranked data are often observed in real data analysis because respondents do not necessarily express their preferences completely. For example, according to the election records of the American Psychological Association collected in 1980, one-third of the ballots provided full preferences for five candidates, and the rest provided only top- $t$ preferences with $t=1,2,3$ (see Section 2A in Diaconis (1989)). Movie ratings are also commonly partially ranked because respondents usually know only a few titles among a vast number of movies. Therefore, analyzing partially ranked data efficiently extends the range of application of statistical methods for ranked data.

Partially ranked data can be regarded as missing data. It is natural to assume that a latent complete ranking lies behind each partial ranking, as discussed in Lebanon and Mao (2008). Existing studies of partially ranked data make the Missing-At-Random (MAR) assumption, that is, the assumption that the missing mechanism generating partially ranked data is ignorable. Under the MAR assumption, Busse et al. (2007) and Meilă and Chen (2010) leverage an extended distance for partially ranked data, and Lu and Boutilier (2011) introduce a probability model for partially ranked data. However, an improper application of the MAR assumption may lead to a relatively large estimation error, as argued in the literature on missing data analysis (Little and Rubin (2014)). In the statistical sense, if the missing mechanism is non-ignorable, using the MAR assumption is equivalent to using a misspecified likelihood function, which can cause significantly biased parameter estimation and prediction. In fact, Marlin and Zemel (2009) point out that the MAR assumption is violated in music rankings.

This paper addresses learning the distribution of complete and partial rankings from partially ranked data under a (possibly) non-ignorable missing mechanism. Our approach includes estimating the missing mechanism, which is intrinsically difficult. Consider a top- $t$ ranking of $r$ items. The length $t$ characterizes the missing pattern that generates a top- $t$ ranking from a complete ranking of $r$ items. Since the number of complete rankings is $r!$ , the missing mechanism is modeled by $r!$ multinomial distributions with $r-1$ categories each, so fully parameterizing it requires $r!(r-2)$ parameters. Such a large number of parameters causes over-fitting, especially when the sample size is small. To avoid over-fitting, we introduce an estimation method that leverages the recent graph regularization technique (Hallac et al. (2015)) together with the Expectation-Maximization (EM) algorithm. Numerical experiments using simulated data as well as applications to real data indicate that our proposed estimation method works well, especially under non-ignorable missing mechanisms.
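As a quick worked check of this count, take $r=5$, the number of items used in the simulations later in the paper. Each of the $r!$ multinomial distributions over the $r-1$ possible lengths has $r-2$ free parameters, so:

```latex
r!\,(r-2) \;=\; 5! \times (5-2) \;=\; 120 \times 3 \;=\; 360 .
```

With a sample of $n=1000$ partial rankings, this leaves fewer than three observations per free parameter of the missing mechanism, which illustrates why some form of regularization is needed.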

1.1. Contribution

In this paper, we propose estimators for the distribution of a latent complete ranking and for a missing mechanism. To this end, we employ both a latent variable model and a recently developed graph regularization. Our proposal has two merits: First, we allow a missing mechanism to be non-ignorable by fully parameterizing it. Second, we reduce over-fitting due to the complexity of missing mechanisms by exploiting a graph regularization method.

Our ideas for the construction of the estimators are two-fold. First, we work with a latent structure behind partially ranked data (see Figure 1). This structure consists of a graph representing complete rankings (in the top layer) and arrows representing missing patterns. In this structure, a vertex in the top layer represents a latent complete ranking; an edge is endowed with a distance between complete rankings; an arrow from the top layer to the bottom layers represents a missing pattern; and a multinomial distribution on the arrows leaving a complete ranking corresponds to a missing mechanism. Second, we assume that two missing mechanisms become more similar as the associated complete rankings get closer to each other on the graph (in the top layer). These ideas are implemented by combining the EM algorithm with the graph regularization method (Hallac et al. (2015)) under the restriction of the missing parameters to the probability simplex. In addition, we discuss the convergence properties of the proposed method.

The simulation studies as well as applications to real data demonstrate that the proposed method improves on the existing methods under non-ignorable missing mechanisms, and the performance of the proposed method is comparable to those of the existing methods under the MAR assumption.

1.2. Literature review

The literature on inference for ranking-related data with (non-ignorable) missing data is relatively scarce. Marlin et al. (2007) point out that the MAR assumption does not hold in the context of collaborative filtering. Marlin et al. (2007) and Marlin and Zemel (2009) propose two estimators based on missing mechanisms. These estimators perform better than estimators ignoring the missing mechanism, both in predicting ratings and in suggesting top- $t$ ranked items to users. Using the Plackett-Luce and the Mallows models, Fahandar et al. (2017) introduce a rank-dependent coarsening model for pairwise ranking data. Our study differs from these studies in the type of ranking-related data: Marlin and Zemel (2009) and Marlin et al. (2007) discuss rating data; Fahandar et al. (2017) discuss pairwise ranking data; this study discusses partially ranked data.

Several methods have been proposed for estimating distributions of partially ranked data (Beckett (1993); Busse et al. (2007); Meilă and Chen (2010); Lebanon and Mao (2008); Jacques and Biernacki (2014); Caron et al. (2014)). These methods regard partially ranked data as missing data. Beckett (1993) discusses imputing items on the missing positions of a partial ranking by employing the EM algorithm. Busse et al. (2007) and Meilă and Chen (2010) discuss the clustering of top- $t$ rankings using existing ranking distances for top- $t$ rankings. Lebanon and Mao (2008) propose a non-parametric model together with a computationally efficient estimation method for partially ranked data; their proposal exploits the algebraic structure of partial rankings and uses the Mallows distribution as a smoothing kernel. Jacques and Biernacki (2014) propose a clustering algorithm for multivariate partially ranked data. Caron et al. (2014) discuss Bayesian non-parametric inference for top- $t$ rankings on the basis of the Plackett-Luce model. Caron et al. (2014) do not explicitly rely on the framework that regards partially ranked data as the result of missing data; however, the model they discuss is equivalent to that under the MAR assumption. Overall, all previous studies rely on the MAR assumption, whereas our study is the first attempt to estimate the distribution of partially ranked data with a (possibly) non-ignorable missing mechanism.

We work with the graph regularization framework called Network Lasso (Hallac et al. (2015)). Network Lasso employs the alternating direction method of multipliers (ADMM; see Boyd et al. (2011)) to solve a wide range of regularization-related optimization problems of a graph signal that cannot be solved efficiently by generic optimization solvers. In addition, Network Lasso has a desirable convergence property, cooperates with distributed processing systems, and has been applied to various optimization problems on a graph. We present an application of Network Lasso to missing data analysis. In the application, we coordinate Network Lasso with the probability simplex constraint and the EM algorithm.

1.3. Organization

The rest of this paper is organized as follows. Section 2 formulates a probabilistic model of partially ranked data based on a missing mechanism, and introduces a distance-based graph structure for complete rankings. Section 3 proposes the regularized estimator for both a latent complete ranking and a missing mechanism. We also discuss the convergence properties of the proposed estimation procedure. Section 4 presents the results of simulation studies and real data analysis. Section 5 concludes the paper. The concrete algorithm of the proposed estimation procedure and the proof of the convergence property are provided in Appendices A and B, respectively.

2. Preliminaries

2.1. Notation

We begin by introducing notation for analyzing partially ranked data. In this paper, we identify a complete ranking of $r$ items $\{1,\ldots,r\}$ with a permutation that maps each item $i\in\{1,\ldots,r\}$ uniquely to a corresponding rank in $\{1,\ldots,r\}$ . A top- $t$ ranking is a ranked list of $t$ items out of $r$ items. We identify a top- $t$ ranking with a permutation that maps an item in a subset of items uniquely to a corresponding rank in $\{1,\ldots,t\}$ .

We denote by $S_{r}$ the collection of all complete rankings of $r$ items. We denote by $\overline{S}_{r}$ the collection of all top- $t$ rankings with $t$ running through $t=1,\ldots,r-1$ . We denote by $t(\tau)$ the length of a partial ranking $\tau\in\overline{S}_{r}$ . A tuple of a complete ranking $\pi$ and length $t$ uniquely determines a top- $t$ ranking $\tau$ : $\pi^{-1}(i)=\tau^{-1}(i)$ for $i=1,\ldots,t$ , where $\pi^{-1}$ and $\tau^{-1}$ denote the inverses of $\pi$ and $\tau$ , respectively. We define the collection of complete rankings compatible with a given ranking $\tau\in\overline{S}_{r}$ as

$$[\tau]=\{\pi\in S_{r}\mid\pi^{-1}(i)=\tau^{-1}(i)\ (i=1,\ldots,t(\tau))\}.$$
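The compatible set $[\tau]$ can be enumerated directly for small $r$. The sketch below is illustrative only: it assumes a ranking is stored as an ordering (the tuple of items listed from rank 1 downward, i.e. $\pi^{-1}$), so that a top- $t$ ranking is simply a prefix of length $t$. The function name `compatible_set` is hypothetical.

```python
from itertools import permutations

def compatible_set(tau, r):
    """Return all complete orderings of items 1..r whose first len(tau)
    entries coincide with the top-t ordering tau."""
    rest = [item for item in range(1, r + 1) if item not in tau]
    return [tuple(tau) + tail for tail in permutations(rest)]

# A top-2 ranking of 4 items leaves 2 items unranked, so 2! = 2 completions.
print(compatible_set((3, 1), 4))
```

In general $[\tau]$ has $(r-t(\tau))!$ elements, so a top- $(r-1)$ ranking determines its complete ranking uniquely.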

2.2. Partial rankings and a missing mechanism

Next we introduce notations and terminologies for a probabilistic model with a missing mechanism.

A probabilistic model for top- $t$ ranked data with a general missing mechanism consists of a probabilistic model of generating complete rankings and that of a missing mechanism. The joint probability of a complete ranking and a missing mechanism is decomposed as

$$P(t,\pi)=P(t\mid\pi)P(\pi),$$

where $P(\pi)$ determines how a complete ranking is filled, and $P(t\mid\pi)$ specifies a missing pattern conditioned on the latent complete ranking $\pi$ . Then, the probability of a top- $t$ ranking $\tau\in\overline{S}_{r}$ is obtained by marginalizing a latent complete ranking out:

$$P(\tau)=\sum_{\pi\in[\tau]}P(t(\tau)\mid\pi)P(\pi).$$

Now distributions $P(t\mid\pi)$ and $P(\pi)$ are parameterized by $\phi$ and $\theta$ , respectively; hence $P(\tau;\theta,\phi)=\sum_{\pi\in[\tau]}P(t\mid\pi;\phi)P(\pi;\theta)$ . We call $\{P(\pi;\theta):\theta\in\Theta\}$ with a parameter space $\Theta$ a complete ranking model and $\{P(t\mid\pi;\phi):\phi\in\Phi\}$ with a parameter space $\Phi$ a missing model. We call $\theta$ a complete ranking parameter, and call $\phi$ a missing parameter, respectively. Given the i.i.d. observations $\tau_{(n)}=\{\tau_{1},\ldots,\tau_{n}\in\overline{S}_{r}\}$ , we denote the negative log-likelihood function by

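$$L(\theta,\phi;\tau_{(n)}):=-\sum_{i=1}^{n}\log\sum_{\pi\in[\tau_{i}]}P(t(\tau_{i})\mid\pi;\phi)P(\pi;\theta).\tag{1}$$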
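A minimal sketch of evaluating this negative log-likelihood by brute-force enumeration is given below. It assumes rankings are stored as orderings (items listed from rank 1 downward), that the complete ranking model is passed as a dictionary `p_complete` from orderings to probabilities, and that the missing model is a nested dictionary `phi[pi][t]`; all of these names and data layouts are illustrative choices, not the paper's.

```python
import math
from itertools import permutations

def neg_log_likelihood(taus, p_complete, phi):
    """-sum_i log sum_{pi in [tau_i]} phi[pi][t(tau_i)] * P(pi)."""
    total = 0.0
    for tau in taus:
        t = len(tau)
        # Marginalise over complete orderings that share the observed prefix.
        marginal = sum(phi[pi][t] * p
                       for pi, p in p_complete.items()
                       if pi[:t] == tuple(tau))
        total -= math.log(marginal)
    return total

# Toy example with r = 3: uniform complete rankings, homogeneous missingness.
perms = list(permutations((1, 2, 3)))
p_complete = {pi: 1.0 / len(perms) for pi in perms}
phi = {pi: {1: 0.5, 2: 0.5} for pi in perms}
nll = neg_log_likelihood([(1,), (2, 3)], p_complete, phi)
```

Enumeration costs $O(r!)$ per observation, so this is only practical for small $r$ such as the $r=5$ used in the experiments.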

2.3. Distances, graph structures, and distributions on complete rankings

We end this section by introducing distances, graph structures, and distributions on complete rankings.

We endow the class $S_{r}$ of complete rankings with a distance structure as follows. Since we identify $S_{r}$ with the class of permutations, we endow $S_{r}$ with the symmetric group structure and its group law $\circ$ . Using this identification, we use distances on symmetric groups as distances on $S_{r}$ . There exist a large number of distances on $S_{r}$ , such as the Kendall distance, the Spearman rank correlation metric, and the Hamming distance. Among these, the Kendall distance (Kendall (1938)) has often been used in statistics and machine learning. The Kendall distance between two complete rankings $\pi_{1}$ and $\pi_{2}$ , $d(\pi_{1},\pi_{2})$ , is defined as

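$$d(\pi_{1},\pi_{2}):=\min\{d:\pi_{2}=a_{d}\circ\cdots\circ a_{1}\circ\pi_{1},\ a_{1},\ldots,a_{d}\in\mathcal{A}\},$$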

where $\mathcal{A}$ is the whole class of adjacent transpositions. This distance is suitable for describing similarity between preferences because the transform of a complete ranking by a single adjacent transposition is just the exchange of the $i$ -th and $(i+1)$ -th preferences. In this paper, we focus on the Kendall distance $d$ as a distance on $S_{r}$ .

There exists a one-to-one correspondence between the distance structure with the Kendall distance and a graph structure. Set the vertex set $V=S_{r}$ and set the edge set $E=\{\{\pi,\pi^{\prime}\}:\text{ there exists }b\in\mathcal{A}\text{ such that }\pi=b\circ\pi^{\prime}\}$ . Then the distance structure $(S_{r},d)$ corresponds one-to-one to the graph $G=(V,E)$ , since the Kendall distance $d(\pi,\pi^{\prime})$ equals the minimum path length between the vertices corresponding to $\pi$ and $\pi^{\prime}$ . Note that $E$ can be rewritten as $E=\{\{\pi,\pi^{\prime}\}:d(\pi,\pi^{\prime})=1\}$ .
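The correspondence between the Kendall distance and shortest paths on this graph can be checked numerically for small $r$. The sketch below assumes rankings are stored as orderings (items listed from rank 1 downward), in which case an adjacent transposition swaps two neighbouring entries; the helper names are hypothetical.

```python
from collections import deque
from itertools import combinations

def kendall(p, q):
    """Kendall distance: number of item pairs that p and q order differently."""
    pos = {item: k for k, item in enumerate(q)}
    s = [pos[item] for item in p]
    return sum(1 for a, b in combinations(range(len(s)), 2) if s[a] > s[b])

def neighbors(p):
    """Orderings reachable from p by one adjacent transposition."""
    for k in range(len(p) - 1):
        q = list(p)
        q[k], q[k + 1] = q[k + 1], q[k]
        yield tuple(q)

def graph_distances(source):
    """Breadth-first search: shortest path length from source to every vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        p = queue.popleft()
        for q in neighbors(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist

source = (1, 2, 3, 4)
dist = graph_distances(source)
# Every shortest-path length agrees with the pair-counting Kendall distance.
agree = all(d == kendall(p, source) for p, d in dist.items())
```

Each vertex has exactly $r-1$ neighbours, so the graph has $r!(r-1)/2$ edges.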

We introduce a well-known probabilistic model for complete rankings. The Mallows model (Mallows (1957)) is one of the most popular probabilistic models for complete rankings. The Mallows model associated with the Kendall distance is defined as

$$\left\{P(\pi;\sigma,c)=\frac{\exp\{-cd(\pi,\sigma)\}}{Z(c)}:c>0,\ \sigma\in S_{r}\right\},$$

where $\sigma$ is a location parameter indicating a representative ranking, $c$ is a concentration parameter indicating a decay rate, and $Z(c)=\sum_{\pi\in S_{r}}\exp\{-cd(\pi,\sigma)\}$ is a normalizing constant that depends only on $c$ . The mixture model of $K\in\mathbb{N}$ Mallows distributions is defined as

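$$\left\{P(\pi;\bm{c},\bm{\sigma},\bm{w})=\sum_{k=1}^{K}w_{k}\frac{\exp\{-c_{k}d(\pi,\sigma_{k})\}}{Z(c_{k})}:c_{k}>0,\ \sigma_{k}\in S_{r},\ w_{k}>0,\ \sum_{k=1}^{K}w_{k}=1\right\},\tag{2}$$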

where $\bm{c}=\{c_{k}\}_{k},\bm{\sigma}=\{\sigma_{k}\}_{k},\bm{w}=\{w_{k}\}_{k}$ represent the parameters of each mixture component. The Mallows mixture model has been used for estimation and clustering analysis of ranked data (Murphy and Martin (2003); Busse et al. (2007)).
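For small $r$ the Mallows probabilities can be tabulated by enumeration. The sketch below stores rankings as orderings (an assumption of this example) and also compares the enumerated normalizing constant with the standard product form $Z(c)=\prod_{j=1}^{r}(1-e^{-jc})/(1-e^{-c})$ known for the Kendall distance; the function names are hypothetical.

```python
import math
from itertools import combinations, permutations

def kendall(p, q):
    """Kendall distance: number of item pairs that p and q order differently."""
    pos = {item: k for k, item in enumerate(q)}
    s = [pos[item] for item in p]
    return sum(1 for a, b in combinations(range(len(s)), 2) if s[a] > s[b])

def mallows_pmf(r, sigma, c):
    """Return ({ordering: probability}, Z(c)) for the Mallows model."""
    perms = list(permutations(range(1, r + 1)))
    weights = [math.exp(-c * kendall(p, sigma)) for p in perms]
    z = sum(weights)
    return {p: w / z for p, w in zip(perms, weights)}, z

def z_closed_form(r, c):
    """Product formula for Z(c); it does not depend on sigma."""
    return math.prod((1 - math.exp(-j * c)) / (1 - math.exp(-c))
                     for j in range(1, r + 1))

pmf, z = mallows_pmf(4, (2, 1, 3, 4), 0.7)
```

A mixture as in (2) is then a weighted sum of such tables, one per component.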

3. Proposed Method

In this section, we propose estimators for both complete ranking and missing models together with a simple estimation procedure. Here we assume that the parameterization of the complete ranking and missing models is separable, that is, $\theta$ and $\phi$ are distinct. We use the following missing model $\{P(t\mid\pi;\phi):\phi\in\Phi\}$ to allow a non-ignorable missing mechanism:

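$$P(t\mid\pi;\phi)=\phi_{\pi,t},\qquad\Phi=\Big\{\phi:\ \phi_{\pi,t}\geq 0,\ \sum_{t}\phi_{\pi,t}=1\ \text{for all }\pi\in S_{r}\Big\}.$$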

We make no assumptions on a complete ranking model $\{P(\pi;\theta):\theta\in\Theta\}$ .

3.1. Estimators and estimation procedure

We propose the following estimators for $\theta$ and $\phi$ : On the basis of i.i.d. observations $\tau_{(n)}=\{\tau_{1},\ldots,\tau_{n}\in\overline{S}_{r}\}$ ,

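$$(\hat{\theta}(\tau_{(n)}),\hat{\phi}(\tau_{(n)}))=\operatorname*{argmin}_{\theta\in\Theta,\,\phi\in\Phi}L_{\lambda}(\theta,\phi;\tau_{(n)}).$$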

Here $L_{\lambda}$ with a regularization parameter $\lambda>0$ is defined as

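$$L_{\lambda}(\theta,\phi;\tau_{(n)})=L(\theta,\phi;\tau_{(n)})+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2},$$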

where $\phi_{\pi}$ with $\pi\in S_{r}$ denotes the vector $(\phi_{\pi,1},\ldots,\phi_{\pi,(r-1)})$ of missing probabilities over the lengths $t=1,\ldots,r-1$ . Recall that $L(\theta,\phi;\tau_{(n)})$ is the negative log-likelihood function (1) and $E$ is the edge set of the graph induced by the Kendall distance; see Subsections 2.2-2.3.

We conduct minimization in the definition of $\hat{\theta},\hat{\phi}$ using the following EM algorithm: At the $(m+1)$ -th step, set

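$$\theta^{m+1}=\operatorname*{argmin}_{\theta\in\Theta}L(\theta;\tau_{(n)},q_{(n)}^{m+1}),\tag{3}$$

$$\phi^{m+1}=\operatorname*{argmin}_{\phi\in\Phi}L_{\lambda}(\phi;\tau_{(n)},q_{(n)}^{m+1}),\tag{4}$$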

where for $i\in\{1,\ldots,n\}$ ,

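$$q_{i,\pi}^{m+1}=\frac{\phi^{m}_{\pi,t(\tau_{i})}P(\pi;\theta^{m})}{\sum_{\pi^{\prime}\in[\tau_{i}]}\phi^{m}_{\pi^{\prime},t(\tau_{i})}P(\pi^{\prime};\theta^{m})}\quad(\pi\in[\tau_{i}]),\qquad q_{i,\pi}^{m+1}=0\quad(\pi\notin[\tau_{i}]),$$

$$q_{(n)}^{m+1}=\{q_{i,\pi}^{m+1}\}_{i,\pi},\qquad L(\theta;\tau_{(n)},q_{(n)}^{m+1})=-\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}q_{i,\pi}^{m+1}\log P(\pi;\theta),$$

$$L_{\lambda}(\phi;\tau_{(n)},q_{(n)}^{m+1})=-\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}q_{i,\pi}^{m+1}\log\phi_{\pi,t(\tau_{i})}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}.$$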
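The weights $q_{i,\pi}^{m+1}$ play the role of posterior probabilities of the latent complete ranking given the observed partial ranking. A minimal sketch of this E-step, under the usual EM reading $q_{i,\pi}\propto\phi_{\pi,t(\tau_{i})}P(\pi;\theta)$ on $[\tau_{i}]$, is shown below. Rankings are stored as orderings, and `p_complete` and `phi` are illustrative dictionaries holding the current model values; none of these names come from the paper.

```python
from itertools import permutations

def e_step(taus, p_complete, phi):
    """For each observed top-t ordering, return {pi: q_{i,pi}} over the
    compatible complete orderings, proportional to phi[pi][t] * P(pi)."""
    q = []
    for tau in taus:
        t = len(tau)
        w = {pi: phi[pi][t] * p
             for pi, p in p_complete.items()
             if pi[:t] == tuple(tau)}
        norm = sum(w.values())
        q.append({pi: v / norm for pi, v in w.items()})
    return q

# Toy example with r = 3: one complete ranking rarely produces a top-1 ranking.
perms = list(permutations((1, 2, 3)))
p_complete = {pi: 1.0 / 6.0 for pi in perms}
phi = {pi: {1: 0.5, 2: 0.5} for pi in perms}
phi[(1, 2, 3)] = {1: 0.2, 2: 0.8}
q = e_step([(1,)], p_complete, phi)
```

Here the observation "item 1 first" is attributed mostly to the ordering (1, 3, 2), because (1, 2, 3) is less likely to be truncated to length 1 under this `phi`.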

Consider minimizations (3) and (4). Minimization (3) depends on the form of the complete ranking model $P(\pi;\theta)$ . For example, consider the Mallows model with $\theta=(\sigma,c)$ (see Section 2.3). In this case, the minimization with respect to $\theta$ at the $(m+1)$ -th step is written as follows:

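$$\sigma^{m+1}=\operatorname*{argmin}_{\sigma\in S_{r}}\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}q_{i,\pi}^{m+1}d(\pi,\sigma),$$

$$c^{m+1}=\operatorname*{argmin}_{c>0}\Big\{c\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}q_{i,\pi}^{m+1}d(\pi,\sigma^{m+1})+n\log Z(c)\Big\}.$$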
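For a single Mallows component, plugging $P(\pi;\sigma,c)$ into the weighted negative log-likelihood gives $c\sum_{i,\pi}q_{i,\pi}d(\pi,\sigma)+n\log Z(c)$, which can be minimized first over $\sigma$ and then over $c$. The sketch below is one possible implementation and not the authors' algorithm: it finds $\sigma$ by exhaustive search (feasible only for small $r$), uses the standard product formula for $Z(c)$, and finds $c$ by ternary search on a bounded interval, all of which are choices made for this example.

```python
import math
from itertools import combinations, permutations

def kendall(p, q):
    """Kendall distance: number of item pairs that p and q order differently."""
    pos = {item: k for k, item in enumerate(q)}
    s = [pos[item] for item in p]
    return sum(1 for a, b in combinations(range(len(s)), 2) if s[a] > s[b])

def log_z(c, r):
    """log Z(c) from the product formula for the Kendall-distance Mallows model."""
    return sum(math.log((1 - math.exp(-j * c)) / (1 - math.exp(-c)))
               for j in range(1, r + 1))

def mallows_m_step(q, r, c_lo=1e-3, c_hi=20.0, iters=200):
    """q: list with one dict {ordering: weight} per observation."""
    n = len(q)
    def cost(sigma):
        return sum(w * kendall(pi, sigma) for qi in q for pi, w in qi.items())
    sigma = min(permutations(range(1, r + 1)), key=cost)
    d_total = cost(sigma)
    def f(c):
        # Convex in c: a linear term plus a log-sum-exp term.
        return c * d_total + n * log_z(c, r)
    for _ in range(iters):  # ternary search on [c_lo, c_hi]
        m1 = c_lo + (c_hi - c_lo) / 3
        m2 = c_hi - (c_hi - c_lo) / 3
        if f(m1) < f(m2):
            c_hi = m2
        else:
            c_lo = m1
    return sigma, (c_lo + c_hi) / 2
```

As a sanity check, feeding in weights equal to an exact Mallows distribution should return its own location and concentration parameters, since the stationarity condition in $c$ matches the expected distance.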

See Busse et al. (2007) for more details. Minimization (4) in the $(m+1)$ -th step is conducted using the following iteration: At the $(l+1)$ -th step, set

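$$\phi^{l+1}=\operatorname*{argmin}_{\phi\in\Phi}L_{\rho}(\phi,\varphi^{l},u^{l};q_{(n)}^{m+1}),\tag{9}$$

$$\varphi^{l+1}=\operatorname*{argmin}_{\varphi}L_{\rho}(\phi^{l+1},\varphi,u^{l};q_{(n)}^{m+1}),\tag{10}$$

$$u^{l+1}=u^{l}+(\phi^{l+1}-\varphi^{l+1}).\tag{11}$$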

Here, $\varphi\in\mathbb{R}^{r!(r-1)^{2}}$ is the copy variable of $\phi$ , $u\in\mathbb{R}^{r!(r-1)^{2}}$ is the dual variable, and $L_{\rho}$ is an augmented Lagrangian function with a penalty constant $\rho$ defined as

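$$\begin{aligned}L_{\rho}(\phi,\varphi,u;q_{(n)}^{m+1})=&-\sum_{\pi\in S_{r}}\sum_{t}q_{\pi,t}^{m+1}\log\phi_{\pi,t}\\&+\sum_{\{\pi,\pi^{\prime}\}\in E}\Big\{\lambda\|\varphi_{\pi,\pi^{\prime}}-\varphi_{\pi^{\prime},\pi}\|_{2}^{2}-\frac{\rho}{2}\big(\|u_{\pi,\pi^{\prime}}\|_{2}^{2}+\|u_{\pi^{\prime},\pi}\|_{2}^{2}\big)\\&\qquad+\frac{\rho}{2}\big(\|\phi_{\pi}-\varphi_{\pi,\pi^{\prime}}+u_{\pi,\pi^{\prime}}\|_{2}^{2}+\|\phi_{\pi^{\prime}}-\varphi_{\pi^{\prime},\pi}+u_{\pi^{\prime},\pi}\|_{2}^{2}\big)\Big\},\end{aligned}$$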

where

$$q_{\pi,t}^{m+1}:=\sum_{i:t(\tau_{i})=t}q_{i,\pi}^{m+1},$$

for all $\pi\in S_{r}$ and $t=1,\ldots,r-1$ . Note that $q^{m+1}_{\pi,t}=0$ when $\{i:t(\tau_{i})=t\}$ is an empty set. The detailed algorithm is provided in Appendix A.
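Of the three ADMM updates, the one for the copy variable $\varphi$ separates over edges and has a closed form, because on each edge it minimizes a quadratic. Writing $p=\phi_{\pi}+u_{\pi,\pi^{\prime}}$ and $q=\phi_{\pi^{\prime}}+u_{\pi^{\prime},\pi}$, setting the gradient of $\lambda\|a-b\|_{2}^{2}+\frac{\rho}{2}(\|p-a\|_{2}^{2}+\|q-b\|_{2}^{2})$ to zero gives the solution coded below. This is a sketch of that single step derived from the augmented Lagrangian form, not the paper's Appendix A; the function names are hypothetical, and the $\phi$ update (which involves the log term and the simplex constraint) is not covered.

```python
def copy_update(p, q, lam, rho):
    """Closed-form minimiser (a, b) of
    lam*||a-b||^2 + (rho/2)*(||p-a||^2 + ||q-b||^2),
    where a = varphi_{pi,pi'} and b = varphi_{pi',pi}."""
    k = 4 * lam + rho
    a = [((2 * lam + rho) * x + 2 * lam * y) / k for x, y in zip(p, q)]
    b = [(2 * lam * x + (2 * lam + rho) * y) / k for x, y in zip(p, q)]
    return a, b

def edge_objective(a, b, p, q, lam, rho):
    """Edge-wise part of the augmented Lagrangian that depends on (a, b)."""
    def sq(v, w):
        return sum((x - y) ** 2 for x, y in zip(v, w))
    return lam * sq(a, b) + rho / 2 * (sq(p, a) + sq(q, b))

p, q = [0.7, 0.3], [0.2, 0.8]
a, b = copy_update(p, q, lam=1.0, rho=2.0)
```

The two copies are pulled toward each other as $\lambda$ grows and stay at $p$ and $q$ when $\lambda=0$, which is how the penalty couples neighbouring missing mechanisms.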

Remark 3.1 (Meaning of the penalty term).

We make the assumption that two complete rankings close to each other in the Kendall distance have smoothly related missing probabilities. This assumption leads to adding a ridge penalty

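$$\sum_{\pi,\pi^{\prime}\in S_{r},\ d(\pi,\pi^{\prime})=1}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}=\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}$$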

to the negative log-likelihood function. This assumption is reasonable because the Kendall distance measures the similarity of preferences expressed by two rankings.
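To make the penalty concrete, the sketch below evaluates the sum over edges for a small example. It assumes rankings are stored as orderings and that `phi` maps each ordering to its vector of missing probabilities; the function names are hypothetical.

```python
from itertools import permutations

def edges(r):
    """Unordered pairs of orderings that differ by one adjacent swap,
    i.e. pairs at Kendall distance 1."""
    out = set()
    for p in permutations(range(1, r + 1)):
        for k in range(r - 1):
            q = list(p)
            q[k], q[k + 1] = q[k + 1], q[k]
            out.add(frozenset((p, tuple(q))))
    return out

def graph_penalty(phi, r):
    """Sum over edges {pi, pi'} of ||phi_pi - phi_pi'||_2^2."""
    total = 0.0
    for edge in edges(r):
        p, q = tuple(edge)
        total += sum((x - y) ** 2 for x, y in zip(phi[p], phi[q]))
    return total

# r = 3: a homogeneous missing mechanism has zero penalty.
perms = list(permutations((1, 2, 3)))
phi_flat = {pi: (0.5, 0.5) for pi in perms}
phi_bump = dict(phi_flat)
phi_bump[(1, 2, 3)] = (0.9, 0.1)
```

A missing mechanism that is the same for every complete ranking incurs no penalty, so large $\lambda$ shrinks the estimate toward a homogeneous mechanism of the kind assumed for ME in Section 4.2.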

Remark 3.2 (The proposed method under the MAR assumption).

For top- $t$ ranked data, the MAR assumption is expressed as

$$P(t(\tau)\mid\pi;\phi)=P(t(\tau)\mid\pi^{\prime};\phi),\quad(\pi,\pi^{\prime}\in[\tau]).$$

Then under the MAR assumption, the negative log-likelihood function $L(\theta,\phi;\tau_{(n)})$ is decomposed as

$$L(\theta,\phi;\tau_{(n)})=L(\phi;\tau_{(n)})+L(\theta;\tau_{(n)}),$$

which indicates that the parameter estimation for $\phi$ is unnecessary for estimating $\theta$ .
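The intermediate step behind this decomposition can be spelled out as follows. Under the MAR condition the factor $P(t(\tau_{i})\mid\pi;\phi)$ is constant over $\pi\in[\tau_{i}]$ and can be pulled out of the inner sum; the labelling of the two resulting terms as $L(\phi;\tau_{(n)})$ and $L(\theta;\tau_{(n)})$ is our reading of the notation.

```latex
\begin{aligned}
L(\theta,\phi;\tau_{(n)})
&= -\sum_{i=1}^{n}\log\sum_{\pi\in[\tau_{i}]}P(t(\tau_{i})\mid\pi;\phi)\,P(\pi;\theta)\\
&= -\sum_{i=1}^{n}\log\Big\{P(t(\tau_{i})\mid\pi_{i};\phi)\sum_{\pi\in[\tau_{i}]}P(\pi;\theta)\Big\}
   \qquad(\pi_{i}\in[\tau_{i}]\ \text{arbitrary})\\
&= \underbrace{-\sum_{i=1}^{n}\log P(t(\tau_{i})\mid\pi_{i};\phi)}_{L(\phi;\tau_{(n)})}
   \;+\;\underbrace{-\sum_{i=1}^{n}\log\sum_{\pi\in[\tau_{i}]}P(\pi;\theta)}_{L(\theta;\tau_{(n)})}.
\end{aligned}
```

The second term does not involve $\phi$, so minimizing over $\theta$ can ignore the missing mechanism entirely.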

3.2. Convergence

In this subsection, we discuss theoretical guarantees for two procedures (3)-(8) and (9)-(11).

It is guaranteed that the sequence $\{L_{\lambda}(\theta^{m},\phi^{m};\tau_{(n)}):m=1,2,\ldots\}$ obtained using the procedure (3)-(8) monotonically decreases,

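$$L_{\lambda}(\theta^{m},\phi^{m};\tau_{(n)})\geq L_{\lambda}(\theta^{m+1},\phi^{m+1};\tau_{(n)}),\quad m=1,2,\ldots,$$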

because the procedure is just the EM algorithm. To see this, introduce latent assignment variables $z_{(n)}=\{z_{i}\}_{i}$ , where $z_{i}\in\mathbb{R}^{|S_{r}|}$ for every $i=1,\ldots,n$ , with $z_{i,\pi}=1$ if $\tau_{i}$ is generated from the complete ranking $\pi$ by the missing mechanism and $z_{i,\pi}=0$ otherwise. Using $z_{(n)}$ , we decompose the regularized negative log-likelihood function as follows:

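$$\begin{aligned}L_{\lambda}(\theta,\phi;\tau_{(n)},z_{(n)})&=-\sum_{i=1}^{n}\log\prod_{\pi\in[\tau_{i}]}\{\phi_{\pi,t(\tau_{i})}P(\pi;\theta)\}^{z_{i,\pi}}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}\\&=\Big\{-\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}z_{i,\pi}\log P(\pi;\theta)\Big\}+\Big\{-\sum_{i=1}^{n}\sum_{\pi\in[\tau_{i}]}z_{i,\pi}\log\phi_{\pi,t(\tau_{i})}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2}\Big\}\\&=L(\theta;\tau_{(n)},z_{(n)})+L_{\lambda}(\phi;\tau_{(n)},z_{(n)}).\end{aligned}$$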

On the basis of this decomposition, the standard procedure of the EM algorithm yields the iterative algorithm shown in (3)-(8). Note that whether the convergent point of the sequence is a local minimum of $L(\theta,\phi;\tau_{(n)})$ depends on the complete ranking model $P(\pi;\theta)$ ; see Section 3 of McLachlan and Krishnan (2007).

Next, it is guaranteed that the sequence $\{\phi^{l}\}$ obtained using the procedure (9)-(11) attains the global minimum of $L_{\lambda}(\phi;\tau_{(n)},q_{(n)}^{m+1})$ in terms of the objective value.

Proposition 3.1.

The sequence $\{L_{\lambda}(\phi^{l};\tau_{(n)},q^{m+1}_{(n)})\}_{l=1}^{\infty}$ converges to $\min_{\phi}L_{\lambda}(\phi;\tau_{(n)},q^{m+1}_{(n)})$ .

The proof is provided in Appendix B. The basis of the proof is reformulating the optimization problem (4) as an instance of the alternating direction method of multipliers (ADMM; Boyd et al. (2011)). We rewrite the problem (4) as follows:

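$$\min_{\phi\in\Phi}\ -\sum_{\pi\in V}\sum_{t}q_{\pi,t}\log\phi_{\pi,t}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\phi_{\pi}-\phi_{\pi^{\prime}}\|_{2}^{2},\tag{14}$$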

where $V$ is the vertex set of the graph defined in Section 2.3 and $q_{\pi,t}=\sum_{i:t(\tau_{i})=t}\sum_{k=1}^{K}q_{i,k,\pi}$ . Introducing a copy variable $\varphi$ on the edge set, we recast the optimization problem (14) into an equivalent form:

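$$\min_{\phi\in\Phi,\ \varphi}\ -\sum_{\pi\in V}\sum_{t}q_{\pi,t}\log\phi_{\pi,t}+\lambda\sum_{\{\pi,\pi^{\prime}\}\in E}\|\varphi_{\pi,\pi^{\prime}}-\varphi_{\pi^{\prime},\pi}\|_{2}^{2}\quad\text{subject to}\quad\phi_{\pi}=\varphi_{\pi,\pi^{\prime}},\ \phi_{\pi^{\prime}}=\varphi_{\pi^{\prime},\pi}\ \ (\{\pi,\pi^{\prime}\}\in E).$$

Note that this reformulation follows the idea of Hallac et al. (2015). We employ ADMM to minimize a sum of objective functions of split variables under linear constraints.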

4. Numerical experiments

In this section, we apply our methods to both simulation studies and real data analysis. In simulation studies, we use the Mallows mixture models (2) with two types of missing models. In the real data analysis, we use the election records of the American Psychological Association collected in 1980.

4.1. Performance measures

We evaluate the performance of several estimators for $\theta$ and $\phi$ in estimating distributions of a latent complete ranking and of a partial ranking. We measure the performance using the following total variation losses: When the true values of a complete ranking and missing parameters are $\theta$ and $\phi$ , respectively, the losses of estimators $\hat{\theta}$ and $\hat{\phi}$ are given as

[TABLE]

Losses $L_{\mathrm{par}}$ and $L_{\mathrm{comp}}$ measure the estimation losses for partial and complete ranking distributions, respectively.
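A total variation loss between two discrete distributions can be computed as in the sketch below. The normalization by $1/2$ is an assumption of this example (conventions differ, and the paper's exact display is not reproduced here), as are the function name and the dictionary representation of distributions.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions given
    as dicts {outcome: probability}; missing outcomes count as probability 0.
    Assumes the 1/2 * L1 convention."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Toy check on rankings of two items.
true_dist = {(1, 2): 0.5, (2, 1): 0.5}
est_dist = {(1, 2): 0.8, (2, 1): 0.2}
loss = total_variation(true_dist, est_dist)
```

For $L_{\mathrm{comp}}$ the outcomes would be complete rankings in $S_{r}$, and for $L_{\mathrm{par}}$ they would be partial rankings in $\overline{S}_{r}$.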

4.2. Method comparison

We compare our estimators with the estimator based on the maximum entropy approach proposed by Busse et al. (2007) and with a non-regularized estimator, abbreviated as ME and NR, respectively. In addition, we use the proposed estimator with the regularization parameter selected using two-fold cross-validation based on $L_{\mathrm{par}}$ .

We denote the proposed method introduced in Section 3 with a fixed value of the regularization parameter $\lambda$ as R $\lambda$ , and that with the regularization parameter selected using cross-validation as RCV.

The maximum entropy approach (ME) uses an extended distance between top- $t$ rankings to introduce an exponential family distribution of a top- $t$ ranking. From the viewpoint of missing data analysis, ME implicitly makes the MAR assumption. For this reason, in the maximum entropy approach, we estimate the missing model parameter $\phi$ by assuming homogeneous missing probabilities $P(t\mid\pi)=\phi_{t}\ (\forall\pi\in S_{r})$ and using maximum likelihood when evaluating the loss $L_{\mathrm{par}}(\theta,\phi)$ .

The non-regularized estimator (NR) is the minimizer of the non-regularized likelihood function $L(\theta,\phi;\tau_{(n)})$ . The estimation based on the non-regularized likelihood function can be implemented straightforwardly.

4.3. Stopping criteria

In the simulation studies, we use the following stopping criteria and hyperparameters. We terminate the iteration of the EM algorithm when the change in the likelihood of the observable distribution falls below $\epsilon=1$ . We terminate the iteration of ADMM when both the primal and dual residuals fall below $\epsilon_{p}=\epsilon_{d}=1$ or when the number of iterations exceeds 100. In addition, to avoid being trapped in local minima by the EM algorithm, we use the following devices. First, we use 10 different initial location parameters in the EM algorithm. Second, we make the value of the location parameter move from the current one to a different one in the first five iterations of the EM algorithm.

4.4. Simulation studies

We conducted two simulation studies. The data-generating models are as follows: For complete ranking models, we use the Mallows and Mallows mixture models. For missing models, we use a binary missing mechanism in which the possible missing patterns are only that no items are missing or that all but the first items are missing. We parameterize missing models in such a way that there is a discrepancy between the distribution of a complete ranking generated by the latent Mallows model and the marginal distribution of a partial ranking restricted to $S^{(r-1)}_{r}:=\{\tau:t(\tau)=r-1,\tau\in\overline{S}_{r}\}$ . Note that $S^{(r-1)}_{r}$ is identical to $S_{r}$ as a set.

In each simulation, we generate 100 datasets with sample size of $n=1000$ . We set the number of items to $r=5$ .

4.4.1. Tilting the concentration parameter

In the first simulation study, we use the Mallows model and the missing model that tilts the concentration parameter $c$ : The missing model is parameterized by $c^{\ast}>0$ and $R\in[0,1]$ as

[TABLE]

where

[TABLE]

In this parameterization, the parameter $c^{\ast}$ specifies the degree of concentration of the marginal distribution $P(\tau;\theta,\phi)$ of a partial ranking restricted to $S^{(r-1)}_{r}$ : If $\{Z(c)/Z(c^{\ast})\}R\exp\{-(c^{\ast}-c)d(\pi,\sigma_{0})\}\leq 1$ , $P(\tau;\theta,\phi)$ has the form of the Mallows distribution with the concentration parameter $c^{\ast}$ :

[TABLE]

where $\pi(i)=\tau(i),i=1,\ldots,r-1$ . The parameter $0\leq R\leq 1$ specifies the proportion of partial rankings in $S^{(r-1)}_{r}$ . We set $c=1$ , $R=0.7$ , and $c^{\ast}\in\{0.8,1,1.2\}$ .

Figures 2 and 3 show the results. When $c^{\ast}\neq 1$ , the proposed methods outperform ME both in $L_{\mathrm{par}}$ and $L_{\mathrm{comp}}$ . When $c^{\ast}=1$ , the proposed methods underperform compared to ME. These results reflect that the setting with $c^{\ast}=1$ satisfies the MAR assumption, whereas the settings with $c^{\ast}\neq 1$ do not satisfy the MAR assumption. For $L_{\mathrm{par}}$ , the proposed methods outperform NR regardless of the values of $c^{\ast}$ . However, there are subtle distinctions in the values of $L_{\mathrm{comp}}$ of these methods. The performance of the proposed method with the cross-validated regularization parameter (RCV) is comparable with that of the proposed method with the optimal regularization parameter both for $L_{\mathrm{par}}$ and $L_{\mathrm{comp}}$ , indicating the utility of cross-validation.

4.4.2. Tilting the mixture coefficient

In the second simulation study, we use the Mallows mixture model with two clusters and the missing model that tilts the mixture coefficient $w$ . We instantiate a missing model, in which missing probabilities depend on the cluster assignment $k\in\{1,\ldots,K\}$ , such that $P(t\mid\pi,z_{k}=1)=P(t\mid z_{k}=1)=\phi_{k,t}$ , where $z_{k}=1$ if and only if the assigned cluster is $k$ and $z_{k}=0$ otherwise. Then, the missing model is parameterized by $w^{\ast}\in[0,1]$ and $R\in[0,1]$ as

[TABLE]

where $C_{k}(w^{\ast},R)=(w^{\ast}_{k}/w_{k})R.$ In this parameterization, the parameter $w^{\ast}$ determines the mixture coefficient of the marginal distribution $P(\tau;\theta,\phi)$ of a partial ranking restricted to $S^{(r-1)}_{r}$ . We set the parameter values as follows:

• $\bm{\sigma}=((1,2,3,4,5),(3,2,5,4,1))$ , $\bm{c}=(1,1)$ , and $w=(0.5,0.5)$ ;

• $R=0.7$ and $w^{\ast}\in\{(0.5,0.5),(0.6,0.4),(0.7,0.3)\}$ .
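As a small worked example of this parameterization, the sketch below computes $C_{k}(w^{\ast},R)=(w^{\ast}_{k}/w_{k})R$ for the three settings. It then checks the stated roles of $R$ and $w^{\ast}$ under one assumption that the text above does not spell out: that $C_{k}$ acts as the probability with which cluster $k$ yields a partial ranking in $S^{(r-1)}_{r}$. The function names are hypothetical.

```python
import math


def cluster_rates(w, w_star, R):
    """C_k(w*, R) = (w*_k / w_k) * R for each cluster k."""
    return [ws / wk * R for wk, ws in zip(w, w_star)]


def implied_marginals(w, rates):
    """Proportion and mixture coefficients among rankings in S_r^(r-1).

    Assumes rates[k] is the probability that a ranking drawn from
    cluster k is observed as a partial ranking in S_r^(r-1).
    """
    proportion = sum(wk * ck for wk, ck in zip(w, rates))
    coefficients = [wk * ck / proportion for wk, ck in zip(w, rates)]
    return proportion, coefficients


w = (0.5, 0.5)
R = 0.7
for w_star in [(0.5, 0.5), (0.6, 0.4), (0.7, 0.3)]:
    rates = cluster_rates(w, w_star, R)
    proportion, coefficients = implied_marginals(w, rates)
    print(w_star, rates, proportion, coefficients)
```

Under this reading, the overall proportion is $\sum_{k}w_{k}C_{k}=R$ and the mixture coefficients among those partial rankings are $w_{k}C_{k}/R=w^{\ast}_{k}$, which matches the roles of $R$ and $w^{\ast}$ described above.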

In this simulation study, we additionally use the classification error as the performance measure.

Figures 4–6 show the results. The proposed methods outperform ME in $L_{\mathrm{par}}$ when $w_{1}^{\ast}\neq 0.5$ , in $L_{\mathrm{comp}}$ when $w_{1}^{\ast}=0.7$ , and in the classification error when $w_{1}^{\ast}\neq 0.5$ . The proposed methods outperform NR in both $L_{\mathrm{par}}$ and $L_{\mathrm{comp}}$ , except in $L_{\mathrm{comp}}$ when $w_{1}^{\ast}=0.7$ . As $w_{1}^{\ast}$ deviates from $0.5$ , the classification error of ME increases, whereas the classification errors of the other methods decrease.

4.5. Application to real data

We apply the proposed method to real data. We use the election records for five candidates collected by the American Psychological Association. Among the 15549 votes cast, only 5141 ranked all candidates ( $t=5,4$ ); 2108 ranked $t=3$ candidates; 2462 ranked $t=2$ ; and the rest ranked only $t=1$ .

For comparison, we randomly chose several pairs of train and test datasets to measure $L_{\mathrm{par}}$ , since we have neither the true values of the model parameters nor the form of the model. To see the dependence of the estimation performance on the sample size, we used different sizes ( $n=100,500,1000,5000,10000$ ) of the train datasets, whereas we fixed the size of the test datasets to $n=3000$ . We sampled test datasets independently $30$ times and sampled train datasets from the remaining data independently $30$ times for each size. In calculating $L_{\mathrm{par}}$ , we used the empirical distribution of the employed test dataset as the true distribution. For the complete ranking model, we used the likelihood of the Mallows mixture model with the number of clusters set to 3, as in Busse et al. (2007). Since R1 performs poorly in terms of $L_{\mathrm{par}}$ in the simulation study, we eliminate R1 from the candidates for two-fold cross-validation.
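The resampling protocol can be sketched as follows. This is only one possible reading of the description above: each of the $30$ repetitions draws a test set of $3000$ votes and then, for every train size, draws a train set from the votes left over. The function name, the seed, and the use of integer indices in place of actual votes are assumptions made for illustration.

```python
import random


def make_splits(n_total, test_size, train_sizes, n_repeats, seed=0):
    """Return a list of (test indices, {train size: train indices}) pairs."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_total))
        rng.shuffle(indices)
        test_idx = indices[:test_size]
        remaining = indices[test_size:]
        # Train sets are drawn without replacement from the non-test votes.
        train_idx = {n: rng.sample(remaining, n) for n in train_sizes}
        splits.append((test_idx, train_idx))
    return splits


splits = make_splits(
    n_total=15549,
    test_size=3000,
    train_sizes=(100, 500, 1000, 5000, 10000),
    n_repeats=30,
)
print(len(splits), len(splits[0][0]), sorted(splits[0][1]))
```

Drawing the train sets from the remaining indices keeps every train set disjoint from its paired test set, so the test empirical distribution used as the reference in $L_{\mathrm{par}}$ is not contaminated by training votes.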

Figures 7 and 8 show the results. When the sample size is small ( $n=100,500$ ), the proposed method is comparable to ME, and NR works poorly. When the sample size is moderate ( $n=1000$ ), the proposed method outperforms both ME and NR. When the sample size is large ( $n=5000,10000$ ), the proposed method outperforms ME and is comparable to NR. These results indicate that considering non-ignorable missing mechanisms improves the performance when the sample size is sufficient, while the graph regularization reduces over-fitting when the sample size is insufficient.

5. Conclusion

We proposed a regularization method for partially ranked data that prevents modeling bias due to the MAR assumption and avoids over-fitting due to the complexity of missing models. Our simulation experiments showed that the proposed method improves on the maximum entropy approach (Busse et al. (2007)) under non-ignorable missing mechanisms. They also showed that the proposed method improves on the non-regularized estimator, especially in estimating the distribution of a partial ranking. Our real data analysis suggested that the improvement by the proposed method appears at moderate or large sample sizes and that the proposed method is effective in reducing over-fitting.

We propose two main tasks for future work. The first task is to improve the computational efficiency of our method, which was not a priority in this study. Leveraging partial completion of items (instead of full completion) might be effective for reducing the computational cost. For this purpose, the distance between top- $t$ rankings described in Busse et al. (2007) might be beneficial for the construction of the graph. The second task is to develop cross-validation or an information criterion for inferring the distribution of a latent complete ranking. In this study, we employed cross-validation based on the distribution of a partial ranking. When the distribution of a latent complete ranking is of interest, cross-validation based on the distribution of a latent complete ranking would be more suitable. However, the construction of such cross-validation would be difficult because the empirical distribution of a latent complete ranking cannot be obtained directly, a difficulty that arises ubiquitously when one uses the EM algorithm for the estimation of latent variables. There have been several derivations of information criteria comprising the distribution of latent variables (Shimodaira (1994); Cavanaugh and Shumway (1998)). We conjecture that these derivations would be useful for inference from partially ranked data.

Appendix A Algorithms

In this appendix, we provide a concise algorithm to conduct ADMM in (9)-(11). In the algorithm, $\lambda$ is the regularization parameter, $\rho$ is the penalty constant, and $\epsilon_{p},\epsilon_{d}$ are two parameters for stopping the algorithm.
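To illustrate the roles of $\lambda$, $\rho$, $\epsilon_{p}$, and $\epsilon_{d}$, the following is a minimal scaled-form ADMM sketch on a toy problem. It is not the algorithm for (9)-(11): the toy objective $\frac{1}{2}\|x-a\|_{2}^{2}+\frac{\lambda}{2}(z_{1}-z_{2})^{2}$ subject to $x=z$ (a quadratic data term plus a smoothness penalty on a two-node graph) merely stands in for $f$ and $g$, and the function name is hypothetical. The stopping rule, which requires the primal and dual residual norms to fall below $\epsilon_{p}$ and $\epsilon_{d}$, is an assumption following the standard criterion in Boyd et al. (2011).

```python
import math


def admm_two_node(a, lam, rho, eps_p, eps_d, max_iter=10_000):
    """Scaled-form ADMM for
        min_x,z 0.5 * ||x - a||^2 + (lam / 2) * (z[0] - z[1])^2  s.t.  x = z.
    Toy stand-in only; not the objective used in the paper.
    """
    x, z, u = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # x-update: minimize the data term plus the quadratic penalty.
        x = [(a[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(2)]
        # z-update: solve (lam * L + rho * I) z = rho * (x + u), where L is
        # the two-node graph Laplacian. The mean is kept and the difference
        # between the two nodes is shrunk.
        v = [x[i] + u[i] for i in range(2)]
        mean = 0.5 * (v[0] + v[1])
        half_diff = 0.5 * (v[0] - v[1]) * rho / (rho + 2.0 * lam)
        z_new = [mean + half_diff, mean - half_diff]
        # Scaled dual update.
        u = [u[i] + x[i] - z_new[i] for i in range(2)]
        # Primal and dual residual norms.
        primal = math.sqrt(sum((x[i] - z_new[i]) ** 2 for i in range(2)))
        dual = rho * math.sqrt(sum((z_new[i] - z[i]) ** 2 for i in range(2)))
        z = z_new
        if primal <= eps_p and dual <= eps_d:
            break
    return x, z, n_iter


x, z, n_iter = admm_two_node(
    a=[1.0, 0.0], lam=1.0, rho=1.0, eps_p=1e-10, eps_d=1e-10
)
print(x, z, n_iter)
```

For this toy problem the minimizer is available in closed form (the mean of $a$ is preserved and the difference between the two coordinates is divided by $1+2\lambda$), which gives a direct check on the iterates.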

Appendix B Proof of Proposition 3.1

Proof of Proposition 3.1: First, we express optimization (3.2) at the $m$ -th step of iteration (4) using an extended-real-valued function as follows:

[TABLE]

where the functions $f:\mathbb{R}^{r!(r-1)}\rightarrow\mathbb{R}\cup\{\infty\}$ and $g:\mathbb{R}^{r!(r-1)^{2}}\rightarrow\mathbb{R}$ are defined as

[TABLE]

Note that the effective domain $\mathrm{dom}(f)=\{\phi\in\mathbb{R}^{r!(r-1)}\mid f(\phi)<\infty\}$ is identical to the parameter space $\Phi=\left\{\phi\in\mathbb{R}^{r!(r-1)}:\sum_{t=1}^{r-1}\phi_{\pi,t}=1,\phi_{\pi,t}\geq 0,\pi\in S_{r}\right\}$ . It suffices to show the following two convergences for the sequence $\{\phi^{l},\varphi^{l},u^{l}:l=0,1,\ldots\}$ generated by iteration (9)-(11).

• Residual convergence: the primal residual $\bar{r}^{l}\in\mathbb{R}^{r!(r-1)^{2}}$ defined by $\bar{r}_{\pi,\pi^{\prime},t}^{l}:=\phi_{\pi,t}^{l}-\varphi_{\pi,\pi^{\prime},t}^{l}$ converges to $0$ as $l\rightarrow\infty$ : $\lim_{l\rightarrow\infty}\bar{r}^{l}=0;$

• Objective convergence: the convergence

[TABLE]

holds.

Objective convergence together with residual convergence implies convergence of the objective function $L_{\lambda}(\phi;\tau_{(n)},q^{m+1}_{(n)})$ , because we have

[TABLE]

where it follows from residual convergence and the continuity of $g$ that $|g(\phi^{l})-g(\varphi^{l})|\rightarrow 0$ .

The following is a sufficient condition for objective and residual convergence based on ADMM (see Section 3.2 of Boyd et al. (2011)):

(I) The functions $f$ and $g$ are closed, proper, and convex;

(II) The unaugmented Lagrangian $\widetilde{L}_{0}$ has a saddle point.

Here, the unaugmented Lagrangian $\widetilde{L}_{0}$ is defined as

[TABLE]

In what follows, we show that conditions (I) and (II) hold.

Confirming condition (I): $g$ is clearly a closed, proper, and convex function because $g$ is a positive quadratic function. Each function $f_{\pi,t}\ (\pi\in S_{r},t\in\{1,\ldots,r-1\})$ is closed because every level set $V_{\gamma}=\{\phi\in\mathbb{R}^{r!(r-1)}\mid f_{\pi,t}(\phi)\leq\gamma\}$ with $\gamma\in\mathbb{R}$ is a closed set:

[TABLE]

Therefore, $f$ is closed. $f$ is proper because $f\geq 0>-\infty$ everywhere and $f(\phi)<\infty$ for $\phi\in\mathbb{R}^{r!(r-1)}$ satisfying $\phi_{\pi,t}=1/(r-1),\ \pi\in S_{r},t=1,\ldots,r-1$ . Each function $f_{\pi,t}$ $(\pi\in S_{r},t\in\{1,\ldots,r-1\})$ is convex because the effective domain $\mathrm{dom}(f_{\pi,t})$ is a convex set and $\nabla^{2}f_{\pi,t}(\phi)$ is positive semidefinite for all $\phi\in\mathrm{dom}(f_{\pi,t})$ . Therefore, $f$ is convex. Thus, condition (I) holds.

Confirming condition (II): We employ the following sufficient condition for the existence of a saddle point described as Assumption 5.5.1 and Proposition 5.5.6 in Section 5.5 of Bertsekas (2015):

(i) For each $\phi\in\mathbb{R}^{r!(r-1)},\varphi\in\mathbb{R}^{r!(r-1)^{2}}$ , $-\widetilde{L}_{0}(\phi,\varphi,\cdot):\mathbb{R}^{r!(r-1)^{2}}\rightarrow\mathbb{R}\cup\{\infty\}$ is convex and closed;

(ii) For each $y\in\mathbb{R}^{r!(r-1)^{2}}$ , $\widetilde{L}_{0}(\cdot,\cdot,y):\mathbb{R}^{r!(r-1)}\times\mathbb{R}^{r!(r-1)^{2}}\rightarrow\mathbb{R}\cup\{\infty\}$ is convex and closed;

(iii) Functions $L^{+}$ and $L^{-}$ are proper, where $L^{+}:\mathbb{R}^{r!(r-1)}\times\mathbb{R}^{r!(r-1)^{2}}\rightarrow\mathbb{R}\cup\{\infty\}$ and $L^{-}:\mathbb{R}^{r!(r-1)^{2}}\rightarrow\mathbb{R}\cup\{\infty\}$ are defined as

[TABLE]

(iv) For each $\gamma\in\mathbb{R}$ , the level set $\{(\phi,\varphi)\mid L^{+}(\phi,\varphi)\leq\gamma\}$ is compact;

(v) For each $\gamma\in\mathbb{R}$ , the level set $\{y\mid L^{-}(y)\leq\gamma\}$ is compact.

Condition (i) holds because $-\widetilde{L}_{0}(\phi,\varphi,\cdot)$ is linear for $\phi\in\mathrm{dom}(f)$ and $-\infty$ for $\phi\not\in\mathrm{dom}(f)$ . Condition (ii) holds because $\widetilde{L}_{0}(\cdot,\cdot,y)$ is the sum of convex and closed functions.

To confirm condition (iii), we will show that $L^{+}$ and $L^{-}$ are proper. Set $\phi^{\ast}\in\mathbb{R}^{r!(r-1)}$ and $\varphi^{\ast}\in\mathbb{R}^{r!(r-1)^{2}}$ such that

[TABLE]

It follows that $L^{+}$ is proper since

[TABLE]

It follows that $L^{-}$ is proper since

[TABLE]

Therefore, condition (iii) is confirmed.

To confirm conditions (iv) and (v), it suffices to show that the level sets are closed and bounded. Since the function obtained by taking the point-wise supremum of a family of closed functions is again closed, both $L^{+}$ and $L^{-}$ are closed and thus their level sets are closed. The remaining part of the proof is to show that the level sets of $L^{+}$ and $L^{-}$ are bounded.

We will show that all level sets of $L^{+}$ are bounded by focusing on the effective domain of $L^{+}$ . We show that the effective domain $\mathrm{dom}(L^{+})$ is a subset of the bounded set

[TABLE]

according to which all level sets of $L^{+}$ are bounded. If $\phi_{\widetilde{\pi},\widetilde{t}}-\varphi_{\widetilde{\pi},\widetilde{\pi}^{\prime},\widetilde{t}}\neq 0$ for some $\{\widetilde{\pi},\widetilde{\pi}^{\prime}\}\in E$ and $\widetilde{t}\in\{1,\ldots,r-1\}$ , we can take a sequence $\{y^{n}\}_{n=1}^{\infty}\subset\mathbb{R}^{r!(r-1)^{2}}$ such that $y^{n}_{\pi,\pi^{\prime},t}=n(\phi_{\widetilde{\pi},\widetilde{t}}-\varphi_{\widetilde{\pi},\widetilde{\pi}^{\prime},\widetilde{t}})$ for $(\{\pi,\pi^{\prime}\},t)=(\{\widetilde{\pi},\widetilde{\pi}^{\prime}\},\widetilde{t})$ and $y^{n}_{\pi,\pi^{\prime},t}=0$ otherwise. For this sequence $\{y^{n}\}$ , we have

[TABLE]

from which we obtain $L^{+}(\phi,\varphi)=\sup_{y\in\mathbb{R}^{r!(r-1)^{2}}}\widetilde{L}_{0}(\phi,\varphi,y)=\infty$ . Therefore, the effective domain of $L^{+}(\phi,\varphi)$ is included in the bounded set $B$ and thus all level sets of $L^{+}$ are bounded.

We will show that all level sets of $L^{-}$ are bounded by showing that $L^{-}$ is coercive, i.e., for any sequence $\{y^{n}\}_{n=1}^{\infty}\subset\mathbb{R}^{r!(r-1)^{2}}$ satisfying $\lim_{n\rightarrow\infty}\|y^{n}\|_{2}=\infty$ , we have $\lim_{n\rightarrow\infty}L^{-}(y^{n})=\infty.$ For any given sequence $\{y^{n}\}_{n}$ satisfying $\lim_{n\rightarrow\infty}\|y^{n}\|_{2}=\infty$ , take sequences $\{\phi^{n}\}_{n}$ and $\{\varphi^{n}\}_{n}$ such that $\phi_{\pi,t}^{n}=1/(r-1)$ and $\varphi_{\pi,\pi^{\prime}}^{n}=\phi_{\pi}^{n}+y_{\pi,\pi^{\prime}}^{n}/\|y_{\pi,\pi^{\prime}}^{n}\|_{2}.$ For sequences $\{\phi^{n}\}_{n}$ and $\{\varphi^{n}\}_{n}$ , we obtain

[TABLE]

Hence, $L^{-}$ is coercive since we have $L^{-}(y^{n})\geq-\widetilde{L}_{0}(\phi^{n},\varphi^{n},y^{n})\geq-2r!(r-1)+\|y^{n}\|_{2}\rightarrow\infty\ (n\rightarrow\infty)$ for any sequence $\{y^{n}\}_{n=1}^{\infty}$ satisfying $\|y^{n}\|_{2}\rightarrow\infty$ , and thus all level sets of $L^{-}$ are bounded.
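The linear term in this bound can be verified directly. As a sketch, assume that the unaugmented Lagrangian has the standard ADMM form $\widetilde{L}_{0}(\phi,\varphi,y)=f(\phi)+g(\varphi)+\sum_{\pi,\pi^{\prime},t}y_{\pi,\pi^{\prime},t}(\phi_{\pi,t}-\varphi_{\pi,\pi^{\prime},t})$ (an assumption, chosen to agree with the definition of the primal residual $\bar{r}^{l}$), and write $y_{\pi,\pi^{\prime}}\in\mathbb{R}^{r-1}$ for the block of $y$ indexed by $t$, with blocks satisfying $y^{n}_{\pi,\pi^{\prime}}=0$ taken to contribute zero. With $\varphi^{n}_{\pi,\pi^{\prime}}=\phi^{n}_{\pi}+y^{n}_{\pi,\pi^{\prime}}/\|y^{n}_{\pi,\pi^{\prime}}\|_{2}$, the negated linear term becomes

$$
-\sum_{\pi,\pi^{\prime},t}y^{n}_{\pi,\pi^{\prime},t}\left(\phi^{n}_{\pi,t}-\varphi^{n}_{\pi,\pi^{\prime},t}\right)
=\sum_{\pi,\pi^{\prime}}\frac{\|y^{n}_{\pi,\pi^{\prime}}\|_{2}^{2}}{\|y^{n}_{\pi,\pi^{\prime}}\|_{2}}
=\sum_{\pi,\pi^{\prime}}\|y^{n}_{\pi,\pi^{\prime}}\|_{2}
\geq\Big(\sum_{\pi,\pi^{\prime}}\|y^{n}_{\pi,\pi^{\prime}}\|_{2}^{2}\Big)^{1/2}
=\|y^{n}\|_{2},
$$

where the inequality holds because a sum of nonnegative numbers is at least the square root of the sum of their squares. Under the assumed form, this is the source of the $\|y^{n}\|_{2}$ term in the lower bound on $L^{-}(y^{n})$.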

From the above, conditions (iv) and (v) are satisfied and thus we complete the proof. ∎

Bibliography

1. Beckett (1993). L. Beckett. Maximum likelihood estimation in Mallows’ model using partially ranked data. In Probability models and statistical analyses for ranking data, pages 92–107. Springer, 1993.
2. Bertsekas (2015). D. Bertsekas. Convex optimization algorithms. Athena Scientific Belmont, 2015.
3. Boyd et al. (2011). S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine learning, 3:1–122, 2011.
4. Busse et al. (2007). L. Busse, P. Orbanz, and J. Buhmann. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th International Conference on Machine Learning, pages 113–120, 2007.
5. Caron et al. (2014). F. Caron, Y. Teh, and T. Murphy. Bayesian nonparametric Plackett–Luce models for the analysis of preferences for college degree programmes. The Annals of Applied Statistics, 8:1145–1181, 2014.
6. Cavanaugh and Shumway (1998). J. Cavanaugh and R. Shumway. An Akaike information criterion for model selection in the presence of incomplete data. Journal of Statistical Planning and Inference, 67:45–66, 1998.
7. Diaconis (1989). P. Diaconis. A generalization of spectral analysis with application to ranked data. The Annals of Statistics, 17:949–979, 1989.
8. Fahandar et al. (2017). M. Fahandar, E. Hüllermeier, and I. Couso. Statistical inference for incomplete ranking data: The case of rank-dependent coarsening. In Proceedings of the 34th International Conference on Machine Learning, pages 1078–1087, 2017.
