Nonparametric maximum likelihood estimation under a likelihood ratio order

Ted Westling; Kevin J. Downes; Dylan S. Small

arXiv:1904.12321·stat.ME·July 8, 2021


TL;DR

This paper develops a nonparametric maximum likelihood estimator for two distributions under the likelihood ratio order, applicable to various distribution types, with theoretical convergence results and practical applications.

Contribution

It introduces a novel nonparametric estimator for distributions and their density ratio under likelihood ratio order, extending to discrete, continuous, and mixed cases.

Findings

01

Estimator converges in distribution in certain cases

02

Numerical experiments validate the estimator's performance

03

Application to biomarker data demonstrates practical utility

Abstract

Comparison of two univariate distributions based on independent samples from them is a fundamental problem in statistics, with applications in a wide variety of scientific disciplines. In many situations, we might hypothesize that the two distributions are stochastically ordered, meaning intuitively that samples from one distribution tend to be larger than those from the other. One type of stochastic order that arises in economics, biomedicine, and elsewhere is the likelihood ratio order, also known as the density ratio order, in which the ratio of the density functions of the two distributions is monotone non-decreasing. In this article, we derive and study the nonparametric maximum likelihood estimator of the individual distributions and the ratio of their densities under the likelihood ratio order. Our work applies to discrete distributions, continuous distributions, and mixed…

Equations

$$\frac{f_{0}(z)}{g_{0}(z)}=\frac{[dF_{0}/dH_{0}](z)}{[dG_{0}/dH_{0}](z)}=\frac{P(D=1\mid Z=z)/P(D=1)}{P(D=0\mid Z=z)/P(D=0)}.$$

$$F_{n}^{*}(x_{i})-F_{n}^{*}(x_{i}-)=[F_{n}^{*}(y_{j})-F_{n}^{*}(y_{j-1})]\frac{F_{n}(x_{i})-F_{n}(x_{i}-)}{F_{n}(y_{j})-F_{n}(y_{j-1})}.$$

$$F_{n}^{*}(x_{i})-F_{n}^{*}(x_{i}-)=[1-F_{n}^{*}(y_{m_{2}})]\frac{F_{n}(x_{i})-F_{n}(x_{i}-)}{1-F_{n}(y_{m_{2}})}.$$

$$\{(G_{n}(y_{k}),F_{n}(y_{k})):k=0,\dotsc,m_{2}\},$$

$$P_{0}(Z\leq z,D=d)=d\pi_{0}F_{0}(z)+(1-d)(1-\pi_{0})G_{0}(z).$$

$$\sigma_{0}^{2}(y_{j}):=\theta_{0}(y_{j})\frac{\pi_{0}F_{0,j}+(1-\pi_{0})\Delta G_{0}(y_{j})-F_{0,j}\Delta G_{0}(y_{j})}{\pi_{0}(1-\pi_{0})[\Delta G_{0}(y_{j})]^{2}},$$

$$n^{1/3}[\mu_{n}^{*}(z)-\mu_{0}(z)]\stackrel{d}{\longrightarrow}\left\{4\mu_{0}^{\prime}(z)\mu_{0}(z)[1-\mu_{0}(z)]h_{0}(z)^{-1}\right\}^{1/3}W,$$

$$n^{1/3}[\theta_{n}^{*}(z)-\theta_{0}(z)]$$

$$\kappa_{0}(z):=\theta_{0}(z)\frac{\pi_{0}f_{0}(z)+(1-\pi_{0})g_{0}(z)}{\pi_{0}(1-\pi_{0})g_{0}(z)^{2}}.$$

$$F(A)G(B)=\int_{(x,y)\in A\times B}\nu(x)\,d(G\times G)(x,y).$$

$$\int_{(x,y)\in A\times B}\nu(x)\,d(G\times G)(x,y)\leq\int_{(x,y)\in A\times B}\nu(y)\,d(G\times G)(x,y).$$

$$\int_{(x,y)\in A\times B}\nu(y)\,d(G\times G)(x,y)=\int_{x\in A}dG(x)\int_{y\in B}\nu(y)\,dG(y)=G(A)F(B).$$

$$[R_{F,G}(u)-R_{F,G}(t)][\lambda(v-t)]=F(A)G(B)\leq F(B)G(A)=[(1-\lambda)(v-t)][R_{F,G}(v)-R_{F,G}(t)].$$

$$t_{j}=\frac{\int_{w_{j}}^{y}\nu(u)\,dG(u)}{G(y)-G(w_{j})}=\nu(y)+\frac{\int_{w_{j}}^{y}[\nu(u)-\nu(y)]\,dG(u)}{G(y)-G(w_{j})}.$$

$$\prod_{i=1}^{n_{1}}[F(X_{i})-F(X_{i}-)]=\prod_{j=1}^{m_{2}+1}\prod_{k\in I_{j}}[F(x_{k})-F(x_{k}-)]^{r_{k}}.$$

$$F_{n}^{*}(x_{k})-F_{n}^{*}(x_{k}-)=f_{j}\frac{r_{k}}{\sum_{l\in I_{j}}r_{l}}$$

$$\left\{\prod_{k=1}^{m_{2}+1}f_{k}^{R_{k}}\right\}\left\{\prod_{k=1}^{m_{2}}g_{k}^{s_{k}}\right\}=\left\{\prod_{k=1}^{m_{2}}f_{k}^{R_{k}}g_{k}^{s_{k}}\right\}f_{m_{2}+1}^{R_{m_{2}+1}}$$

$$\bar{L}_{n}(\bar{f}_{1},\dotsc,\bar{f}_{m_{2}},f_{m_{2}+1},g_{1},\dotsc,g_{m_{2}}):=\left\{\prod_{k=1}^{m_{2}}\bar{f}_{k}^{R_{k}}g_{k}^{s_{k}}\right\}(1-f_{m_{2}+1})^{n_{1}-R_{m_{2}+1}}f_{m_{2}+1}^{R_{m_{2}+1}}$$

$$\bar{L}_{n}(\rho,\sigma)=\prod_{k=1}^{m_{2}}[\rho_{k}\sigma_{k}/\bar{n}_{1}]^{R_{k}}[(1-\rho_{k})\sigma_{k}/n_{2}]^{s_{k}}=\bar{n}_{1}^{-\bar{n}_{1}}n_{2}^{-n_{2}}\prod_{k=1}^{m_{2}}\rho_{k}^{R_{k}}(1-\rho_{k})^{s_{k}}\prod_{k=1}^{m_{2}}\sigma_{k}^{R_{k}+s_{k}}$$

$$\sum_{k=1}^{m_{2}}[R_{k}\log\rho_{k}+s_{k}\log(1-\rho_{k})]=\sum_{k=1}^{m_{2}}w_{k}[t_{k}\log\rho_{k}+(1-t_{k})\log(1-\rho_{k})]$$

$$\{(0,0)\}\cup\left\{\left(\sum_{j=1}^{k}w_{j},\sum_{j=1}^{k}t_{j}w_{j}\right):k=1,\dotsc,m_{2}\right\}=\left\{\left(n_{1}F_{n}(y_{k})+n_{2}G_{n}(y_{k}),n_{1}F_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$$

$$\delta:=\min\left\{\frac{F_{0}(y_{j+1})-F_{0}(y_{j})}{\Delta G_{0}(y_{j+1})}-\frac{F_{0}(y_{j})-F_{0}(y_{j-1})}{\Delta G_{0}(y_{j})}:j=1,\dotsc,m_{2}-1\right\},$$

$$\frac{F_{n}(y_{j+1})-F_{n}(y_{j})}{\Delta G_{n}(y_{j+1})}-\frac{F_{n}(y_{j})-F_{n}(y_{j-1})}{\Delta G_{n}(y_{j})}$$

$$+\frac{[F_{n}(y_{j+1})-F_{0}(y_{j+1})]-[F_{n}(y_{j})-F_{0}(y_{j})]}{\Delta G_{0}(y_{j+1})}$$

$$+[F_{n}(y_{j+1})-F_{n}(y_{j})]\left[\frac{1}{\Delta G_{n}(y_{j+1})}-\frac{1}{\Delta G_{0}(y_{j+1})}\right]$$

$$-\frac{[F_{n}(y_{j})-F_{0}(y_{j})]-[F_{n}(y_{j-1})-F_{0}(y_{j-1})]}{\Delta G_{0}(y_{j})}$$

$$-[F_{n}(y_{j})-F_{n}(y_{j-1})]\left[\frac{1}{\Delta G_{n}(y_{j})}-\frac{1}{\Delta G_{0}(y_{j})}\right].$$

$$\min\left\{\frac{F_{n}(y_{j+1})-F_{n}(y_{j})}{\Delta G_{n}(y_{j+1})}-\frac{F_{n}(y_{j})-F_{n}(y_{j-1})}{\Delta G_{n}(y_{j})}:j=1,\dotsc,m_{2}-1\right\}\geq\delta-o_{P}(1),$$

$$\frac{F_{n}(y_{j+1})-F_{n}(y_{j})}{\Delta G_{n}(y_{j+1})}\geq\frac{F_{n}(y_{j})-F_{n}(y_{j-1})}{\Delta G_{n}(y_{j})}$$

$$\theta_{n}^{*}(y_{j})=[\partial_{-}\mathrm{GCM}_{[0,1]}(R_{F_{n},G_{n}})]\circ G_{n}(y_{j})=\frac{F_{n}(y_{j})-F_{n}(y_{j-1})}{\Delta G_{n}(y_{j})}$$

$$n^{1/2}[\theta_{n}^{*}(y_{j})-\theta_{0}(y_{j})]$$


Taxonomy

Topics: Statistical Distribution Estimation and Applications · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference

Full text

Nonparametric maximum likelihood estimation under a likelihood ratio order

Ted Westling (1), Kevin J. Downes (2,3,4), Dylan S. Small (5)

(1) Department of Mathematics and Statistics, University of Massachusetts Amherst

(2) Center for Pediatric Clinical Effectiveness, Children’s Hospital of Philadelphia

(3) Division of Infectious Diseases, Children’s Hospital of Philadelphia

(4) Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania

(5) Department of Statistics, The Wharton School, University of Pennsylvania

Abstract

Comparison of two univariate distributions based on independent samples from them is a fundamental problem in statistics, with applications in a variety of scientific disciplines. In many situations, we might hypothesize that the two distributions are stochastically ordered, meaning that samples from one distribution tend to be larger than those from the other. One type of stochastic order is the likelihood ratio order, in which the ratio of the density functions of the two distributions is monotone non-decreasing. In this article, we derive and study the nonparametric maximum likelihood estimator of the individual distribution functions and the ratio of their densities under the likelihood ratio order. Our work applies to discrete distributions, continuous distributions, and mixed continuous-discrete distributions. We demonstrate convergence in distribution of the estimator in certain cases, and we illustrate our results using numerical experiments and an analysis of a biomarker for predicting bacterial infection in children with systemic inflammatory response syndrome.

1 Introduction

Comparing the distributions of two independent samples is a fundamental problem in statistics. Suppose that $X_{1},\dotsc,X_{n_{1}}$ and $Y_{1},\dotsc,Y_{n_{2}}$ are independent real-valued samples with distribution functions $F_{0}$ and $G_{0}$ , respectively. In many situations, we might hypothesize that $F_{0}$ and $G_{0}$ are stochastically ordered, meaning intuitively that samples from $F_{0}$ tend to be larger than those from $G_{0}$ . A particular type of stochastic order that arises in many applications is the likelihood ratio order. Specifically, $G_{0}$ and $F_{0}$ satisfy a likelihood ratio order if the density ratio $f_{0}/g_{0}$ is monotone non-decreasing over the support $\mathscr{G}_{0}$ of $G_{0}$ , where $f_{0}:=dF_{0}/d\eta$ and $g_{0}:=dG_{0}/d\eta$ for some dominating measure $\eta$ . For this reason, the likelihood ratio order is also called a density ratio order.

A likelihood ratio order can arise for a variety of scientific reasons (Beare and Moon, 2015; Roosen and Hennessy, 2004; Dykstra et al., 1995; Yu et al., 2017). In the biomedical sciences and elsewhere, the ratio of two density functions is an object of interest for describing the relative likelihood of a binary status indicator conditional on a covariate. If $D$ is a binary random variable, $Z$ is a scalar random variable, $F_{0}(z)=P(Z\leq z\mid D=1)$ , $G_{0}(z)=P(Z\leq z\mid D=0)$ , and $H_{0}(z):=P(Z\leq z)$ , then

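$$\frac{f_{0}(z)}{g_{0}(z)}=\frac{[dF_{0}/dH_{0}](z)}{[dG_{0}/dH_{0}](z)}=\frac{P(D=1\mid Z=z)/P(D=1)}{P(D=0\mid Z=z)/P(D=0)}.\qquad(1)$$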

Therefore, the density ratio in this context may be interpreted as the relative odds of $D=1$ given $Z=z$ to the overall odds of $D=1$ . Since the transformation $x\mapsto x/(1-x)$ is strictly increasing, monotonicity of the density ratio is equivalent to monotonicity of $z\mapsto P(D=1\mid Z=z)$ . One specific situation in which the representation given in (1) is of scientific interest is biomarker evaluation, wherein $D$ represents infection status and $Z$ represents the value of a biomarker. Equation (1) implies that the ratio of the densities of biomarker values among infected patients to the same among uninfected patients can be interpreted as the odds ratio of infection given biomarker level relative to overall odds of infection. Monotonicity of the density ratio corresponds to the assumption that the conditional probability of infection given biomarker level increases with biomarker level, which is reasonable if the biomarker is actually predictive of disease.
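The identity in (1) can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-point support, the mass functions `f` and `g`, and the prevalence `pi` are made-up values, not quantities from the paper. It computes $P(D=1\mid Z=z)$ by Bayes' rule and confirms that the odds of $D=1$ given $Z=z$, relative to the overall odds, equal $f_{0}(z)/g_{0}(z)$.

```python
from fractions import Fraction

# Hypothetical inputs chosen for illustration only.
pi = Fraction(3, 10)                                    # P(D = 1)
f = {0: Fraction(1, 5), 1: Fraction(3, 10), 2: Fraction(1, 2)}  # mass of Z given D = 1
g = {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)}  # mass of Z given D = 0


def prob_d1_given_z(z):
    """P(D = 1 | Z = z) by Bayes' rule."""
    return pi * f[z] / (pi * f[z] + (1 - pi) * g[z])


def density_ratio_via_odds(z):
    """Right-hand side of (1): conditional odds relative to overall odds."""
    p1 = prob_d1_given_z(z)
    return (p1 / pi) / ((1 - p1) / (1 - pi))


for z in sorted(f):
    print(z, density_ratio_via_odds(z), f[z] / g[z])
```

In this example the ratio `f[z] / g[z]` increases in `z`, and so does `prob_d1_given_z(z)`, in line with the equivalence stated above.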

In this article, we derive the nonparametric maximum likelihood estimators of $F_{0}$ , $G_{0}$ , and $\theta_{0}=f_{0}/g_{0}$ under the likelihood ratio order restriction and derive certain asymptotic properties of these estimators, including consistency and convergence in distribution. In particular, we use a connection between estimation of $\theta_{0}$ and the classical isotonic regression problem with a binary outcome, which both simplifies the derivation of large-sample results and suggests that existing inference methods for the isotonic regression problem can be used to perform inference for $\theta_{0}$ as well. Our results generalize those of Dykstra et al. (1995), who derived the maximum likelihood estimator of $F_{0}$ and $G_{0}$ under a likelihood ratio order in the special case where $F_{0}$ and $G_{0}$ are discrete distributions. We will illustrate our results using numerical experiments and an analysis of a biomarker for predicting bacterial infection in children with systemic inflammatory response syndrome.

Recently, Yu et al. (2017) considered estimation of a monotone density ratio and the individual density functions by maximizing a smoothed likelihood function, and demonstrated certain asymptotic properties of their estimator. Yu et al. (2017) considered maximizing a smoothed likelihood rather than maximizing the likelihood directly because the maximum likelihood estimator of the individual densities does not exist. In contrast, we will show that, using a definition of the likelihood ratio ordered model based on convexity of the ordinal dominance curve, a well-defined nonparametric maximum likelihood estimator of the monotone density ratio function and the individual distribution functions (rather than the density functions) does exist. Furthermore, unlike the smoothed estimator, the derivation of the maximum likelihood estimator does not require selection of a bandwidth or any other tuning parameter, and does not rely on the existence of Lebesgue density functions.

Additional relevant references include: Lehmann and Rojo (1992) and Shaked and Shanthikumar (2007), which contain more examples and details regarding stochastic orders, Carolan and Tebbs (2005) and Beare and Moon (2015), which studied tests of the likelihood ratio order, and Rojo and Samaniego (1991), Rojo and Samaniego (1993), Mukerjee (1996), Arcones and Samaniego (2000), Davidov and Herman (2012), and Tang et al. (2017), which considered testing and estimation under other stochastic orders.

2 Likelihood ratio orders

We observe two independent real-valued samples $X_{1},\dotsc,X_{n_{1}}$ and $Y_{1},\dotsc,Y_{n_{2}}$ with distribution functions $F_{0}$ and $G_{0}$ , respectively. We define $\mathscr{F}_{0}$ as the support of $F_{0}$ and $\mathscr{G}_{0}$ as the support of $G_{0}$ . We write $n:=n_{1}+n_{2}$ , and denote by $F_{n}$ and $G_{n}$ the empirical distribution functions of $X_{1},\dotsc,X_{n_{1}}$ and $Y_{1},\dotsc,Y_{n_{2}}$ , respectively. We define $x_{1}<\cdots<x_{m_{1}}$ as the unique values of $X_{1},\dotsc,X_{n_{1}}$ , $y_{1}<\cdots<y_{m_{2}}$ as the unique values of $Y_{1},\dotsc,Y_{n_{2}}$ , and $z_{1}<z_{2}<\cdots<z_{m}$ as the unique values of $(X_{1},\dotsc,X_{n_{1}},Y_{1},\dotsc,Y_{n_{2}})$ . Throughout, we assume that $n$ is fixed, but that $n_{1}$ is drawn from a Binomial $(n,\pi_{0})$ distribution for some $\pi_{0}\in(0,1)$ .

We let $\mathscr{D}$ be the space of distribution functions on $\mathbb{R}$ ; i.e. all non-decreasing, càdlàg functions $H$ such that $\lim_{x\to-\infty}H(x)=0$ and $\lim_{x\to\infty}H(x)=1$ . For any non-decreasing function $h:\mathbb{R}\to\mathbb{R}$ , we define its generalized inverse $h^{-}$ pointwise as $h^{-}(u):=\inf\{x:h(x)\geq u\}$ . When $h\in\mathscr{D}$ , $h^{-}$ is called the quantile function of $h$ . For any interval $I\subseteq\mathbb{R}$ and any function $h:I\to\mathbb{R}$ , we define the greatest convex minorant (GCM) of $h$ on $I$ , denoted $\mathrm{GCM}_{I}(h):I\to\overline{\mathbb{R}}$ , for $\overline{\mathbb{R}}$ the extended real line, as the pointwise supremum of all convex functions on $I$ bounded above by $h$ . The least concave majorant (LCM) operator is defined analogously. We say a function $H$ is convex over a set $\mathscr{S}\subseteq\mathbb{R}$ if for every $x,y\in\mathscr{S}$ and $\lambda\in[0,1]$ such that $\lambda x+(1-\lambda)y\in\mathscr{S}$ , $H(\lambda x+(1-\lambda)y)\leq\lambda H(x)+(1-\lambda)H(y)$ . We also define $\partial_{-}$ as the left derivative operator for a left differentiable function and $\mathrm{Im}(h):=\{h(x):x\in\mathscr{S}\}$ as the image of a function $h$ defined on a domain $\mathscr{S}$ .

The unrestricted nonparametric model for the pair $(F,G)$ of distribution functions of the observed data is $\mathscr{M}_{NP}:=\mathscr{D}^{2}$ . As mentioned in the introduction, the likelihood ratio order can be defined as the ratio of the density functions $f_{0}$ and $g_{0}$ of $F_{0}$ and $G_{0}$ with respect to some dominating measure $\eta$ being non-decreasing. By varying the dominating measure, both discrete and continuous distributions can be handled this way. However, as noted by Yu et al. (2017), this definition does not lend itself to the derivation of a maximum likelihood estimator, since the likelihood defined through the densities can be made arbitrarily large. Instead, other authors have defined the likelihood ratio order as convexity of the ordinal dominance curve, defined as $t\mapsto R_{F,G}(t):=F\circ G^{-}(t)$ for $t\in[0,1]$ (Bamber, 1975; Hsieh and Turnbull, 1996). Lehmann and Rojo (1992) demonstrated the equivalence of this definition to that using the density functions in the special case that $F$ and $G$ are strictly increasing and continuous on their supports, which were assumed to be intervals. Alternatively, Shaked and Shanthikumar (2007) defined the likelihood ratio order as $F(A)G(B)\leq F(B)G(A)$ for all measurable sets $A,B\subseteq\mathbb{R}$ with $A\leq B$ , where $F(A):=\int_{A}\,dF$ (with some abuse of notation) and $A\leq B$ means that $a\leq b$ for all $a\in A$ and $b\in B$ .
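To make the ordinal dominance curve concrete, the following sketch evaluates $R_{F,G}=F\circ G^{-}$ for two empirical distribution functions and checks convexity on $\mathrm{Im}(G)$ directly from the chord inequality in the definition of convexity over a set. The helper names (`ecdf`, `generalized_inverse`, `odc_on_image`, `is_convex_on`) and the toy samples are assumptions introduced for illustration; exact fractions are used so that the comparisons are free of rounding error.

```python
import math
from bisect import bisect_right
from fractions import Fraction


def ecdf(sample):
    """Empirical distribution function returning exact fractions."""
    xs = sorted(sample)
    return lambda t: Fraction(bisect_right(xs, t), len(xs))


def generalized_inverse(sample, u):
    """G^-(u) = inf{x : G(x) >= u} for the empirical G of `sample`."""
    ys = sorted(sample)
    if u <= 0:
        return -math.inf
    return ys[math.ceil(u * len(ys)) - 1]


def odc_on_image(x, y):
    """Points (t, R(t)) of the ordinal dominance curve for t in Im(G)."""
    F, G = ecdf(x), ecdf(y)
    image = [Fraction(0)] + [G(v) for v in sorted(set(y))]
    return [(t, F(generalized_inverse(y, t))) for t in image]


def is_convex_on(points):
    """Chord inequality for every triple a < b < c of abscissae."""
    pts = sorted(points)
    for i, (a, fa) in enumerate(pts):
        for j in range(i + 1, len(pts)):
            b, fb = pts[j]
            for c, fc in pts[j + 1:]:
                if (fb - fa) * (c - a) > (fc - fa) * (b - a):
                    return False
    return True


ordered = odc_on_image([2, 3, 3, 4], [1, 2, 2, 3])
unordered = odc_on_image([1, 1, 3], [1, 2, 3])
print(is_convex_on(ordered), is_convex_on(unordered))
```

For the first pair of toy samples the curve is convex on $\mathrm{Im}(G)$, so the empirical pair satisfies the order as defined through $R_{F,G}$; for the second pair it is not.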

In Theorem 1 below, we consolidate and generalize existing results connecting these different definitions of the likelihood ratio order.

Theorem 1.

If $F\ll G$ and $\nu:=dF/dG$ is continuous on the support $\mathscr{G}$ of $G$ , then (1) the following are equivalent: $R_{F,G}$ is convex on $\mathrm{Im}(G)$ , $\nu$ is non-decreasing on $\mathscr{G}$ , and $F(A)G(B)\leq F(B)G(A)$ for all measurable sets $A\leq B$ ; and (2) if $\nu$ is non-decreasing on $\mathscr{G}$ then $\nu(x)=\partial_{-}\mathrm{GCM}_{[0,1]}(R_{F,G})\circ G(x)$ for all $x\in\mathscr{G}$ .

To our knowledge, Theorem 1 is the most general result to-date connecting the three definitions of the likelihood ratio ordered model. We note that the three definitions may not be equivalent when $F$ is not dominated by $G$ or $\nu$ is not continuous. For instance, in the proof of Theorem 1 part (1), we only use the assumption that $\nu$ is continuous on $\mathscr{G}$ to show that $R_{F,G}$ is convex on $\mathrm{Im}(G)$ implies that $\nu$ is non-decreasing, but we show that if $\nu$ is non-decreasing, then $F(A)G(B)\leq F(B)G(A)$ for all $A\leq B$ (the definition used in Shaked and Shanthikumar, 2007) regardless of whether $\nu$ is continuous. Additionally, we show that $F(A)G(B)\leq F(B)G(A)$ for all $A=(a_{1},b_{1}]\leq B=(b_{1},b_{2}]$ implies that $R_{F,G}$ is convex on $\mathrm{Im}(G)$ regardless of whether $F\ll G$ or $\nu$ is continuous. However, to show the converse, we use $F\ll G$ . For a simple counterexample when $F$ is not dominated by $G$ , consider $F(\{a\})=1$ and $G(\{b\})=1$ , where $a<b$ . Then $R_{F,G}(u)=I(u>0)$ for $u\in[0,1]$ , which is convex on $\mathrm{Im}(G)=\{0,1\}$ , but $1=F(A)G(B)>F(B)G(A)=0$ for $A=\{a\}\leq\{b\}=B$ . Finally, when $F\ll G$ but $dF/dG$ is not continuous on $\mathscr{G}$ , whether $R_{F,G}$ being convex on $\mathrm{Im}(G)$ implies that $F(A)G(B)\leq F(B)G(A)$ for all measurable sets $A\leq B$ , or even all such Borel sets, is unclear to us.

Throughout the remainder of the article, we say $(F,G)\in\mathscr{M}_{NP}$ satisfy a likelihood ratio order, and write $G\leq_{LR}F$ if $R_{F,G}$ is convex on $\mathrm{Im}(G)$ . We then define the likelihood ratio ordered model $\mathscr{M}_{LR}$ as all $(F,G)\in\mathscr{M}_{NP}$ such that $G\leq_{LR}F$ . For any $(F,G)\in\mathscr{M}_{NP}$ , we further define $\theta:\mathscr{M}_{NP}\to\boldsymbol{\Theta}$ as $\theta_{F,G}:=\partial_{-}\mathrm{GCM}_{[0,1]}(R_{F,G})\circ G$ , where $\boldsymbol{\Theta}$ is defined as the set of non-negative, non-decreasing functions on $\mathbb{R}$ . We note that this definition allows for the possibility that $F$ is not dominated by $G$ , but by Theorem 1, for all $(F,G)\in\mathscr{M}_{LR}$ such that $F\ll G$ and $dF/dG$ is continuous on $\mathscr{G}$ , $\theta_{F,G}=dF/dG$ on $\mathscr{G}$ . We define $\theta_{0}:=\theta_{F_{0},G_{0}}$ .

In the context of the likelihood ratio order, many existing works either assume that $F_{0}$ and $G_{0}$ are discrete (e.g. Dykstra et al., 1995) or that $F_{0}$ and $G_{0}$ are continuous (e.g. Lehmann and Rojo, 1992; Yu et al., 2017). In the discrete setting, if $F_{0}$ and $G_{0}$ are discrete distributions with common support and mass functions $\Delta F_{0}$ and $\Delta G_{0}$ such that $(F_{0},G_{0})\in\mathscr{M}_{LR}$ , then $\theta_{0}=\Delta F_{0}/\Delta G_{0}$ on $\mathscr{G}_{0}$ . Alternatively, if $F_{0}$ and $G_{0}$ both possess Lebesgue density functions $f_{0}$ and $g_{0}$ and $(F_{0},G_{0})\in\mathscr{M}_{LR}$ , then $\theta_{0}=f_{0}/g_{0}$ on $\mathscr{G}_{0}$ . However, for the purpose of deriving a maximum likelihood estimator, we will demonstrate that these two cases do not need to be treated separately. Furthermore, in some applied settings, $F_{0}$ and $G_{0}$ are neither discrete nor continuous, but rather a mixture of discrete and continuous components, and we will derive results that apply in these situations as well. For instance, exposures that are bounded below may have positive mass at their lower boundary, and be continuous thereafter. Many biomarkers exhibit this property. Similarly, some measurements are “clumpy”, exhibiting positive mass at integers or other “round” numbers due to the measurement process, but also possessing positive Lebesgue density between such points. In all cases, $\theta_{0}$ has a meaningful interpretation as the ratio of the conditional odds of a sample being from the distribution $F_{0}$ to the unconditional odds of a sample being from $F_{0}$ .

3 Estimation under a likelihood ratio order

3.1 Maximum likelihood estimator

The pair $(F_{0},G_{0})$ determines the joint distribution of the observed data. Defining the nonparametric likelihood of the observed data as $L_{n}(F,G):=\prod_{i=1}^{n_{1}}\left[F(X_{i})-F(X_{i}-)\right]\prod_{j=1}^{n_{2}}\left[G(Y_{j})-G(Y_{j}-)\right]$ , the nonparametric maximum likelihood estimator of $(F_{0},G_{0})$ , i.e. in the model $\mathscr{M}_{NP}$ , is $(F_{n},G_{n})$ for $F_{n}$ the empirical distribution function of $X_{1},\dotsc,X_{n_{1}}$ , and $G_{n}$ the same of $Y_{1},\dotsc,Y_{n_{2}}$ . This suggests taking as an estimator of $\theta_{0}$ the plug-in estimator $\theta_{n}:=\theta_{F_{n},G_{n}}=\partial_{-}\mathrm{GCM}_{[0,1]}(F_{n}\circ G_{n}^{-})\circ G_{n}$ . The function $F_{n}\circ G_{n}^{-}$ is known as the empirical ordinal dominance curve, and its properties were studied by Hsieh and Turnbull (1996).
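A minimal sketch of the plug-in estimator at the observed values $y_{1},\dotsc,y_{m_{2}}$ follows. It relies on the observation that the empirical ordinal dominance curve is a non-decreasing step function, so its GCM on $[0,1]$ coincides with the GCM of the corner points $(G_{n}(y_{k}),F_{n}(y_{k}))$, $k=0,\dotsc,m_{2}$, with $(0,0)$ for $k=0$. The function names and toy samples are illustrative assumptions, and the lower-hull scan used for the GCM is one standard way to compute it, not necessarily the authors' implementation.

```python
from bisect import bisect_right
from fractions import Fraction


def ecdf(sample):
    """Empirical distribution function returning exact fractions."""
    xs = sorted(sample)
    return lambda t: Fraction(bisect_right(xs, t), len(xs))


def gcm_vertices(points):
    """Vertices of the greatest convex minorant of points with increasing x."""
    hull = []
    for px, py in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Drop the middle vertex if it lies on or above the new chord.
            if (y2 - y1) * (px - x1) >= (py - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def theta_plugin(x, y):
    """Left slope of the GCM of the empirical ODC at G_n(y_k), for each y_k."""
    Fn, Gn = ecdf(x), ecdf(y)
    ys = sorted(set(y))
    curve = [(Fraction(0), Fraction(0))] + [(Gn(v), Fn(v)) for v in ys]
    hull = gcm_vertices(curve)
    theta, seg = {}, 0
    for v in ys:
        t = Gn(v)
        while hull[seg + 1][0] < t:   # find the segment (x1, x2] containing t
            seg += 1
        (x1, y1), (x2, y2) = hull[seg], hull[seg + 1]
        theta[v] = (y2 - y1) / (x2 - x1)
    return theta


print(theta_plugin([2, 3, 3, 4], [1, 2, 2, 3]))
print(theta_plugin([1, 1, 3], [1, 2, 3]))
```

In the first toy sample the ratios of empirical masses are already non-decreasing, so the estimator returns them unchanged; in the second they are not, and the GCM pools them into a constant.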

In this section, we demonstrate, amongst other results, that $\theta_{n}$ is the maximum likelihood estimator of $\theta_{0}$ in the likelihood ratio ordered model $\mathscr{M}_{LR}$ . A maximum likelihood estimator of $(F_{0},G_{0})$ in $\mathscr{M}_{LR}$ is defined as $(F_{n}^{*},G_{n}^{*})\in\operatorname*{argmax}_{(F,G)\in\mathscr{M}_{LR}}L_{n}(F,G)$ , and a maximum likelihood estimator of $\theta_{0}$ is defined as $\theta_{n}^{*}:=\theta_{F_{n}^{*},G_{n}^{*}}$ .

We define $H_{n}(z):=\pi_{n}F_{n}(z)+(1-\pi_{n})G_{n}(z)$ , where $\pi_{n}:=n_{1}/n$ , as the empirical distribution function of the combined sample $X_{1},\dotsc,X_{n_{1}},Y_{1},\dotsc,Y_{n_{2}}$ , and $h_{k}:=H_{n}(y_{k})$ for $k=1,\dotsc,m_{2}$ . Our first result characterizes $(F_{n}^{*},G_{n}^{*})$ .

Theorem 2.

Let $A_{k}^{*}$ be the value at $h_{k}$ of the GCM over $[0,h_{m_{2}}]$ of $\{(h_{k},F_{n}(y_{k})):k=0,\dotsc,m_{2}\}$ and $B_{k}^{*}$ be the value at $h_{k}$ of the LCM over $[0,h_{m_{2}}]$ of $\left\{\left(h_{k},G_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$ . Then $G_{n}^{*}$ is a right-continuous step function with jumps at $y_{1},\dotsc,y_{m_{2}}$ with $G_{n}^{*}(y_{k})=B_{k}^{*}$ and $F_{n}^{*}$ is given by a right-continuous step function with jumps at $z_{1},\dotsc,z_{m}$ , where $F_{n}^{*}(y_{k})=A_{k}^{*}$ , and for any $x_{i}$ such that $y_{j-1}<x_{i}\leq y_{j}$ , where $y_{0}:=-\infty$ , the mass of $F_{n}^{*}$ at $x_{i}$ is given by

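$$F_{n}^{*}(x_{i})-F_{n}^{*}(x_{i}-)=[F_{n}^{*}(y_{j})-F_{n}^{*}(y_{j-1})]\frac{F_{n}(x_{i})-F_{n}(x_{i}-)}{F_{n}(y_{j})-F_{n}(y_{j-1})}.$$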

For any $x_{i}$ such that $y_{m_{2}}<x_{i}$ , the mass of $F_{n}^{*}$ at $x_{i}$ is given by

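$$F_{n}^{*}(x_{i})-F_{n}^{*}(x_{i}-)=[1-F_{n}^{*}(y_{m_{2}})]\frac{F_{n}(x_{i})-F_{n}(x_{i}-)}{1-F_{n}(y_{m_{2}})}.$$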

We also note that $F_{n}^{*}(y_{k})=\mathrm{GCM}_{[0,h_{m_{2}}]}(F_{n}\circ H_{n}^{-})(H_{n}(y_{k}))$ and $G_{n}^{*}(y_{k})=\mathrm{LCM}_{[0,h_{m_{2}}]}(G_{n}\circ H_{n}^{-})(H_{n}(y_{k}))$ .

A proof of Theorem 2, and proofs of all other theorems, are provided in Supplementary Material. We note that $F_{n}^{*}$ necessarily has jumps at all $x_{i}$ and at all $y_{j}$ such that $y_{j}\geq x_{1}$ , and $G_{n}^{*}$ has jumps at all $y_{j}$ . We also note that if there are $j$ such that no $x_{i}\in(y_{j},y_{j+1}]$ but $F_{n}^{*}(y_{j})>F_{n}^{*}(y_{j-1})$ , then there are infinitely many maximizers $F_{n}^{*}$ because any $F_{n}^{*}$ that assigns mass $F_{n}^{*}(y_{j})-F_{n}^{*}(y_{j-1})$ to the interval $(y_{j},y_{j+1}]$ yields the same likelihood and satisfies the constraints. In these cases, for the sake of uniqueness, we will put mass $F_{n}^{*}(y_{j})-F_{n}^{*}(y_{j-1})$ at the point $y_{j+1}$ .
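The values $F_{n}^{*}(y_{k})=A_{k}^{*}$ and $G_{n}^{*}(y_{k})=B_{k}^{*}$ in Theorem 2 can be computed with a few lines of code. The sketch below is an illustration under assumptions: the function names and toy samples are hypothetical, both samples are assumed non-empty, and the LCM is obtained as the negative of the GCM of the negated points. It does not redistribute the mass of $F_{n}^{*}$ over the individual $x_{i}$, which would follow from the two displayed formulas of the theorem.

```python
from bisect import bisect_right
from fractions import Fraction


def ecdf(sample):
    """Empirical distribution function returning exact fractions."""
    xs = sorted(sample)
    return lambda t: Fraction(bisect_right(xs, t), len(xs))


def gcm_vertices(points):
    """Vertices of the greatest convex minorant of points with increasing x."""
    hull = []
    for px, py in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (px - x1) >= (py - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((px, py))
    return hull


def interpolate(vertices, t):
    """Value at t of the piecewise-linear function through the vertices."""
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:]):
        if x1 <= t <= x2:
            return y1 + (y2 - y1) * (t - x1) / (x2 - x1)
    raise ValueError("t lies outside the range of the vertices")


def constrained_cdfs_at_y(x, y):
    """Return (ys, A, B) with A[k] = F_n^*(y_k) and B[k] = G_n^*(y_k)."""
    n1, n2 = len(x), len(y)
    pi_n = Fraction(n1, n1 + n2)
    Fn, Gn = ecdf(x), ecdf(y)
    ys = sorted(set(y))
    h = [Fraction(0)] + [pi_n * Fn(v) + (1 - pi_n) * Gn(v) for v in ys]
    F_vals = [Fraction(0)] + [Fn(v) for v in ys]
    G_vals = [Fraction(0)] + [Gn(v) for v in ys]
    gcm_F = gcm_vertices(list(zip(h, F_vals)))
    # LCM of the G points, computed as minus the GCM of the negated points.
    neg_lcm_G = gcm_vertices([(a, -b) for a, b in zip(h, G_vals)])
    A = [interpolate(gcm_F, t) for t in h[1:]]
    B = [-interpolate(neg_lcm_G, t) for t in h[1:]]
    return ys, A, B


print(constrained_cdfs_at_y([1, 1, 3], [1, 2, 3]))
print(constrained_cdfs_at_y([2, 3, 3, 4], [1, 2, 2, 3]))
```

In the first toy sample the constraint is active everywhere and both estimates coincide with the pooled empirical distribution function at the $y_{k}$; in the second the empirical distribution functions already satisfy the constraint and are returned unchanged.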

Theorem 2 implies the following result characterizing $\theta_{n}^{*}$ .

Corollary 1.

The points $\{(G_{n}^{*}(y_{k}),F_{n}^{*}(y_{k})):k=1,\dotsc,m_{2}\}$ lie on the GCM over $[0,1]$ of the empirical ordinal dominance curve

$$\{(G_{n}(y_{k}),F_{n}(y_{k})):k=0,\dotsc,m_{2}\},$$

where $y_{0}:=-\infty$ . Specifically, if $\left\{\left(h_{j_{k}},F_{n}(y_{j_{k}})\right):k=0,\dotsc,K\right\}$ are the vertices of the GCM of $\left\{\left(h_{k},F_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$ , then $\{(G_{n}(y_{j_{k}}),F_{n}(y_{j_{k}})):k=0,\dotsc,K\}$ are the vertices of the GCM of the empirical ordinal dominance curve. Therefore, $\theta_{n}^{*}:=\theta_{F_{n}^{*},G_{n}^{*}}$ is equal to $\theta_{n}:=\theta_{F_{n},G_{n}}$ .

Theorem 2 bears resemblance to, but is different than, Theorem 2.1 of Dykstra et al. (1995), which characterized the maximum likelihood estimator under a likelihood ratio order in the discrete case. Here, we perform the maximization over all pairs of univariate distribution functions $(F,G)$ such that $R_{F,G}=F\circ G^{-}$ is convex on the support of $G$ , whereas Theorem 2.1 of Dykstra et al. (1995) performed the maximization over $(F,G)$ with support contained in $\{z_{1},\dotsc,z_{m}\}$ and such that $[\Delta F(z_{j})]/[\Delta G(z_{j})]$ is nondecreasing. The first set is strictly larger than the second, which results in possibly different maximum likelihood estimators. In particular, our maximum likelihood estimator $G_{n}^{*}$ is only supported on $y_{1},\dotsc,y_{m_{2}}$ , whereas the maximum likelihood estimator of $G_{0}$ derived by Dykstra et al. (1995) may have support on $x_{j}$ that are not equal to any $y_{1},\dotsc,y_{m_{2}}$ . This difference makes sense in the context of our respective problem formulations: Dykstra et al. (1995) assumed that the supports of $F_{0}$ and $G_{0}$ are subsets of $\{z_{1},\dotsc,z_{m}\}$ , while we do not assume the supports are known a priori. In Supplementary Material, we illustrate the use of Theorem 2 using hypothetical data in which our maximum likelihood estimators $F_{n}^{*}$ and $G_{n}^{*}$ are different from those of Dykstra et al. (1995).

3.2 Representation as a transformation of isotonic regression

Dykstra et al. (1995) and Carolan and Tebbs (2005) provided representations of the maximum likelihood estimators of $F_{0}$ and $G_{0}$ in terms of isotonic regression in the discrete and continuous cases, respectively. Here, we show that $\theta_{n}^{*}$ can also be represented as a transformation of an isotonic regression, which aids in deriving its asymptotic properties. We let $D_{1},\dotsc,D_{n}$ be independent Bernoulli random variables with common probability $\pi_{0}$ and such that $n_{1}=\sum_{i=1}^{n}D_{i}$ . Letting $j_{1},\dotsc,j_{n_{1}}$ be the indices such that $D_{j_{i}}=1$ for each $i$ , we then define $Z_{j_{i}}:=X_{i}$ for each $i=1,\dotsc,n_{1}$ . Similarly, letting $k_{1},\dotsc,k_{n_{2}}$ be the indices such that $D_{k_{i}}=0$ for each $i$ , we define $Z_{k_{i}}:=Y_{i}$ for each $i=1,\dotsc,n_{2}$ . Defining the data unit $\mathbf{O}_{i}:=(Z_{i},D_{i})$ , observing the independent samples $X_{1},\dotsc,X_{n_{1}}$ from $F_{0}$ and $Y_{1},\dotsc,Y_{n_{2}}$ from $G_{0}$ is then equivalent to observing independent observations $\mathbf{O}_{1},\dotsc,\mathbf{O}_{n}$ from $P_{0}$ , where $P_{0}$ satisfies

$$P_{0}(Z\leq z,D=d)=d\pi_{0}F_{0}(z)+(1-d)(1-\pi_{0})G_{0}(z).$$

Thus, $Z_{1},\dotsc,Z_{n}$ represent the pooled values of $X_{1},\dotsc,X_{n_{1}},Y_{1},\dotsc,Y_{n_{2}}$ , and each $D_{i}$ represents an indicator that $Z_{i}$ corresponds to a sample from $F_{0}$ . Furthermore, $F_{0}(z)=P_{0}(Z\leq z\mid D=1)$ , $G_{0}(z)=P_{0}(Z\leq z\mid D=0)$ , and $\pi_{0}:=P_{0}(D=1)$ . Estimating $\theta_{0}$ given the independent samples $X_{1},\dotsc,X_{n_{1}}$ and $Y_{1},\dotsc,Y_{n_{2}}$ is therefore equivalent to estimating $\theta_{0}$ given independent observations $\mathbf{O}_{1},\dotsc,\mathbf{O}_{n}$ from $P_{0}$ , where $n_{1}:=\sum_{i=1}^{n}D_{i}$ .

The benefit to the above reframing of the problem is that $\theta_{0}$ , $F_{0}$ , and $G_{0}$ can then be written as transformations of $P_{0}$ . First, we have that $\theta_{0}(z)=T(\mu_{0}(z))/T(\pi_{0})$ , where $\mu_{0}(z):=P_{0}(D=1\mid Z=z)$ and $T:[0,1)\to\mathbb{R}^{+}$ is the odds transformation, defined as $T(\mu):=\mu/(1-\mu)$ . Since $T$ is strictly increasing, $\theta_{0}$ is monotone if and only if $\mu_{0}$ is. Since the maximum likelihood estimator of $\mu_{0}$ under the assumption that $\mu_{0}$ is non-decreasing is given by the isotonic regression $\mu_{n}^{*}$ of $D_{1},\dotsc,D_{n}$ on $Z_{1},\dotsc,Z_{n}$ , and the maximum likelihood estimator of $\pi_{0}$ is given by $\pi_{n}$ , the maximum likelihood estimator of $\theta_{0}(z)$ is then given by $T(\mu_{n}^{*}(z))/T(\pi_{n})$ . It is straightforward to see that this form of the maximum likelihood estimator is equivalent to the forms given above. In the next section, we will utilize this form of $\theta_{n}^{*}$ to derive its asymptotic properties and to construct asymptotic confidence intervals.
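As an illustration of this representation, the following Python sketch computes $T(\mu_{n}^{*}(z))/T(\pi_{n})$ with a hand-written pool-adjacent-violators routine. The function names are ours, $\pi_{n}$ is taken to be $n_{1}/n$ (an assumption about notation introduced earlier in the paper), and the inputs are the hypothetical samples from the supplementary example of Theorem 2.

```python
from fractions import Fraction

def isotonic_regression(z, d):
    """Non-decreasing least-squares fit of d on z (PAVA), pooling tied z values first."""
    groups = {}
    for zi, di in zip(z, d):
        s, w = groups.get(zi, (0, 0))
        groups[zi] = (s + di, w + 1)
    blocks = []  # each block holds [sum of d, number of observations, z values]
    for zi in sorted(groups):
        s, w = groups[zi]
        blocks.append([Fraction(s), w, [zi]])
        # merge adjacent blocks while their means violate monotonicity
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s2, w2, z2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += w2
            blocks[-1][2] += z2
    return {zi: s / w for s, w, zs in blocks for zi in zs}

def odds(mu):
    """Odds transformation T(mu) = mu / (1 - mu); returns infinity at mu = 1."""
    return float("inf") if mu == 1 else mu / (1 - mu)

def theta_hat(z, d):
    """Transformed isotonic regression T(mu*_n(z)) / T(pi_n) at each observed z."""
    mu = isotonic_regression(z, d)
    pi_n = Fraction(sum(d), len(d))  # assumed to equal n_1 / n
    return {zi: odds(m) / odds(pi_n) for zi, m in mu.items()}

z = [-1, 2, 3, 3, 0, 0, 1, 3, 3, 6]   # pooled values: four X's, then six Y's
d = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]    # indicator of a draw from F_0
theta = theta_hat(z, d)
```

On these data the sketch gives $1/2$ at $z\in\{-1,0,1\}$ and $3/2$ at $z\in\{2,3,6\}$, which agrees with the values of $\theta_{n}^{*}$ derived in that example.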

4 Asymptotic results

4.1 Discrete distributions

We first consider the situation where $G_{0}$ has finite support $\mathscr{G}_{0}$ and $\theta_{0}$ is strictly increasing on $\mathscr{G}_{0}$ . The next result demonstrates that in this case, $F_{n}^{*}$ and $G_{n}^{*}$ are asymptotically equivalent to $F_{n}$ and $G_{n}$ , respectively, and $\theta_{n}^{*}$ is asymptotically equivalent to the ratio of the empirical masses on the support of $G_{0}$ .

Theorem 3 (Discrete distributions).

Suppose that the support $\mathscr{G}$ of $G_{0}$ is a finite set $\{y_{1}<y_{2}<\cdots<y_{m_{2}}\}$ and that $[F_{0}(y_{j})-F_{0}(y_{j-1})]/\Delta G_{0}(y_{j})<[F_{0}(y_{j+1})-F_{0}(y_{j})]/\Delta G_{0}(y_{j+1})$ for $j=1,\dotsc,m_{2}-1$ , where $y_{0}:=-\infty$ . Then $F_{n}^{*}=F_{n}$ and $G_{n}^{*}=G_{n}$ with probability tending to one, so that with probability tending to one $\theta_{n}^{*}$ is a left-continuous step function with jumps at $y_{1},\dotsc,y_{m_{2}-1}$ and $\theta_{n}^{*}(y_{j})=[F_{n}(y_{j})-F_{n}(y_{j-1})]/\Delta G_{n}(y_{j})$ and $\theta_{n}^{*}(z)=0$ for $z<y_{1}$ . As a result, $n^{1/2}[\theta_{n}^{*}(y_{j})-\theta_{0}(y_{j})]\operatorname*{\stackrel{{\scriptstyle d}}{{\longrightarrow}}}N(0,\sigma_{0}^{2}(y_{j}))$ for

[TABLE]

where $F_{0,j}:=F_{0}(y_{j})-F_{0}(y_{j-1})$ .
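The ratio of empirical masses appearing in Theorem 3 can be computed directly. The sketch below is illustrative only: `mass_ratio` is a hypothetical helper name and the two samples are invented so that the empirical ratios are already increasing, which is the situation that Theorem 3 says holds with probability tending to one.

```python
from fractions import Fraction

def ecdf(sample, z):
    """Empirical distribution function of `sample` evaluated at z."""
    return Fraction(sum(s <= z for s in sample), len(sample))

def mass_ratio(x, y):
    """[F_n(y_j) - F_n(y_{j-1})] / [G_n(y_j) - G_n(y_{j-1})] at each distinct y_j, with y_0 = -infinity."""
    ratios = {}
    prev_f, prev_g = Fraction(0), Fraction(0)
    for yj in sorted(set(y)):
        f, g = ecdf(x, yj), ecdf(y, yj)
        ratios[yj] = (f - prev_f) / (g - prev_g)
        prev_f, prev_g = f, g
    return ratios

x = [1, 2, 2, 3, 3, 3]   # invented sample from F_0
y = [1, 1, 1, 2, 2, 3]   # invented sample from G_0
ratios = mass_ratio(x, y)
```

When the computed ratios fail to be non-decreasing in $y_{j}$, the constrained estimator differs from this unconstrained ratio, so the sketch should be read only as the limiting form described in the theorem.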

We note that the above result does not require that $F_{0}$ be discrete as well, or be dominated by $G_{0}$ . If $F_{0}$ is dominated by $G_{0}$ , then $\theta_{0}=\Delta F_{0}/\Delta G_{0}$ corresponds to the ratio of the mass functions.

4.2 Continuous distributions

Now we address the situation where $F_{0}$ and $G_{0}$ are both absolutely continuous on $\mathscr{G}_{0}$ and $\theta_{0}$ , which now corresponds to the ratio $f_{0}/g_{0}$ of the density functions, is strictly increasing. We first consider the large-sample behavior of $F_{n}^{*}$ and $G_{n}^{*}$ .

Theorem 4.

Suppose that $G_{0}$ is supported on a bounded interval $[a,b]\subset\mathbb{R}$ , that $F_{0}$ and $G_{0}$ possess continuous density functions $f_{0}$ and $g_{0}$ on $[a,b]$ such that $f_{0}/g_{0}$ is strictly increasing on $[a,b]$ , and $g_{0}(z)\geq\kappa>0$ on $[a,b]$ . Then $\|G_{n}^{*}-G_{n}\|_{\infty}=o_{P}(n^{-1/2})$ and $\|F_{n}^{*}-F_{n}\|_{\infty}=o_{P}(n^{-1/2})$ .

Theorem 4 demonstrates that when $\theta_{0}$ is strictly increasing, the maximum likelihood estimators of the individual distribution functions are asymptotically equivalent to the empirical distribution functions at the rate $n^{-1/2}$ , and hence possess the same limit distributions as the empirical distribution functions. This result is proved using the functional delta method and the results of Beare and Fang (2017), who demonstrated that the LCM operation is a directionally Hadamard differentiable mapping at any concave function.

We now turn to large-sample results for $\theta_{n}^{*}$ at points $z$ where both $F_{0}$ and $G_{0}$ possess Lebesgue density functions $f_{0}$ and $g_{0}$ , respectively. First, consistency of $\mu_{n}^{*}$ implies consistency of $\theta_{n}^{*}$ .

Theorem 5 (Consistency).

If $f_{0}$ is continuous at $x$ , $g_{0}$ is continuous at $x$ , and $g_{0}(x)>0$ , then $\theta_{n}^{*}(x)\operatorname*{\stackrel{{\scriptstyle P}}{{\longrightarrow}}}\theta_{0}(x)$ . If $f_{0}$ and $g_{0}$ are uniformly continuous on $\mathscr{G}_{0}$ , then $\sup_{x\in I}|\theta_{n}^{*}(x)-\theta_{0}(x)|\operatorname*{\stackrel{{\scriptstyle P}}{{\longrightarrow}}}0$ for any strict sub-interval $I\subsetneq\mathscr{G}_{0}$ .

We recall that, at any $z$ such that $h_{0}=\pi_{0}f_{0}+(1-\pi_{0})g_{0}$ is positive and continuous in a neighborhood of $z$ , $\mu_{0}(z)\in(0,1)$ , and $\mu_{0}$ is continuously differentiable in a neighborhood of $z$ , it holds that

[TABLE]

where $W$ follows Chernoff’s distribution, defined as the point of maximum of $Z(u)-u^{2}$ for $Z$ a two-sided standard Brownian motion originating from zero (Brunk, 1970; Groeneboom and Jongbloed, 2014). We can then use the delta-method to see that

[TABLE]

The scale parameter in the above limit distribution is equal to $\left[4\kappa_{0}(z)\theta_{0}^{\prime}(z)\right]^{1/3}$ for

[TABLE]

This yields the following result.

Theorem 6 (Pointwise convergence in distribution).

Suppose that, in a neighborhood of $z$ , $\theta_{0}$ is continuously differentiable with $\theta_{0}^{\prime}(z)>0$ , and $f_{0}$ and $g_{0}$ are positive and continuous. Then $n^{1/3}[\theta_{n}^{*}(z)-\theta_{0}(z)]\operatorname*{\stackrel{{\scriptstyle d}}{{\longrightarrow}}}\left[4\kappa_{0}(z)\theta_{0}^{\prime}(z)\right]^{1/3}W.$

Theorem 6 reflects certain common tradeoffs in the monotonicity-constrained literature. In particular, it indicates that the non-smoothed estimator converges pointwise at the $n^{-1/3}$ rate. In contrast, the smoothed estimator proposed by Yu et al. (2017) converges at the faster $n^{-2/5}$ rate, albeit under stronger smoothness assumptions. While Yu et al. (2017) did not propose a method for conducting inference, smoothed estimators typically possess an asymptotic bias that complicates the task of performing valid inference. In contrast, the limit distribution in Theorem 6 has mean zero, which we can use to construct asymptotically valid confidence intervals. Defining $\tau_{n}(z)$ as an estimator of $\tau_{0}(z):=\kappa_{0}(z)\theta_{0}^{\prime}(z)$ and $q_{p}$ as the $p$ quantile of $W$ , a $100(1-\alpha)\%$ Wald-type confidence interval for $\theta_{0}(z)$ is given by $\theta_{n}^{*}(z)\pm[4\tau_{n}(z)/n]^{1/3}q_{1-\alpha/2}$ . If $\tau_{n}(z)\operatorname*{\stackrel{{\scriptstyle P}}{{\longrightarrow}}}\tau_{0}(z)$ , then this interval has asymptotic coverage of $100(1-\alpha)\%$ . The quantiles of $W$ were computed by Groeneboom and Wellner (2001), and in particular $q_{0.975}\approx 0.9982$ .
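The Wald-type interval reduces to a one-line computation once $\theta_{n}^{*}(z)$ and $\tau_{n}(z)$ are available. The following sketch assumes both are supplied by the user; the function name `wald_interval` is ours, and the default quantile is the value $0.9982$ quoted above for a 95% interval.

```python
def wald_interval(theta_hat, tau_hat, n, q=0.9982):
    """Interval theta_hat +/- (4 * tau_hat / n)^(1/3) * q.

    theta_hat : point estimate of theta_0(z)
    tau_hat   : non-negative estimate of kappa_0(z) * theta_0'(z)
    n         : combined sample size
    q         : quantile of Chernoff's distribution (0.9982 for 95% coverage)
    """
    if tau_hat < 0:
        raise ValueError("tau_hat must be non-negative")
    half_width = (4.0 * tau_hat / n) ** (1.0 / 3.0) * q
    return theta_hat - half_width, theta_hat + half_width

lower, upper = wald_interval(theta_hat=1.0, tau_hat=2.0, n=1000)
```

With these illustrative inputs the half-width is $(0.008)^{1/3}\times 0.9982\approx 0.1996$. Estimating $\tau_{0}(z)$ itself is the difficult part and is not addressed by this sketch.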

In practice, we recommend an alternative method for constructing confidence intervals for $\theta_{0}(z)$ . We recommend first constructing confidence intervals for $\mu_{0}(z)$ using either of two existing methods, then transforming these intervals into intervals for $\theta_{0}(z)$ . Specifically, if $[\ell_{n}(z),u_{n}(z)]$ represents a $100(1-\alpha)\%$ confidence interval for $\mu_{0}(z)$ , then we take $[T(\ell_{n}(z))/T(\pi_{n}),$ $T(u_{n}(z))/T(\pi_{n})]$ as a $100(1-\alpha)\%$ confidence interval for $\theta_{0}(z)$ . Two existing ways to construct $[\ell_{n}(z),u_{n}(z)]$ are Wald-type intervals with plug-in estimation of nuisance parameters and intervals based on likelihood ratio tests. The former intervals are analogous to the Wald-type interval above, but are based on the limit distribution for $n^{1/3}[\mu_{n}^{*}(z)-\mu_{0}(z)]$ given in (2). Alternatively, confidence intervals obtained by inverting likelihood ratio tests, proposed first by Banerjee and Wellner (2001) and studied further by, e.g. Banerjee (2007) and Groeneboom and Jongbloed (2015), can be formed based on the limiting distribution of twice the log of the ratio of the likelihoods of the maximum likelihood estimator and a suitably constrained maximum likelihood estimator. Since this limiting distribution is pivotal, meaning it does not depend on any unknown features of the true distribution, this approach does not require estimating any unknown nuisance parameters. We therefore expect this method to have better finite-sample properties than intervals based on plug-in estimation of nuisance parameters.
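The transformation step can be sketched as follows. The interval $[\ell_{n}(z),u_{n}(z)]$ for $\mu_{0}(z)$ is assumed to come from one of the two methods named above; the function names and the numerical inputs are illustrative only.

```python
def odds(p):
    """Odds transformation T(p) = p / (1 - p); returns infinity at p = 1."""
    return float("inf") if p >= 1.0 else p / (1.0 - p)

def transform_interval(lower, upper, pi_n):
    """Map an interval [lower, upper] for mu_0(z) to [T(lower)/T(pi_n), T(upper)/T(pi_n)]."""
    if not (0.0 <= lower <= upper <= 1.0):
        raise ValueError("need 0 <= lower <= upper <= 1")
    scale = odds(pi_n)
    return odds(lower) / scale, odds(upper) / scale

# Illustrative inputs: an interval [0.25, 0.5] for mu_0(z) and pi_n = 0.4
theta_lower, theta_upper = transform_interval(0.25, 0.5, 0.4)
```

Because $T$ is strictly increasing, the endpoints keep their order after transformation; here the interval $[0.25,0.5]$ for $\mu_{0}(z)$ maps to approximately $[0.5,1.5]$ for $\theta_{0}(z)$.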

5 Numerical studies

In Supplementary Material, we present results of two simulation studies in the cases where $F_{0}$ and $G_{0}$ are fully discrete and fully continuous. In short, these studies confirm the validity of our large-sample theory and demonstrate that the maximum likelihood estimator and various proposed methods of conducting inference perform well in both cases. Here, we present the results of a numerical study illustrating the behavior of $\theta_{n}^{*}$ when $F_{0}$ and $G_{0}$ are mixed discrete-continuous distributions. We note that our asymptotic results did not address the behavior of $\theta_{n}^{*}$ at mass points in mixed discrete-continuous distributions; to the best of our knowledge, no such results yet exist for monotone estimators. We use this numerical study to explore this important case.

We simulated $Y$ as a mixed discrete-continuous random variable with probability 1/9 each of being 0, 0.5 and 1, and probability 2/3 of being from the uniform distribution on $[0,1]$ , and simulated $X$ as a mixed discrete-continuous random variable with probabilities 1/18, 1/9, and 3/18 of being 0, 0.5, and 1, respectively and probability 2/3 of being from the density function $x\mapsto I_{[0,1]}(x)(0.5+x)$ . We then have that $\theta_{0}(x)=0.5+x$ for $x\in[0,1]$ . We set $\pi_{0}:=0.4$ . For each combined sample size $n\in\{500,1K,5K,10K\}$ , we simulated $1000$ datasets, and in each dataset we computed the maximum likelihood estimator, the maximum smoothed likelihood estimator of Yu et al. (2017), and the non-monotone estimator based on kernel density estimates for each $z\in\{0,0.05,\dotsc,0.95,1\}$ . We constructed confidence intervals at each $z$ using the transformed plug-in and likelihood ratio-based methods described in Section 4.2. To estimate the scale parameter in the limit distribution of $\mu_{n}^{*}(z)$ as defined in equation 2, we used the plug-in estimator $\mu_{n}^{*}(z)$ for $\mu_{0}(z)$ and estimated $\mu_{0}^{\prime}(z)/h_{0}(z)=(\mu_{0}\circ H_{0}^{-1})^{\prime}\circ H_{0}(z)$ using the derivative of a local linear smoother of $\mu_{n}^{*}\circ H_{n}^{-}$ evaluated at $H_{n}(z)$ .
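A data-generating sketch for this design is given below. It is not the authors' simulation code: the function names, the use of Python's `random` module, and the seed are our choices. The continuous part of $X$ is drawn by inverting the distribution function $x\mapsto 0.5x+0.5x^{2}$ of the density $0.5+x$ on $[0,1]$.

```python
import math
import random

def inv_cdf_x(u):
    """Inverse of 0.5*x + 0.5*x**2 on [0, 1], the CDF of the density 0.5 + x."""
    return (-1.0 + math.sqrt(1.0 + 8.0 * u)) / 2.0

def draw_y(rng):
    """Mass 1/9 at each of 0, 0.5 and 1; otherwise uniform on [0, 1]."""
    if rng.random() < 1.0 / 3.0:
        return rng.choice([0.0, 0.5, 1.0])
    return rng.random()

def draw_x(rng):
    """Masses 1/18, 1/9 and 3/18 at 0, 0.5 and 1; otherwise density 0.5 + x on [0, 1]."""
    u = rng.random()
    if u < 1.0 / 18.0:
        return 0.0
    if u < 3.0 / 18.0:
        return 0.5
    if u < 6.0 / 18.0:
        return 1.0
    return inv_cdf_x(rng.random())

def simulate(n, pi0=0.4, seed=0):
    """Draw n pooled observations (Z_i, D_i) with P(D = 1) = pi0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        d = 1 if rng.random() < pi0 else 0
        z = draw_x(rng) if d == 1 else draw_y(rng)
        data.append((z, d))
    return data

sample = simulate(500)
```

Under this design the ratio of point masses is $0.5$, $1$, and $1.5$ at $0$, $0.5$, and $1$, matching $\theta_{0}(x)=0.5+x$ at those points.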

In addition to the properties of the estimators listed above, we also investigated the properties of the general sample-splitting procedure proposed by Banerjee et al. (2019). Given a generic monotone estimator $\gamma_{n}$ of a monotone function $\gamma_{0}$ such that $n^{1/3}[\gamma_{n}(z)-\gamma_{0}(z)]\operatorname*{\stackrel{{\scriptstyle d}}{{\longrightarrow}}}G$ for $G$ a mean-zero distribution with finite variance, Banerjee et al. (2019) proposed randomly splitting the sample into $m$ subsets of roughly equal size, computing monotone estimates $\gamma_{n,1},\dotsc,\gamma_{n,m}$ in each subset, then defining $\bar{\gamma}_{n,m}(z):=\frac{1}{m}\sum_{j=1}^{m}\gamma_{n,j}(z)$ . They demonstrated that if $m>1$ is fixed, then under mild conditions $\bar{\gamma}_{n,m}(z)$ has strictly better asymptotic mean squared error than $\gamma_{n}(z)$ , and that for moderate $m$ , $\bar{\gamma}_{n,m}(z)\pm\sigma_{n,m}(z)t_{1-\alpha/2,m-1}/\sqrt{m}$ forms an asymptotic $100(1-\alpha)\%$ confidence interval for $\gamma_{0}(z)$ , where $\sigma_{n,m}^{2}(z):=\frac{1}{m-1}\sum_{j=1}^{m}[\gamma_{n,j}(z)-\bar{\gamma}_{n,m}(z)]^{2}$ and $t_{1-\alpha/2,m-1}$ is the $1-\alpha/2$ quantile of the $t$ -distribution with $m-1$ degrees of freedom. Therefore, $\bar{\gamma}_{n,m}(z)$ is preferable to $\gamma_{n}(z)$ for two reasons: it has better asymptotic mean squared error, and asymptotically valid pointwise confidence intervals for $\gamma_{0}$ based on $\bar{\gamma}_{n,m}$ can be formed without estimating any nuisance parameters. They also studied the asymptotic properties of $\bar{\gamma}_{n,m_{n}}(z)$ when $m_{n}$ grows with $n$ . In our simulation study, we considered the estimator $\bar{\theta}_{n,m}$ defined as $\bar{\theta}_{n,m}(z):=\frac{1}{m}\sum_{j=1}^{m}\theta_{n,j}^{*}(z)$ , where $\theta_{n,j}^{*}$ is the maximum likelihood estimator in the $j$ th subset, and the corresponding confidence intervals defined above. We only considered the situation where $m\in\{5,10\}$ is fixed and does not grow with the sample size.
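The averaging and interval construction can be sketched as below. The per-subset estimates are assumed to have been computed already by whatever monotone estimator is in use; the function names are ours, and the $t$ quantile is passed in by the caller (for $m=5$ and a 95% interval, $t_{0.975,4}\approx 2.776$).

```python
import math
import random
import statistics

def split_indices(n, m, seed=0):
    """Randomly partition the indices 0, ..., n-1 into m subsets of roughly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::m] for j in range(m)]

def split_interval(estimates, t_quantile):
    """Average of the m subset estimates and the interval mean +/- sd * t / sqrt(m)."""
    m = len(estimates)
    center = statistics.mean(estimates)
    sd = statistics.stdev(estimates)  # uses the 1/(m-1) normalisation
    half_width = sd * t_quantile / math.sqrt(m)
    return center, (center - half_width, center + half_width)

parts = split_indices(n=10, m=5)

# Invented subset estimates at a single point z, for illustration only
center, (lower, upper) = split_interval([1.0, 2.0, 3.0, 4.0, 5.0], t_quantile=2.776)
```

No nuisance parameter enters the computation: the spread of the subset estimates plays the role that the estimated scale parameter plays in the plug-in interval.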

We now turn to the results of the simulation study. The left panel of Figure 1 displays the distribution of $\theta_{n}^{*}(z)-\theta_{0}(z)$ for $z\in[0,1]$ and $n=10K$ . These distributions are approximately centered around 0 for $z\in(0,1)$ , but not for $z\in\{0,1\}$ . Hence, despite the positive mass at the boundaries, the maximum likelihood estimator does not appear to be consistent at the boundaries. This is a common problem among monotonicity-constrained estimators, and various correction procedures have been proposed and could be considered in this context (see, e.g. Woodroofe and Sun, 1993; Kulikov and Lopuhaä, 2006).

The right panel of Figure 1 displays the ratio of the standard deviation of $r_{n}[\theta_{n}^{*}(z)-\theta_{0}(z)]$ to the standard deviation of the asymptotic distributions derived in Section 4 for $z\neq 0,1$ . For $z=0.5$ , $r_{n}=n^{1/2}$ and the asymptotic distribution is that of the fully discrete case presented in Section 4.1, though we note that the results presented in that section do not apply here due to the mixed discrete-continuous nature of $F_{0}$ and $G_{0}$ here. Otherwise, $r_{n}=n^{1/3}$ and the asymptotic distribution is that of the continuous case presented in Section 4.2. We see that, for $z\neq 0.5$ , the empirical standard error approaches the asymptotic standard deviation as $n$ grows. However, for $z=0.5$ , the empirical standard error is converging to a limit that is strictly smaller than the asymptotic standard deviation. This suggests that, at points that have both positive mass and positive density in a neighborhood of the point, the maximum likelihood estimator gains efficiency from the positive density. In addition, points of continuity near the mass point also experience finite-sample efficiency gains.

Figure 2 shows the ratio of the mean squared errors of the maximum smoothed likelihood estimator, the kernel density-based estimator, and the sample splitting estimators to that of the maximum likelihood estimator. The maximum smoothed likelihood estimator is slightly more efficient than the maximum likelihood estimator at continuity points, but is less efficient around mass points. Furthermore, the relative performance of the maximum likelihood estimator at positive mass points increases as the sample size grows. The kernel density estimator is generally less efficient than the maximum likelihood estimator, especially near mass points, and the discrepancy also grows with the sample size.

For large enough $n$ , the sample splitting estimator is more efficient than the maximum likelihood estimator at all points at which the latter is consistent. The relative improvement of $\bar{\theta}_{n,m}$ grows with the number of splits $m$ , as does the sample size $n$ required for $\bar{\theta}_{n,m}$ to outperform $\theta_{n}^{*}$ .

Figure 3 shows the empirical coverage of 95% confidence intervals for $\theta_{0}(z)$ constructed using the plug-in method described in Section 4.2, the inverted likelihood ratio test approach of Banerjee and Wellner (2001), and the sample splitting approach of Banerjee et al. (2019) described above. We note that the likelihood ratio approach does not provide intervals at the end points $z=0$ or $z=1$ . The plug-in method is conservative in large samples near mass points, but anti-conservative at some points of positive density. This is because the plug-in method is designed to work when the distributions are fully continuous, and estimation of the required nuisance parameters in the limit distribution fails in the presence of mass points. The likelihood ratio method is conservative in smaller samples, but approaches nominal coverage in large samples for points $z$ of absolute continuity. The sample splitting method with $m=5$ has adequate coverage for all sample sizes except for $z$ close to the boundaries. The sample splitting method with $m=10$ (and similarly for $m=20$ , which is not shown) appears to require very large sample sizes to attain adequate coverage over a large range of $z$ . We note that the sample splitting methods were able to achieve good coverage in large samples at both interior absolutely continuous points and interior mass points, without the user specifying which points are which.

6 Analysis of C-reactive protein for predicting bacterial infection

In this section, we use the methods presented herein to assess the use of the biomarker C-reactive protein (CRP) for determining the presence or absence of bacterial infection in children with systemic inflammatory response syndrome (SIRS). The Optimizing Antibiotic Strategies in Sepsis (OASIS) II study enrolled a prospective observational cohort of children under the age of nineteen at the pediatric intensive care unit at The Children’s Hospital of Philadelphia from August 2012 to June 2016 (Downes et al., 2018). Patients were enrolled in the study if they presented signs of SIRS, were started on a new broad-spectrum antibiotic for suspected bacterial infection, and had blood cultures taken within six hours of SIRS onset. A primary goal of the study was to assess whether CRP, which had previously been found to be predictive of bacterial infection (Downes et al., 2017), could be used to determine when antibiotic therapy could be safely ended. Additional details of the study design and results of the primary analysis may be found in Downes et al. (2018).

We analyzed all patients in the OASIS II cohort with measured biomarkers and bacterial infection status to assess the odds of bacterial infection as a function of CRP value. Some patients had measurements at multiple episodes; since all such episodes were at least 30 days apart, we treated these episodes as independent of one another. We analyzed a total of $n=504$ CRP measurements among 443 unique patients, with $n_{1}=202$ bacterial infections among 191 unique patients and $n_{2}=302$ non-infections among 266 unique patients.

Since CRP has previously been found to be predictive of bacterial infection in this patient population, there is scientific reason to believe that the density ratio order holds. We therefore computed the MLE of the density ratio function and corresponding 95% likelihood ratio-based pointwise confidence intervals and the sample splitting estimator of Banerjee et al. (2019) with $m=5$ splits and corresponding 95% pointwise confidence intervals.

Figure 4 displays the estimated odds of bacterial infection given CRP value relative to the population odds of bacterial infection and 95% pointwise confidence intervals. We find that values of CRP under 1 are indicative of roughly quartered odds of infection relative to the population odds of infection, and values of CRP greater than 20 are indicative of roughly doubled odds of infection relative to the population odds. Values of CRP between 1 and 20 do not clearly indicate that a patient’s odds of infection are larger or smaller than the population odds.

7 Discussion

In this article, we have considered nonparametric maximum likelihood inference for the density ratio function and the individual distribution functions under the assumption that the density ratio is nondecreasing. We applied these methods to the analysis of the biomarker C-reactive protein for predicting bacterial infection in children with systemic inflammatory response syndrome. The methods apply broadly to biomarker analysis, as well as other areas of biomedical research.

One of our important contributions is the ability to deal with discrete, continuous, and mixed discrete-continuous distributions. Such distributions arise frequently in applied settings, and in particular in the context of biomarker analysis. Furthermore, we have demonstrated via numerical studies that sample splitting provides good pointwise confidence interval coverage without knowing which values correspond to discrete mass points and which correspond to points of Lebesgue continuity of the underlying densities, which is important because in practice analysts may not have such knowledge a priori. However, the precise asymptotic behavior of the estimator at mass points remains unknown, and a theoretical treatment would be an interesting topic for future research.

Acknowledgments

The authors thank Craig Boge for help compiling the OASIS II data. The authors also gratefully acknowledge the constructive comments of the editors and anonymous reviewers and the support of the CDC Epicenters program (KJD), NICHD grant K23HD091365 (KJD), the Center for Pediatric Clinical Effectiveness and the Pediatric IDEAS Research Group of the Children’s Hospital of Philadelphia (KJD, TW), and the Department of Pediatrics of the University of Pennsylvania Perelman School of Medicine.

Example of the use of Theorem 2

We first illustrate the use of Theorem 2 and Corollary 1 using hypothetical data. Suppose that $(Y_{1},\dots,Y_{6})=(0,0,1,3,3,6)$ and $(X_{1},\dotsc,X_{4})=(-1,2,3,3)$ . We first derive $F_{n}^{*}$ . The points $\{(H_{n}(y_{k}),F_{n}(y_{k})):k=0,\dotsc,m_{2}\}$ are given by $\{(0,0),$ $(0.3,0.25),$ $(0.4,0.25),$ $(0.9,1),$ $(1,1)\}$ , and its GCM is given by $\{(0,0),$ $(0.3,3/16),$ $(0.4,1/4),$ $(0.9,7/8),$ $(1,1)\}$ . This is displayed in the upper left panel of Figure 5. The values of the GCM imply that $F_{n}^{*}(0)=3/16$ , $F_{n}^{*}(1)=1/4$ , $F_{n}^{*}(3)=7/8$ , and $F_{n}^{*}(6)=1$ . We then have that $F_{n}^{*}(-1)=F_{n}^{*}(-\infty)+[F_{n}^{*}(0)-F_{n}^{*}(-\infty)]\frac{F_{n}(-1)-F_{n}(-1-)}{F_{n}(0)-F_{n}(-\infty)}=[3/16]\frac{1/4}{1/4}=3/16$ and $F_{n}^{*}(2)=F_{n}^{*}(1)+[F_{n}^{*}(3)-F_{n}^{*}(1)]\frac{F_{n}(2)-F_{n}(2-)}{F_{n}(3)-F_{n}(1)}=1/4+[5/8]\frac{1/4}{3/4}=11/24$ . The estimators $F_{n}$ and $F_{n}^{*}$ are compared in the upper right panel of Figure 5.

We next derive $G_{n}^{*}$ . The points $\{(H_{n}(y_{k}),G_{n}(y_{k})):k=0,\dotsc,m_{2}\}$ are given by $\{(0,0),$ $(0.3,1/3),$ $(0.4,1/2),$ $(0.9,5/6),$ $(1,1)\}$ , and its LCM is given by $\{(0,0),$ $(0.3,3/8),$ $(0.4,1/2),$ $(0.9,11/12),$ $(1,1)\}$ . This is displayed in the center left panel of Figure 5. The values of the LCM imply that $G_{n}^{*}(0)=3/8$ , $G_{n}^{*}(1)=1/2$ , $G_{n}^{*}(3)=11/12$ , and $G_{n}^{*}(6)=1$ . The estimators $G_{n}$ and $G_{n}^{*}$ are compared in the center right panel of Figure 5.
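The GCM and LCM values quoted above can be checked with exact rational arithmetic. The sketch below is ours and is not taken from the authors' software: it computes the lower convex hull of the listed points with a standard stack-based scan, obtains the LCM by negating the ordinates, and then repeats the interpolation step for $F_{n}^{*}(2)$.

```python
from fractions import Fraction as Fr

def gcm_vertices(points):
    """Vertices of the greatest convex minorant of points with strictly increasing x."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop the middle point if it lies on or above the chord from hull[-2] to p
            if (y2 - y1) * (p[0] - x1) >= (p[1] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def interpolate(vertices, x):
    """Value at x of the piecewise-linear function through the given vertices."""
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:]):
        if x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    raise ValueError("x outside the range of the vertices")

def gcm_values(points):
    vertices = gcm_vertices(points)
    return [interpolate(vertices, x) for x, _ in points]

def lcm_values(points):
    """Least concave majorant, obtained as minus the GCM of the negated ordinates."""
    return [-v for v in gcm_values([(x, -y) for x, y in points])]

h = [Fr(0), Fr(3, 10), Fr(2, 5), Fr(9, 10), Fr(1)]   # H_n(y_k), k = 0, ..., 4
f_n = [Fr(0), Fr(1, 4), Fr(1, 4), Fr(1), Fr(1)]      # F_n(y_k)
g_n = [Fr(0), Fr(1, 3), Fr(1, 2), Fr(5, 6), Fr(1)]   # G_n(y_k)

f_star = gcm_values(list(zip(h, f_n)))   # F*_n at y = -inf, 0, 1, 3, 6
g_star = lcm_values(list(zip(h, g_n)))   # G*_n at y = -inf, 0, 1, 3, 6

# Interpolation for the X value 2, which lies between y = 1 and y = 3
f_star_at_2 = f_star[2] + (f_star[3] - f_star[2]) * Fr(1, 4) / (f_n[3] - f_n[2])
```

The computed lists reproduce $F_{n}^{*}=(3/16,1/4,7/8,1)$ and $G_{n}^{*}=(3/8,1/2,11/12,1)$ at $y\in\{0,1,3,6\}$ , together with $F_{n}^{*}(2)=11/24$ .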

Finally, we derive $\theta_{n}^{*}$ . The empirical ordinal dominance curve is given by the points $\{(0,0)$ , $(1/3,1/4)$ , $(1/2,1/4)$ , $(5/6,1)$ , $(1,1)\}$ , and the vertices of its GCM are given by $\{(0,0)$ , $(1/2,1/4)$ , $(1,1)\}$ . This is displayed in the bottom left panel of Figure 5. The left-hand slopes of the GCM are $1/2$ on the interval $(0,1/2]$ and $3/2$ on the interval $(1/2,1]$ , which implies that $\theta_{n}^{*}(z)=1/2$ for $z\in(-\infty,1]$ and $\theta_{n}^{*}(z)=3/2$ for $z\in(1,\infty)$ . This is displayed in the bottom right panel of Figure 5.
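The vertices and left-hand slopes of the GCM of the ordinal dominance curve can be verified in the same way. This is again an illustrative sketch with names of our choosing; it evaluates slopes only at arguments in $(0,1]$ .

```python
from fractions import Fraction as Fr

def gcm_vertices(points):
    """Vertices of the greatest convex minorant of points with strictly increasing x."""
    hull = []
    for p in points:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop the middle point if it lies on or above the chord from hull[-2] to p
            if (y2 - y1) * (p[0] - x1) >= (p[1] - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def left_slope(vertices, u):
    """Left-hand slope at u in (0, 1] of the piecewise-linear function through the vertices."""
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:]):
        if x1 < u <= x2:
            return (y2 - y1) / (x2 - x1)
    raise ValueError("u must lie in (0, 1]")

# Empirical ordinal dominance curve: points (G_n(y_k), F_n(y_k))
odc = [(Fr(0), Fr(0)), (Fr(1, 3), Fr(1, 4)), (Fr(1, 2), Fr(1, 4)),
       (Fr(5, 6), Fr(1)), (Fr(1), Fr(1))]
vertices = gcm_vertices(odc)

# Left-hand slopes at G_n(0), G_n(1), G_n(3) and G_n(6)
slopes = [left_slope(vertices, u) for u in (Fr(1, 3), Fr(1, 2), Fr(5, 6), Fr(1))]
```

The slopes at $G_{n}(0)$ and $G_{n}(1)$ equal $1/2$ and those at $G_{n}(3)$ and $G_{n}(6)$ equal $3/2$ , consistent with the step function described above.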

We note that the maximum likelihood estimators $\hat{F}_{n}$ of $F_{0}$ and $\hat{G}_{n}$ of $G_{0}$ derived in Dykstra et al. (1995) for the fully discrete case are different than $F_{n}^{*}$ and $G_{n}^{*}$ . In particular, both $\hat{F}_{n}$ and $\hat{G}_{n}$ have jumps at all the unique values of the data $\{-1,0,1,2,3,6\}$ with values $\hat{F}_{n}(-1)=1/16$ , $\hat{F}_{n}(0)=3/16$ , $\hat{F}_{n}(1)=1/4$ , $\hat{F}_{n}(2)=3/8$ , and $\hat{F}_{n}(3)=7/8$ , and $\hat{F}_{n}(6)=1$ ; and $\hat{G}_{n}(-1)=1/8$ , $\hat{G}_{n}(0)=3/8$ , $\hat{G}_{n}(1)=1/2$ , $\hat{G}_{n}(2)=7/12$ , $\hat{G}_{n}(3)=11/12$ , and $\hat{G}_{n}(6)=1$ . However, the maximum likelihood estimator $\hat{\theta}_{n}(z)=\Delta\hat{F}_{n}(z)/\Delta\hat{G}_{n}(z)$ is equal to $\theta_{n}^{*}(z)$ for each $z\in\{-1,0,1,2,3,6\}$ .
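The agreement between the two density ratio estimators can be confirmed numerically from the values listed above. The sketch below only re-does this arithmetic; it does not implement the estimator of Dykstra et al. (1995), and the variable names are ours.

```python
from fractions import Fraction as Fr

support = [-1, 0, 1, 2, 3, 6]
F_hat = [Fr(1, 16), Fr(3, 16), Fr(1, 4), Fr(3, 8), Fr(7, 8), Fr(1)]
G_hat = [Fr(1, 8), Fr(3, 8), Fr(1, 2), Fr(7, 12), Fr(11, 12), Fr(1)]

def increments(cdf_values):
    """Jump sizes of a discrete distribution function given its values on the support."""
    return [b - a for a, b in zip([Fr(0)] + cdf_values[:-1], cdf_values)]

# Ratio of jumps of the two discrete estimators at each support point
ratio = [df / dg for df, dg in zip(increments(F_hat), increments(G_hat))]

# theta*_n from the example: 1/2 for z <= 1 and 3/2 for z > 1
theta_star = [Fr(1, 2) if z <= 1 else Fr(3, 2) for z in support]
```

The two lists coincide, so the estimators of the individual distribution functions differ while the implied density ratio estimates agree on the observed values.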

Proof of Theorems

Proof of Theorem 1.

We first suppose that $F\ll G$ and $\nu$ is non-decreasing on $\mathscr{G}$ , and we show that $F(A)G(B)\leq F(B)G(A)$ for all measurable $A\leq B$ . Recall that $F(A)=\int_{A}\,dF$ when $A$ is a set, and $A\leq B$ means that $a\leq b$ for all $a\in A$ and $b\in B$ . Since $F\ll G$ , we have that $F(x)=\int_{-\infty}^{x}\nu(u)\,dG(u)$ for all $x$ . We then have by Fubini’s Theorem that

[TABLE]

Now since $\nu$ is non-decreasing and $x\leq y$ for all $x\in A$ and $y\in B$ , we have

[TABLE]

Finally, applying Fubini’s Theorem again yields

[TABLE]

Next, we suppose that $F(A)G(B)\leq F(B)G(A)$ for all measurable $A\leq B$ , and we show that $R_{F,G}$ is convex on $\mathrm{Im}(G)$ . Let $t,u,v\in\mathrm{Im}(G)$ , where $t<v$ and $u=\lambda t+(1-\lambda)v$ for $\lambda\in(0,1)$ . We then let $A=(G^{-}(t),G^{-}(u)]$ and $B=(G^{-}(u),G^{-}(v)]$ , which are both Borel sets satisfying $A\leq B$ since $G^{-}$ is necessarily non-decreasing. We then have $F(A)=F(G^{-}(u))-F(G^{-}(t))=R_{F,G}(u)-R_{F,G}(t)$ and similarly $F(B)=R_{F,G}(v)-R_{F,G}(u)$ . In addition, since $G(G^{-}(z))=z$ for any $z\in\mathrm{Im}(G)$ , we also have $G(A)=G(G^{-}(u))-G(G^{-}(t))=u-t=(1-\lambda)(v-t)$ and similarly $G(B)=v-u=\lambda(v-t)$ . We then have by assumption that

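$$\lambda(v-t)\left[R_{F,G}(u)-R_{F,G}(t)\right]=F(A)G(B)\leq F(B)G(A)=(1-\lambda)(v-t)\left[R_{F,G}(v)-R_{F,G}(u)\right].$$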

Therefore, $\lambda\left[R_{F,G}(u)-R_{F,G}(t)\right]\leq(1-\lambda)\left[R_{F,G}(v)-R_{F,G}(u)\right]$ , which implies that $R_{F,G}(u)\leq\lambda R_{F,G}(t)+(1-\lambda)R_{F,G}(v)$ , which shows that $R_{F,G}$ is convex on $\mathrm{Im}(G)$ .

Finally, we suppose that $F\ll G$ , $R:=R_{F,G}$ is convex on $\mathrm{Im}(G)$ , and $\nu$ is continuous on $\mathscr{G}$ , and we show that $\nu$ is nondecreasing on $\mathscr{G}$ . This is the most difficult of the three implications. The basic argument amounts to using convexity of $R$ to compare the slopes of chords or sequences of chords, and to relate these slopes to values of $\nu$ . Let $x,y\in\mathscr{G}$ with $x<y$ . Suppose that we can find sequences $\{z_{j}\}_{j\geq 1}$ and $\{w_{j}\}_{j\geq 1}$ such that $s_{j}:=[R(G(x))-R(G(z_{j}))]/[G(x)-G(z_{j})]$ converges to $\nu(x)$ , $t_{j}:=[R(G(y))-R(G(w_{j}))]/[G(y)-G(w_{j})]$ converges to $\nu(y)$ , and $z_{j}\leq w_{j}$ for all $j$ large enough. Then, by convexity of $R$ , $s_{j}\leq t_{j}$ for all $j$ large enough, which implies that $\nu(x)\leq\nu(y)$ . The exact form of $\{z_{j}\}_{j\geq 1}$ and $\{w_{j}\}_{j\geq 1}$ depends on how $G$ looks near $x$ and $y$ . In particular, there are three cases for $y$ : (1) $G(y)>G(y-)$ and there exists $p\in[x,y)$ such that $G(y-)=G(p)$ ; (2) $G(y)>G(y-)$ but there is no $p\in[x,y)$ such that $G(y-)=G(p)$ ; and (3) $G(y)=G(y-)$ . We begin by specifying $\{w_{j}\}_{j\geq 1}$ in each case.

In case (1), we take $w_{j}=p$ for all $j$ . Since $F\ll G$ , we must have $F(G^{-}(G(p)))=F(y-)$ , so that $t_{j}=\nu(y)$ for all $j$ . In case (2), it must be that $G^{-}(G(y-))=y$ . In this case, there exists $\{w_{j}\}_{j\geq 1}$ increasing to $y$ such that $w_{j}\in(x,y)\cap\mathscr{G}$ for each $j$ , $G(w_{j})$ increases to $G(y-)$ and $F(w_{j})$ increases to $F(y-)$ . We then have that $R(G(w_{j}))$ increases to $F(G^{-}(G(y-))-)=F(y-)$ , so that $t_{j}$ increases to $[F(y)-F(y-)]/[G(y)-G(y-)]=\nu(y)$ . In case (3), we first note that $F(G^{-}(G(y)))=F(y)$ since $F\ll G$ . Additionally, since $y\in\mathscr{G}$ , there exist $\{w_{j}\}_{j\geq 1}$ in $\mathscr{G}$ with $G^{-}(G(w_{j}))=w_{j}$ for each $j$ that either (a) increases to $y$ and $G(w_{j})<G(y)$ for each $j$ , or (b) decreases to $y$ and $G(w_{j})>G(y)$ for each $j$ . In either case, we have

[TABLE]

For any $\varepsilon>0$ , by continuity of $\nu$ over $\mathscr{G}$ , we can find $m$ such that $j\geq m$ implies $|\nu(u)-\nu(y)|<\varepsilon$ for all $u\in[w_{j},y]\cap\mathscr{G}$ . If (a) holds and $t_{j}$ is bounded above, we then have $\int_{w_{j}}^{y}|\nu(u)-\nu(y)|\,dG(u)\leq\varepsilon[G(y)-G(w_{j})]$ for all $j\geq m$ , so that then $\lim_{j\to\infty}t_{j}=\nu(y)$ . If $t_{j}$ is not bounded above then $\nu(y)=+\infty$ , so that $\nu(x)\leq\nu(y)$ trivially. If (b) holds then $t_{j}$ is bounded below by zero, so by a similar calculation $\lim_{j\to\infty}t_{j}=\nu(y)$ .

The three cases for $x$ are similar: (1) $G(x)>G(x-)$ and there exists $q\in[-\infty,x)$ such that $G(x-)=G(q)$ ; (2) $G(x)>G(x-)$ but there is no such $q$ ; and (3) $G(x)=G(x-)$ . In case (1), we take $z_{j}=q$ for all $j$ . Since $F\ll G$ , we must have $F(G^{-}(G(q)))=F(x-)$ , so that $s_{j}=\nu(x)$ for all $j$ . In case (2), it must be that $G^{-}(G(x-))=x$ , and again there exists a sequence $\{z_{j}\}_{j\geq 1}$ increasing to $x$ such that $z_{j}\in(-\infty,x)\cap\mathscr{G}$ for each $j$ , $G(z_{j})$ increases to $G(x-)$ and $F(z_{j})$ increases to $F(x-)$ . We then have that $R(G(z_{j}))$ increases to $F(x-)$ , so that $s_{j}$ increases to $\nu(x)$ . In case (3), $F(G^{-}(G(x)))=F(x)$ , and since $x\in\mathscr{G}$ , there exists $\{z_{j}\}_{j\geq 1}$ in $\mathscr{G}$ with $G^{-}(G(z_{j}))=z_{j}$ for each $j$ that either (a) increases to $x$ and $G(z_{j})<G(x)$ for each $j$ , or (b) decreases to $x$ and $G(z_{j})>G(x)$ for each $j$ . If (a) holds and $s_{j}$ is bounded above, then $s_{j}$ converges to $\nu(x)$ by continuity of $\nu$ as before. If $s_{j}$ is not bounded above then $s_{j}$ converges to $\nu(x)=+\infty$ . If (b) holds then $s_{j}$ is bounded below by zero, so again $\lim_{j\to\infty}s_{j}=\nu(x)$ .

Of the nine pairings of cases for $y$ and cases for $x$ , the only situation in which it is not immediately clear that $z_{j}\leq w_{j}$ for all $j$ large enough is that $z_{j}$ decreases to $x$ (case 3b) and $w_{j}=p$ for all $j$ (case 1). However, we note that $x=p$ if and only if $G(x)=G(y-)$ , which would imply that case (3b) cannot hold for $x$ . Therefore, if $z_{j}$ decreases to $x$ and $w_{j}=p$ , then $p>x$ , so that $z_{j}<w_{j}$ for all $j$ large enough. This completes the argument.

Finally, we address statement (2) of the result: we suppose that $F\ll G$ and $\nu$ is continuous and non-decreasing on $\mathscr{G}$ , and we show that $\theta_{F,G}=\nu$ on $\mathscr{G}$ . By (1), $R$ is convex on $\mathrm{Im}(G)$ . First, we claim that $\mathrm{GCM}_{[0,1]}(R)=H$ , where $H:[0,1]\to[0,1]$ takes the following form. For any $u\in\mathrm{Im}(G)$ , $H(u):=R(u)$ . If $u\notin\mathrm{Im}(G)$ , then there exists $x\in\mathbb{R}$ and $\lambda\in[0,1)$ such that $u=\lambda G(x-)+(1-\lambda)G(x)$ . We then define $H(u):=\lambda R(G(x-)-)+(1-\lambda)R(G(x))$ . Thus, $H$ is the linear interpolation of $R|_{\mathrm{Im}(G)}$ to $[0,1]$ . In order to show that $H$ indeed equals $\mathrm{GCM}_{[0,1]}(R)$ , we need to show that (a) $H$ is convex, (b) $H\leq R$ , and (c) $H\geq\bar{H}$ for any other convex minorant of $R$ .

For (a), we let $u,v\in[0,1]$ and $p=\lambda u+(1-\lambda)v$ for $\lambda\in(0,1)$ . There then exist $u_{1}\leq u_{2}\leq p_{1}\leq p_{2}\leq v_{1}\leq v_{2}$ which are all elements of $\mathrm{Im}(G)$ and $\lambda_{1},\lambda_{2},\lambda_{3}\in[0,1]$ such that $u=\lambda_{1}u_{1}+(1-\lambda_{1})u_{2}$ , $v=\lambda_{2}v_{1}+(1-\lambda_{2})v_{2}$ , and $p=\lambda_{3}p_{1}+(1-\lambda_{3})p_{2}$ , and furthermore $H(u)=\lambda_{1}R(u_{1}-)+(1-\lambda_{1})R(u_{2})$ , $H(v)=\lambda_{2}R(v_{1}-)+(1-\lambda_{2})R(v_{2})$ , and $H(p)=\lambda_{3}R(p_{1}-)+(1-\lambda_{3})R(p_{2})$ . The remainder of the argument is best seen with a diagram. Let $U$ be the point $(u,H(u))$ , $U_{1}$ be the point $(u_{1},H(u_{1}))$ , and so on. By convexity of $R$ , the line segment $\overline{P_{1}P_{2}}$ lies below or on the line segment $\overline{U_{2}V_{1}}$ , which lies below or on $\overline{UV_{1}}$ , which lies below or on $\overline{UV}$ . Therefore, the point $(p,H(p))$ , which falls on $\overline{P_{1}P_{2}}$ , lies no higher than $(p,\lambda H(u)+(1-\lambda)H(v))$ , which falls on $\overline{UV}$ ; that is, $H(p)\leq\lambda H(u)+(1-\lambda)H(v)$ .

For (b), by definition, $H(u)=R(u)$ for any $u\in\mathrm{Im}(G)$ . If $u\notin\mathrm{Im}(G)$ , then $u=\lambda G(x-)+(1-\lambda)G(x)$ , and hence $G^{-}(u)=G^{-}(G(x))=x$ . As a result, $R(u)=R(G(x))>H(u)=\lambda R(G(x-)-)+(1-\lambda)R(G(x))$ .

We have now shown that $H$ is a convex minorant of $R$ . For (c), if $\bar{H}$ is another convex minorant of $R$ , then clearly $H(u)\geq\bar{H}(u)$ for all $u\in\mathrm{Im}(G)$ . If $u\notin\mathrm{Im}(G)$ , then $u=\lambda G(x-)+(1-\lambda)G(x)$ . If $G(x-)\in\mathrm{Im}(G)$ , then $\bar{H}(u)\leq\lambda H(G(x-))+(1-\lambda)H(G(x))\leq\lambda R(G(x-))+(1-\lambda)R(G(x))=H(u)$ . If $G(x-)\notin\mathrm{Im}(G)$ , then there must be an $\varepsilon>0$ such that $z\in\mathrm{Im}(G)$ for all $z\in(G(x-)-\varepsilon,G(x-))$ , so that $\bar{H}(u)\leq\lambda(z)R(z-)+(1-\lambda(z))R(G(x))$ for each $z\in(G(x-)-\varepsilon,G(x-))$ , where $\lambda(z)\in(0,1)$ and $\lambda(z)\to\lambda$ as $z\to G(x-)$ . Taking the limit as $z\to G(x-)$ , we have that $\bar{H}(u)\leq\lambda R(G(x-)-)+(1-\lambda)R(G(x))=H(u)$ .

We now have that $\theta_{F,G}(x)=(\partial_{-}H)(G(x))$ , so it remains to show that $(\partial_{-}H)(G(x))=\nu(x)$ for all $x\in\mathscr{G}$ . First, if $G(x)>G(x-)$ , then $H(u)=\lambda R(G(x-)-)+(1-\lambda)R(G(x))=\lambda F(x-)+(1-\lambda)F(x)$ for all $u=\lambda G(x-)+(1-\lambda)G(x)$ for $\lambda\in(0,1)$ . Therefore, $(\partial_{-}H)(u)=[F(x)-F(x-)]/[G(x)-G(x-)]=\nu(x)$ for all such $u$ , so that $(\partial_{-}H)(G(x))=\nu(x)$ . If instead $x\in\mathscr{G}$ and $G(x)=G(x-)$ then $H(G(x))=R(G(x)),$ and it is straightforward to see from the definition of $R$ that $(\partial_{-}R)(G(x))=\nu(x)$ . ∎

Proof of Theorem 2.

We first note that $L_{n}(F,G)=0$ for any $G$ such that $G(Y_{j})=G(Y_{j}-)$ for some $j\in\{1,\dotsc,n_{2}\}$ . As a result, we may restrict our attention to $G$ such that $G(Y_{j})>G(Y_{j}-)$ for all $j$ , which implies that $G^{-}$ has support at each $G(Y_{j})$ . For any such $G$ , we define $\bar{G}:=G\circ L$ , where $L(y):=\max\{Y_{j}:Y_{j}\leq y\}$ . We then have $\bar{G}(Y_{j})-\bar{G}(Y_{j}-)\geq G(Y_{j})-G(Y_{j}-)$ for each $j$ . Furthermore, the support of $\bar{G}^{-}$ is $\{G(Y_{j}):j=1,\dotsc,n_{2}\}$ , which is contained in the support of $G^{-}$ , $\bar{G}(Y_{j})=G(Y_{j})$ for each $j$ , and $F\circ G^{-}$ is by assumption convex on the support of $G^{-}$ . Therefore, $F\circ\bar{G}^{-}$ is convex on the support of $\bar{G}^{-}$ , so that $(F,\bar{G})\in\mathscr{M}_{0}$ and $L_{n}(F,\bar{G})\geq L_{n}(F,G)$ . Hence, we may further restrict our attention to $G$ which are discrete with jumps at $Y_{1},\dotsc,Y_{n_{2}}$ . By a similar argument, we can restrict our attention to $F$ which are discrete with jumps at $X_{1},\dotsc,X_{n_{1}}$ or $Y_{1},\dotsc,Y_{n_{2}}$ .

We define $y_{0}:=-\infty$ , and $u_{j}:=G(y_{j})$ , so that the support of $G^{-}$ for any discrete $G$ with jumps at $Y_{1},\dotsc,Y_{n_{2}}$ is $\{u_{j}:j=0,\dotsc,m_{2}\}$ , and $G^{-}(u_{j})=y_{j}$ . Defining $g_{j}:=u_{j}-u_{j-1}$ and $s_{j}$ the number of $Y_{k}$ such that $Y_{k}=y_{j}$ , we have $\prod_{j=1}^{n_{2}}\left[G(Y_{j})-G(Y_{j}-)\right]=\prod_{j=1}^{m_{2}}g_{j}^{s_{j}}$ . We then define $f_{j}:=F(y_{j})-F(y_{j}-)$ for each $j$ , and we note that $(F,G)\in\mathscr{M}_{0}$ if and only if $f_{1}/g_{1}\leq f_{2}/g_{2}\leq\cdots\leq f_{m_{2}}/g_{m_{2}}$ . Suppose that the values $f_{1},\dotsc,f_{m_{2}}$ are fixed in such a way as to satisfy these constraints. We denote by $\mathscr{I}_{j}:=\{k:x_{k}\in(y_{j-1},y_{j}]\}$ for $j=1,\dotsc,m_{2}+1$ , where $y_{m_{2}+1}:=+\infty$ , and by $r_{i}$ the number of $X_{k}$ such that $X_{k}=x_{i}$ . Noting that $\mathscr{I}_{1},\dotsc,\mathscr{I}_{m_{2}+1}$ are disjoint with union $\{1,\dotsc,m_{1}\}$ , we then have

[TABLE]

Additionally, for each $j\in\{1,\dotsc,m_{2}+1\}$ , we must have that $\sum_{k\in\mathscr{I}_{j}}\left[F(x_{k})-F(x_{k}-)\right]=f_{j}$ . Therefore, maximizing $L_{n}(F,G)$ with respect to $F$ with $f_{1},\dotsc,f_{m_{2}+1}$ fixed amounts to maximizing $\prod_{k\in\mathscr{I}_{j}}\left[F(x_{k})-F(x_{k}-)\right]^{r_{k}}$ subject to $\sum_{k\in\mathscr{I}_{j}}\left[F(x_{k})-F(x_{k}-)\right]=f_{j}$ for each $j$ . This implies that a maximizer $F_{n}^{*}$ must satisfy

$$F_{n}^{*}(x_{k})-F_{n}^{*}(x_{k}-)=f_{j}\,\frac{r_{k}}{\sum_{i\in\mathscr{I}_{j}}r_{i}}$$

for each $k\in\mathscr{I}_{j}$ . Therefore, $\prod_{k\in\mathscr{I}_{j}}\left[F_{n}^{*}(x_{k})-F_{n}^{*}(x_{k}-)\right]^{r_{k}}$ is proportional to $\prod_{k\in\mathscr{I}_{j}}f_{j}^{r_{k}}=f_{j}^{R_{j}}$ for $R_{j}:=\sum_{k\in\mathscr{I}_{j}}r_{k}$ , which is the number of $X_{i}$ in the interval $(y_{j-1},y_{j}]$ .

We note that if there is a $j$ such that no $x_{k}\in(y_{j-1},y_{j}]$ but $f_{j}>0$ , then there are infinitely many maximizers because any $F_{n}^{*}$ that assigns mass $f_{j}$ to the interval $(y_{j-1},y_{j}]$ yields the same likelihood and satisfies the constraints. In these cases, for the sake of uniqueness we will put mass $f_{j}$ at the point $y_{j}$ .

We have at this point reduced the problem to maximizing

[TABLE]

subject to $f_{1}/g_{1}\leq f_{2}/g_{2}\leq\cdots\leq f_{m_{2}}/g_{m_{2}}$ and $\sum_{k=1}^{m_{2}}g_{k}=\sum_{k=1}^{m_{2}+1}f_{k}=1$ . Letting $\bar{f}_{k}:=f_{k}/(1-f_{m_{2}+1})$ for $k\leq m_{2}$ , this is equivalent to maximizing

[TABLE]

subject to $\bar{f}_{1}/g_{1}\leq\bar{f}_{2}/g_{2}\leq\cdots\leq\bar{f}_{m_{2}}/g_{m_{2}}$ and $\sum_{k=1}^{m_{2}}g_{k}=\sum_{k=1}^{m_{2}}\bar{f}_{k}=1$ . The term involving $f_{m_{2}+1}$ is maximized for $f_{m_{2}+1}^{*}=R_{m_{2}+1}/n_{1}=1-F_{n}(y_{m_{2}})$ .

From this point we take a similar approach to that in Dykstra et al. (1995). We define $\bar{n}_{1}:=\sum_{k=1}^{m_{2}}R_{k}=F_{n}(y_{m_{2}})n_{1}$ , $\sigma_{k}:=\bar{n}_{1}\bar{f}_{k}+n_{2}g_{k}$ and $\rho_{k}:=\bar{n}_{1}\bar{f}_{k}/\sigma_{k}$ , so that $\bar{f}_{k}=\rho_{k}\sigma_{k}/\bar{n}_{1}$ and $g_{k}=(1-\rho_{k})\sigma_{k}/n_{2}$ . Optimizing $\bar{L}_{n}$ with respect to $\bar{f}_{1},\dotsc,\bar{f}_{m_{2}}$ and $g_{1},\dotsc,g_{m_{2}}$ such that $\sum_{k=1}^{m_{2}}\bar{f}_{k}=\sum_{k=1}^{m_{2}}g_{k}=1$ and $\bar{f}_{1}/g_{1}\leq\bar{f}_{2}/g_{2}\leq\cdots\leq\bar{f}_{m_{2}}/g_{m_{2}}$ is equivalent to optimizing

[TABLE]

such that $\sum_{k=1}^{m_{2}}\rho_{k}\sigma_{k}=\bar{n}_{1}$ , $\sum_{k=1}^{m_{2}}\sigma_{k}=\bar{n}_{1}+n_{2}$ , and $\rho_{1}\leq\cdots\leq\rho_{m_{2}}$ , where $\boldsymbol{\rho}:=(\rho_{1},\dotsc,\rho_{m_{2}})$ and $\boldsymbol{\sigma}:=(\sigma_{1},\dotsc,\sigma_{m_{2}})$ .

Now, $\prod_{k=1}^{m_{2}}\sigma_{k}^{R_{k}+s_{k}}$ such that $\sum_{k=1}^{m_{2}}\sigma_{k}=\bar{n}_{1}+n_{2}$ is maximized for $\sigma_{k}^{*}=R_{k}+s_{k}$ . Next, maximizing $\prod_{k=1}^{m_{2}}\rho_{k}^{R_{k}}(1-\rho_{k})^{s_{k}}$ with respect to $\rho_{1}\leq\cdots\leq\rho_{m_{2}}$ is equivalent to maximizing

$$\sum_{k=1}^{m_{2}}w_{k}\left[t_{k}\log\rho_{k}+(1-t_{k})\log(1-\rho_{k})\right]$$

for $w_{k}:=R_{k}+s_{k}\geq 1$ and $t_{k}:=R_{k}/w_{k}$ . By Theorem 2.1 and Exercise 2.21 of Groeneboom and Jongbloed (2014), the maximizer $(\rho_{1}^{*},\dotsc,\rho_{m_{2}}^{*})$ of this expression over all $\rho_{1}\leq\cdots\leq\rho_{m_{2}}$ is given by the weighted isotonic regression of $t_{1},\dotsc,t_{m_{2}}$ with weights $w_{1},\dotsc,w_{m_{2}}$ . By Lemma 2.1 of Groeneboom and Jongbloed (2014), $\rho_{k}^{*}$ is equal to the left derivative of the GCM of the set of points

$$\left\{\left(n_{1}F_{n}(y_{k})+n_{2}G_{n}(y_{k}),\,n_{1}F_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$$

evaluated at $n_{1}F_{n}(y_{k})+n_{2}G_{n}(y_{k})$ . We note that $\sum_{k=1}^{m_{2}}w_{k}\rho_{k}^{*}=\sum_{k=1}^{m_{2}}\sigma_{k}^{*}\rho_{k}^{*}=n_{1}F_{n}(y_{m_{2}})=\bar{n}_{1}$ . Therefore, we have that $L_{n}(\boldsymbol{\rho},\boldsymbol{\sigma})\leq L_{n}(\boldsymbol{\rho}^{*},\boldsymbol{\sigma}^{*})$ for all $\boldsymbol{\rho}$ such that $\rho_{1}\leq\cdots\leq\rho_{m_{2}}$ and $\boldsymbol{\sigma}$ such that $\sum_{k=1}^{m_{2}}\sigma_{k}=\bar{n}_{1}+n_{2}$ . Since $\boldsymbol{\rho}^{*}$ and $\boldsymbol{\sigma}^{*}$ also satisfy $\sum_{k=1}^{m_{2}}\sigma_{k}^{*}\rho_{k}^{*}=\bar{n}_{1}$ , this implies that $(\boldsymbol{\rho}^{*},\boldsymbol{\sigma}^{*})$ is an optimizer of $\bar{L}_{n}$ over the set of stated constraints.

We now have that $f_{k}^{*}=(R_{k}+s_{k})(\rho_{k}^{*}/n_{1})$ and $g_{k}^{*}=(R_{k}+s_{k})(1-\rho_{k}^{*})/n_{2}$ . Since $w_{k}=R_{k}+s_{k}$ , this implies that $F_{n}^{*}(y_{k})=\bar{A}_{k}/n_{1}$ and $G_{n}^{*}(y_{k})=[n_{2}G_{n}(y_{k})+n_{1}F_{n}(y_{k})-\bar{A}_{k}]/n_{2}$ , where $\bar{A}_{k}$ is the value of the GCM of the set of points defined above at $n_{1}F_{n}(y_{k})+n_{2}G_{n}(y_{k})$ . We note that $\bar{A}_{k}/n_{1}=A_{k}^{*}$ , for $A_{k}^{*}$ the value of the GCM of $\left\{\left(\pi_{n}F_{n}(y_{k})+(1-\pi_{n})G_{n}(y_{k}),F_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$ evaluated at $\pi_{n}F_{n}(y_{k})+(1-\pi_{n})G_{n}(y_{k})$ . Additionally, $[n_{2}G_{n}(y_{k})+n_{1}F_{n}(y_{k})-\bar{A}_{k}]/n_{2}=B_{k}^{*}$ for $B_{k}^{*}$ the value of the LCM of $\left\{\left(\pi_{n}F_{n}(y_{k})+(1-\pi_{n})G_{n}(y_{k}),G_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$ at $\pi_{n}F_{n}(y_{k})+(1-\pi_{n})G_{n}(y_{k})$ . ∎
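The reduction above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it forms the counts $R_{k}$ and $s_{k}$ , runs a weighted isotonic regression of $t_{k}=R_{k}/w_{k}$ with weights $w_{k}=R_{k}+s_{k}$ by pool-adjacent-violators, and maps $\rho_{k}^{*}$ back to $f_{k}^{*}$ and $g_{k}^{*}$ . The function names `pava` and `lr_order_mle` are hypothetical, and the samples are assumed to be plain numeric arrays.

```python
import numpy as np

def pava(t, w):
    """Weighted non-decreasing isotonic regression (pool-adjacent-violators)."""
    blocks = []  # each block holds [weighted mean, total weight, number of points]
    for ti, wi in zip(t, w):
        blocks.append([float(ti), float(wi), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m_b, w_b, c_b = blocks.pop()
            m_a, w_a, c_a = blocks.pop()
            w_ab = w_a + w_b
            blocks.append([(m_a * w_a + m_b * w_b) / w_ab, w_ab, c_a + c_b])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

def lr_order_mle(x, y):
    """Masses f_k*, g_k* at the unique Y values, plus the mass of F above max(Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    ys, s = np.unique(y, return_counts=True)      # y_1 < ... < y_{m2} and counts s_k
    # Index j (0-based) with x_i in (y_{j-1}, y_j]; equals m2 when x_i > y_{m2}.
    idx = np.searchsorted(ys, x, side="left")
    R = np.bincount(idx, minlength=len(ys) + 1)   # R_1, ..., R_{m2+1}
    R_in, R_tail = R[:-1], R[-1]
    w = R_in + s                                  # weights w_k = R_k + s_k >= 1
    t = R_in / w                                  # responses t_k = R_k / w_k
    rho = pava(t, w)                              # rho_k^*
    f = w * rho / n1                              # f_k^* = (R_k + s_k) rho_k^* / n1
    g = w * (1.0 - rho) / n2                      # g_k^* = (R_k + s_k)(1 - rho_k^*) / n2
    return ys, f, g, R_tail / n1
```

Because every $s_{k}\geq 1$ , each pooled value of $\rho_{k}^{*}$ is strictly below one, so the ratio `f / g` is finite and non-decreasing at every $y_{k}$ in this sketch.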

Proof of Corollary 1.

From the proof of Theorem 2, we have that $F_{n}^{*}(y_{k})=A_{k}^{*}$ and $G_{n}^{*}(y_{k})=G_{n}(y_{k})+\frac{\pi_{n}}{1-\pi_{n}}[F_{n}(y_{k})-A_{k}^{*}]$ . Let $j_{0},\dotsc,j_{K}$ denote the indices of the vertices of the GCM of $\left\{\left(\pi_{n}F_{n}(y_{k})+(1-\pi_{n})G_{n}(y_{k}),F_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$ . Then $F_{n}^{*}(y_{j_{k}})=F_{n}(y_{j_{k}})$ and $G_{n}^{*}(y_{j_{k}})=G_{n}(y_{j_{k}})$ for each $k=0,\dotsc,K$ . It is also straightforward to see that $\{(h_{k},A_{k}):k=0,\dotsc,m_{2}\}$ is a convex minorant of $\{(h_{k},F_{n}(y_{k})):k=0,\dotsc,m_{2}\}$ if and only if $\{(G_{n}(y_{k}),A_{k}):k=0,\dotsc,m_{2}\}$ is a convex minorant of $\{(G_{n}(y_{k}),F_{n}(y_{k})):k=0,\dotsc,m_{2}\}$ . Therefore, $\{(G_{n}(y_{j_{k}}),F_{n}(y_{j_{k}})):k=0,\dotsc,K\}$ form the vertices of the GCM of $\left\{\left(G_{n}(y_{k}),F_{n}(y_{k})\right):k=0,\dotsc,m_{2}\right\}$ . ∎
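The vertex characterization in this proof can be checked on small examples. The sketch below is illustrative only: it finds the vertices of the greatest convex minorant of a finite set of points with strictly increasing abscissae using a standard lower convex hull scan, and then evaluates the GCM by linear interpolation. The names `gcm_vertices` and `gcm_values` are hypothetical helpers, not part of the paper.

```python
import numpy as np

def gcm_vertices(u, v):
    """Indices of the GCM vertices of the points (u_k, v_k); u strictly increasing."""
    hull = []
    for k in range(len(u)):
        # Drop the last vertex while it lies on or above the chord from the
        # vertex before it to the new point k (a non-convex or collinear turn).
        while len(hull) >= 2:
            i, j = hull[-2], hull[-1]
            cross = (u[j] - u[i]) * (v[k] - v[i]) - (v[j] - v[i]) * (u[k] - u[i])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(k)
    return hull

def gcm_values(u, v):
    """Values of the GCM at each u_k, by interpolating between the vertices."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    idx = gcm_vertices(u, v)
    return np.interp(u, u[idx], v[idx])
```

For instance, with $(G_{n}(y_{k}),F_{n}(y_{k}))$ equal to $(0,0)$ , $(1/3,2/3)$ , $(1,1)$ , the middle point lies above the chord and is not a vertex, so the GCM is the straight line from $(0,0)$ to $(1,1)$ .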

Proof of Theorem 3.

We note that $G_{n}(y_{j})>0$ for each $j$ with probability tending to one. Then, since the support $\mathscr{G}$ of $G_{0}$ is finite, with probability tending to one the empirical ODC is a left-continuous step function with vertices at $(0,0),(G_{n}(y_{1}),F_{n}(y_{1})),$ $\dotsc,(G_{n}(y_{m_{2}}),F_{n}(y_{m_{2}}))$ , where we note that $G_{n}(y_{m_{2}})=1$ almost surely. We define

[TABLE]

which is positive by assumption. We then have

[TABLE]

Now since $F_{n}$ is uniformly consistent for $F_{0}$ and $G_{n}$ is uniformly consistent for $G_{0}$ , and $\Delta G_{0}(y_{j})>0$ for each $j$ , the second through fifth lines above are $o_{P}(1)$ uniformly over $j$ . Therefore,

[TABLE]

which implies that

[TABLE]

for all $j=1,\dotsc,m_{2}-1$ with probability tending to one. Therefore, with probability tending to one, the diagram of points $(0,0),(G_{n}(y_{1}),F_{n}(y_{1}))$ , $\dotsc,(G_{n}(y_{m_{2}}),F_{n}(y_{m_{2}}))$ is convex. Now by Corollary 1, the points $(G_{n}^{*}(y_{k}),F_{n}^{*}(y_{k}))$ lie on the GCM of $(0,0),(G_{n}(y_{1}),F_{n}(y_{1}))$ , $\dotsc,(G_{n}(y_{m_{2}}),F_{n}(y_{m_{2}}))$ . But these points being convex means that they are equal to their GCM, so that with probability tending to one $G_{n}^{*}(y_{j})=G_{n}(y_{j})$ and $F_{n}^{*}(y_{j})=F_{n}(y_{j})$ for each $j$ . We can then see by Theorem 2 that $\Delta F_{n}^{*}(x_{i})=\Delta F_{n}(x_{i})$ with probability tending to one as well, so that $F_{n}^{*}=F_{n}$ with probability tending to one. We then have with probability tending to one that

[TABLE]

for each $j=1,\dotsc,m_{2}$ . Since the GCM of $R_{F_{n},G_{n}}$ is with probability tending to one piecewise linear with knots at the $y_{j}$ and $\theta_{n}^{*}=\partial_{-}\mathrm{GCM}_{[0,1]}(R_{F_{n},G_{n}})\circ G_{n}$ , we then have with probability tending to one that $\theta_{n}^{*}$ is a left-continuous step function with jumps at the $y_{j}$ . Also, since $G_{n}(z)=0$ for $z<y_{1}$ and $R_{F_{n},G_{n}}(u)=0$ for all $u\leq 0$ , $\theta_{n}^{*}(z)=0$ for $z<y_{1}$ .

We now have

[TABLE]

Using the notation introduced in Section 3.2, this can be written as

[TABLE]

where $W_{i}=(W_{i1},\dotsc,W_{i4})^{T}$ for $W_{i1}=I(D_{i}=1,y_{j-1}<Z_{i}\leq y_{j})$ , $W_{i2}=I(D_{i}=1)$ , $W_{i3}=I(D_{i}=0,y_{j-1}<Z_{i}\leq y_{j})$ , and $W_{i4}=I(D_{i}=0)$ , and $g(w_{1},w_{2},w_{3},w_{4})=\frac{w_{1}/w_{2}}{w_{3}/w_{4}}$ . By the Central Limit Theorem,

[TABLE]

where the $(j,k)$ element of the covariance matrix $V_{0}$ equals $E_{0}(W_{j}W_{k})-E_{0}(W_{j})E_{0}(W_{k})$ . Applying the delta method to the function $g$ yields (after some algebra) $n^{1/2}[\theta_{n}^{*}(y_{j})-\theta_{0}(y_{j})]\operatorname*{\stackrel{{\scriptstyle d}}{{\longrightarrow}}}N\left(0,\sigma_{0}^{2}(y_{j})\right)$ , where $\sigma_{0}^{2}(y_{j})$ equals

[TABLE]

∎
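The delta method step can be made concrete with a small numerical sketch. Under the assumption that $P(D=1)=\pi$ , $\Delta F_{0}(y_{j})=p$ and $\Delta G_{0}(y_{j})=q$ (the argument names `pi`, `p`, `q` and the function name are illustrative, not the paper's notation), the code builds the mean and covariance of the indicator vector $W$ , evaluates the gradient of $g$ at the mean, and returns the quadratic form $\nabla g^{T}V_{0}\nabla g$ . It is a generic illustration of the technique rather than a transcription of the paper's variance formula.

```python
import numpy as np

def delta_method_variance(p, q, pi):
    """Return (theta, asymptotic variance of n^{1/2}(g(W_bar) - theta))."""
    # Means of W1 = I(D=1, Z in cell), W2 = I(D=1), W3 = I(D=0, Z in cell), W4 = I(D=0).
    mu = np.array([pi * p, pi, (1.0 - pi) * q, 1.0 - pi])
    # Second moments E[W_a W_b]; indicators satisfy W^2 = W.
    second = np.zeros((4, 4))
    second[np.diag_indices(4)] = mu
    second[0, 1] = second[1, 0] = mu[0]   # W1 * W2 = W1
    second[2, 3] = second[3, 2] = mu[2]   # W3 * W4 = W3
    # All other products are zero because D = 1 and D = 0 are disjoint events.
    V = second - np.outer(mu, mu)
    theta = (mu[0] / mu[1]) / (mu[2] / mu[3])
    # Gradient of g(w) = w1 w4 / (w2 w3), evaluated at the mean.
    grad = theta * np.array([1.0 / mu[0], -1.0 / mu[1], -1.0 / mu[2], 1.0 / mu[3]])
    return theta, float(grad @ V @ grad)
```

Here `theta` equals $p/q$ , the ratio of the two mass functions at $y_{j}$ , and both $p$ and $q$ must be strictly positive for the gradient to be defined.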

Proof of Theorem 4.

We note that $f_{0}(z)/g_{0}(z)\leq f_{0}(z^{\prime})/g_{0}(z^{\prime})$ for all $z<z^{\prime}$ in $[a,b]$ implies that

[TABLE]

which implies that $z\mapsto\pi_{0}f_{0}(z)/[\pi_{0}f_{0}(z)+(1-\pi_{0})g_{0}(z)]$ is non-decreasing on $[a,b]$ . Therefore,

[TABLE]

is non-increasing on $H_{0}([a,b])=[0,1]$ . Hence, $G_{0}\circ H_{0}^{-1}$ is concave on $[0,1]$ , so

[TABLE]

We now note that since $G_{n}\circ H_{n}^{-}(u)\geq G_{n}(y_{m_{2}})=1$ for any $u\geq h_{m_{2}}$ , $\mathrm{LCM}_{[0,h_{m_{2}}]}(G_{n}\circ H_{n}^{-})=\mathrm{LCM}_{[0,1]}(G_{n}\circ H_{n}^{-})$ . Furthermore, since $G_{n}^{*}$ only jumps at $y_{1},\dotsc,y_{m_{2}}$ , we have

[TABLE]

for any $y\in\mathbb{R}$ , where $\tilde{H}_{n}:=\pi_{n}F_{n}\circ L_{n}+(1-\pi_{n})G_{n}$ for $L_{n}(z):=G_{n}^{-}\circ G_{n}(z)=\max\{Y_{j}:Y_{j}\leq z\}$ .

Using the notation of Section 3.2, we can write

[TABLE]

for $\omega_{x}(d,z):=dI(z\leq x)$ , and similarly $(1-\pi_{n})G_{n}(y)=\mathbb{P}_{n}\eta_{y}$ for $\eta_{y}(d,z):=(1-d)I(z\leq y)$ . We also have $P_{0}\omega_{x}=\pi_{0}F_{0}(x)$ and $P_{0}\eta_{y}=(1-\pi_{0})G_{0}(y)$ . By standard empirical process theory, we therefore have that

[TABLE]

and

[TABLE]

converge weakly (jointly) as processes indexed by $\ell^{\infty}(\mathbb{R})$ to

[TABLE]

for $\mathbb{G}_{1}$ and $\mathbb{G}_{2}$ independent Brownian bridge processes. The two processes are independent because the covariance between the processes is easily seen to be zero. Since the density of $G_{0}$ is bounded strictly away from zero on $[a,b]$ , $n^{1/2}([(1-\pi_{n})G_{n}]^{-}-[(1-\pi_{0})G_{0}]^{-1})$ converges weakly in $\ell^{\infty}(0,1)$ to $-\mathbb{G}_{2}/[(1-\pi_{0})g_{0}]\circ[\pi_{0}G_{0}]^{-1}$ by Lemma 3.9.23 of van der Vaart and Wellner, 1996. Hence, by Hadamard differentiability of the composition map (see Lemma 3.9.27 of van der Vaart and Wellner, 1996), the functional delta method yields

[TABLE]

converges weakly in $\ell^{\infty}[a,b]$ to

[TABLE]

so that $\sup_{z\in[a,b]}|L_{n}(z)-z|=o_{P}(n^{-1/2})$ . Hence, $n^{1/2}(\pi_{n}F_{n}\circ L_{n}-\pi_{0}F_{0})$ converges weakly to $\mathbb{G}_{1}\circ[\pi_{0}F_{0}]$ in $\ell^{\infty}[a,b]$ , and so $n^{1/2}(\tilde{H}_{n}-H_{0})$ converges weakly to $\mathbb{G}_{1}\circ[\pi_{0}F_{0}]+\mathbb{G}_{2}\circ[(1-\pi_{0})G_{0}]$ in $\ell^{\infty}[a,b]$ . Since $G_{0}$ and $F_{0}$ are both continuously differentiable on $[a,b]$ , so is $H_{0}$ , and since the derivative of $G_{0}$ is bounded away from zero, so is the derivative of $H_{0}$ . Therefore, using Lemma 3.9.23 of van der Vaart and Wellner, 1996 again, $n^{1/2}(\tilde{H}_{n}^{-}-H_{0}^{-})$ converges weakly in $\ell^{\infty}(0,1-\varepsilon)$ to $(\mathbb{G}_{1}\circ[\pi_{0}F_{0}]+\mathbb{G}_{2}\circ[(1-\pi_{0})G_{0}])\circ H_{0}^{-1}/(h_{0}\circ H_{0}^{-1})$ for any $\varepsilon>0$ , where $h_{0}:=H_{0}^{\prime}=\pi_{0}f_{0}+(1-\pi_{0})g_{0}$ . Then, using the functional delta method for composition again, we have that $n^{1/2}[G_{n}\circ\tilde{H}_{n}^{-}-G_{0}\circ H_{0}^{-1}]$ converges weakly to

[TABLE]

which we define as $\mathbb{G}_{3}$ . Now by Proposition 2.1 of Beare and Fang (2017),

[TABLE]

converges weakly to $\mathrm{LCM}_{[0,1],G_{0}\circ H_{0}^{-1}}^{\prime}(\mathbb{G}_{3})$ . Using Hadamard differentiability of composition once more, we have that

[TABLE]

converges weakly to

[TABLE]

If $f_{0}/g_{0}$ is strictly increasing on $[a,b]$ , then $G_{0}\circ H_{0}^{-1}$ is strictly concave on $[0,1]$ , in which case $\mathrm{LCM}_{[0,1],G_{0}\circ H_{0}^{-1}}^{\prime}$ is the identity operator by Proposition 2.2 of Beare and Fang (2017). Hence, in this case $n^{1/2}[G_{n}^{*}-G_{0}]$ converges weakly to

[TABLE]

which, as noted above, is the same limit distribution as $n^{1/2}[G_{n}-G_{0}]$ . Furthermore, we have

[TABLE]

When $f_{0}/g_{0}$ is strictly increasing so that $\mathrm{LCM}_{[0,1],G_{0}\circ H_{0}^{-1}}^{\prime}$ is the identity, the functional delta method (e.g. Theorem 3.9.4 of van der Vaart and Wellner, 1996) implies that

[TABLE]

Similarly, since as shown above, $n^{1/2}(\tilde{H}_{n}^{-}\circ\tilde{H}_{n}-\mathrm{Id})$ converges weakly in $[a,b]$ to 0, $n^{1/2}\|G_{n}\circ\tilde{H}_{n}^{-}\circ\tilde{H}_{n}-G_{n}\|_{\infty}=o_{P}(1)$ . Therefore, $n^{1/2}\|G_{n}^{*}-G_{n}\|_{\infty}=o_{P}(1)$ if $f_{0}/g_{0}$ is strictly increasing on $[a,b]$ .

Now we turn attention to $F_{n}^{*}$ . By Theorem 2, for each $k\in\{1,\dotsc,m_{2}\}$ , we know that $F_{n}^{*}(y_{k})=\mathrm{GCM}_{[0,h_{m_{2}}]}(F_{n}\circ H_{n}^{-})\circ H_{n}(y_{k})$ . We can extend the GCM operation to the entirety of $[0,1]$ , so that $F_{n}^{*}(y_{k})=\mathrm{GCM}_{[0,1]}(F_{n}\circ H_{n}^{-})\circ H_{n}(y_{k})$ , because the slope of the secant of $F_{n}\circ H_{n}^{-}$ from $h_{m_{2}}$ to $H_{n}(x_{j})$ for any $x_{j}>y_{m_{2}}$ is $[F_{n}(x_{j})-F_{n}(y_{m_{2}})]/[H_{n}(x_{j})-h_{m_{2}}]=1/\pi_{n}$ , while the slope of the secant from any other $z$ in the support of $H_{n}$ is $[F_{n}(y_{m_{2}})-F_{n}(z)]/[H_{n}(y_{m_{2}})-H_{n}(z)]\leq[F_{n}(y_{m_{2}})-F_{n}(z)]/[\pi_{n}\{F_{n}(y_{m_{2}})-F_{n}(z)\}]=1/\pi_{n}$ . Therefore, performing the GCM over $[0,1]$ rather than $[0,h_{m_{2}}]$ cannot change the value of the GCM for any $u\leq h_{m_{2}}$ .

We now define $F_{n}^{\ell}:=\mathrm{GCM}_{[0,1]}(F_{n}\circ H_{n}^{-})\circ H_{n}\circ L_{n}$ , where $L_{n}=G_{n}^{-}\circ G_{n}$ as above, so that $F_{n}^{\ell}$ is the right-continuous step function with jumps at $y_{1},\dotsc,y_{m_{2}}$ and agreeing with $F_{n}^{*}$ at these points. We similarly define $F_{n}^{u}:=\mathrm{GCM}_{[0,1]}(F_{n}\circ H_{n}^{-})\circ H_{n}\circ R_{n}$ , where $R_{n}:=G_{n}^{-}\circ\bar{G}_{n}$ for $\bar{G}_{n}(y):=\frac{1}{n}\sum_{i=1}^{n}I(Y_{i}<y)+1/n$ . Since the $Y_{j}$ ’s are unique with probability one, $\bar{G}_{n}$ is a left-continuous version of $G_{n}$ that agrees at $y_{1},\dotsc,y_{m_{2}}$ , and $\bar{G}_{n}\geq G_{n}$ . Therefore, since any MLE $F_{n}^{*}$ is a proper CDF, we have $F_{n}^{\ell}\leq F_{n}^{*}\leq F_{n}^{u}$ . One can show that $\|F_{n}^{\ell}-F_{n}\|_{\infty}=o_{P}(n^{-1/2})$ and $\|F_{n}^{u}-F_{n}\|_{\infty}=o_{P}(n^{-1/2})$ when $f_{0}/g_{0}$ is strictly increasing using the same argument as that used above for showing that $\|G_{n}^{*}-G_{n}\|_{\infty}=o_{P}(n^{-1/2})$ . We then have $\|F_{n}^{*}-F_{n}\|_{\infty}=o_{P}(n^{-1/2})$ as well. ∎

Proof of Theorem 5.

The conditions of Theorem 1 of Westling and Carone (2020) are satisfied by the uniform consistency of empirical distribution functions. ∎

Proof of Theorem 6.

This result follows by the delta method, as discussed in the text. ∎

Additional simulations: discrete case

We now present results from a numerical study of the properties of the maximum likelihood estimator in the case where both $F_{0}$ and $G_{0}$ are fully discrete. We set $F_{0}$ and $G_{0}$ as the distribution functions of Poisson random variables with rates 6 and 4, respectively, and we set $\pi_{0}$ to $0.4$ . We simulated 1000 datasets each for $n\in\{500,1000,5000,10000\}$ and estimated the maximum likelihood estimator $\theta_{n}^{*}$ , the empirical mass ratio function, defined as the ratio of the empirical mass functions of $X_{1},\dotsc,X_{n_{1}}$ and $Y_{1},\dotsc,Y_{n_{2}}$ , and the sample splitting estimators with $m\in\{5,10,20\}$ (Banerjee et al., 2019). We computed Wald-type confidence intervals (constructed around $\log\theta_{n}^{*}$ and exponentiated) using the asymptotic variance provided in Section 4.1 of the main text, likelihood ratio-based confidence intervals, and confidence intervals around the sample splitting estimators as outlined in Section 5 of the main text.
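For concreteness, one replicate of this design might be generated as in the sketch below. It assumes, as a labeled assumption, that each observation is drawn from $F_{0}$ with probability $\pi_{0}$ and from $G_{0}$ otherwise; the helper names `true_ratio`, `simulate` and `empirical_mass_ratio` are hypothetical, and the seed is arbitrary. The true ratio of two Poisson mass functions with rates 6 and 4 is $\theta_{0}(z)=e^{-2}(3/2)^{z}$ , which the first function evaluates.

```python
import numpy as np

def true_ratio(z, rate_f=6.0, rate_g=4.0):
    """theta_0(z): ratio of the Poisson(rate_f) and Poisson(rate_g) mass functions."""
    z = np.asarray(z, dtype=float)
    return np.exp(-(rate_f - rate_g)) * (rate_f / rate_g) ** z

def simulate(n, pi0=0.4, rate_f=6.0, rate_g=4.0, seed=0):
    """Draw one dataset; returns the X sample (from F_0) and the Y sample (from G_0)."""
    rng = np.random.default_rng(seed)
    d = rng.random(n) < pi0                    # assumed sampling scheme: D ~ Bernoulli(pi0)
    z = np.where(d, rng.poisson(rate_f, n), rng.poisson(rate_g, n))
    return z[d], z[~d]

def empirical_mass_ratio(x, y, grid):
    """Ratio of the empirical mass functions of x and y at each point of grid."""
    fx = np.array([np.mean(x == z) for z in grid])
    gy = np.array([np.mean(y == z) for z in grid])
    with np.errstate(divide="ignore", invalid="ignore"):
        return fx / gy                         # inf or nan where y has no mass

x, y = simulate(5000)
grid = np.arange(0, 7)
naive = empirical_mass_ratio(x, y, grid)
truth = true_ratio(grid)
```

The empirical mass ratio `naive` need not be monotone in any given sample, which is the motivation for comparing it with the order-constrained estimator.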

The left panel of Figure 6 displays the distribution of $\theta_{n}^{*}(z)-\theta_{0}(z)$ for $z\in\{0,1,\dotsc,6\}$ , and demonstrates that $\theta_{n}^{*}$ is approximately unbiased in large samples. The right panel of Figure 6 displays the ratio of the empirical standard deviation of $n^{1/2}[\theta_{n}^{*}(z)-\theta_{0}(z)]$ to the standard deviation based on the asymptotic theory, and demonstrates that the empirical standard deviation of $\theta_{n}^{*}(z)$ approaches the standard deviation defined by the limit theory as the sample size grows, and that $\theta_{n}^{*}(z)$ is more efficient than the limit theory suggests in smaller samples for small values of $z$ .

Figure 7 displays the ratio of the mean squared errors of the empirical and sample splitting estimators to that of the maximum likelihood estimator. For the empirical estimator, this ratio approaches one as sample size grows, which agrees with our theoretical result suggesting that the two estimators are asymptotically equivalent. However, in small samples, the maximum likelihood estimator has strictly smaller mean squared error than the empirical estimator. The mean squared errors of the sample splitting estimators also approach that of the maximum likelihood estimator as the sample size grows, which is consistent with existing theory for $n^{-1/2}$ -rate asymptotics.

Figure 8 shows the empirical coverage of 95% confidence intervals for $\theta_{0}(z)$ constructed using Wald-type confidence intervals with a plug-in standard error according to the results presented in Section 4.1 of the main text, the inverted likelihood ratio test approach of Banerjee and Wellner (2001), and the sample splitting approach of Banerjee et al. (2019) described in the main text. We note that the likelihood ratio approach does not provide intervals at the end point $z=0$ . The plug-in method is conservative in small samples, but its coverage approaches 95% for $z\neq 0$ as $n$ grows. The likelihood ratio method provides excellent coverage at all sample sizes. The sample splitting method has good coverage in large enough sample sizes.

Additional simulations: continuous case

We now present results from a numerical study of the properties of the maximum likelihood estimator in the case where both $F_{0}$ and $G_{0}$ are fully continuous. We set $F_{0}$ and $G_{0}$ as the distribution functions of exponential random variables with rates 1 and 2, respectively, and we set $\pi_{0}$ to $0.4$ . We simulated 1000 datasets each for $n\in\{500,1000,5000,10000\}$ and estimated the maximum likelihood estimator, the maximum smoothed likelihood estimator of Yu et al. (2017), the non-monotone estimator based on kernel density estimates for each $z\in\{0,0.1,\dotsc,1.9,2\}$ , and the sample splitting estimator with $m\in\{5,10,20\}$ (Banerjee et al., 2019). We constructed confidence intervals at each $z$ using the transformed plug-in and likelihood ratio-based methods described in Section 4.2 of the main text.
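As a quick worked check of the target in this design, the ratio of an exponential density with rate 1 to one with rate 2 is $\theta_{0}(z)=e^{-z}/(2e^{-2z})=e^{z}/2$ , which is strictly increasing on $[0,\infty)$ with $\theta_{0}(0)=1/2$ . The short sketch below evaluates it on the grid used in the study; the function name is illustrative.

```python
import math

def theta0_exponential(z, rate_f=1.0, rate_g=2.0):
    """Density ratio f_0(z) / g_0(z) for exponential densities with the given rates."""
    f = rate_f * math.exp(-rate_f * z)
    g = rate_g * math.exp(-rate_g * z)
    return f / g

# Evaluate on z = 0, 0.1, ..., 2.
grid = [k / 10 for k in range(21)]
values = [theta0_exponential(z) for z in grid]
```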

The left panel of Figure 9 displays the distribution of $\theta_{n}^{*}(z)-\theta_{0}(z)$ for $z\in[0,2]$ , and demonstrates that the sampling distribution of $\theta_{n}^{*}$ is approximately centered around $\theta_{0}(z)$ in large samples for $z>0$ . The right panel of Figure 9 displays the ratio of the empirical standard deviation of $n^{1/2}[\theta_{n}^{*}(z)-\theta_{0}(z)]$ to the standard deviation based on the asymptotic theory, and demonstrates that the empirical standard deviation of $\theta_{n}^{*}(z)$ approaches the standard deviation defined by the limit theory as the sample size grows.

Figure 10 displays the ratio of the mean squared errors of the maximum smoothed likelihood estimator, the kernel density estimator, and the sample splitting estimators to that of the maximum likelihood estimator. The maximum smoothed likelihood estimator is more efficient than the maximum likelihood estimator. The kernel density estimator is more efficient for some values of $z$ , but less efficient for others. In large enough samples, the sample splitting estimators are more efficient than the maximum likelihood estimator, but in smaller samples, they are less efficient for some values of $z$ . The sample size required for improvement grows with $m$ , as does the gain in asymptotic efficiency.

Finally, Figure 11 shows the empirical coverage of 95% confidence intervals for $\theta_{0}(z)$ constructed using Wald-type confidence intervals with a plug-in standard error according to the results presented in Section 4.2 of the main text, the inverted likelihood ratio test approach of Banerjee and Wellner (2001), and the sample splitting approach of Banerjee et al. (2019) described in the main text. The plug-in method is conservative in large enough samples due to the difficulty of accurately estimating the derivative of $\theta_{0}$ . The likelihood ratio method provides slightly conservative coverage at all sample sizes. The sample splitting method has excellent coverage for $m=5$ , but requires larger samples to have good coverage for $m=10$ .

Additional simulations: flat case with jumps

Here we present results from a numerical study of the properties of the various estimators in the case where $F_{0}$ and $G_{0}$ are mixed distributions, and $\theta_{0}=dF_{0}/dG_{0}$ is discontinuous. We set $F_{0}:=(2/3)F_{0}^{c}+(1/3)\delta_{0}$ , where $F_{0}^{c}$ is the uniform distribution on $[0,1]$ and $\delta_{0}$ is a discrete distribution with mass $1/6$ at $0$ , $1/3$ at $1/2$ , and $1/2$ at $1$ . We set $G_{0}:=(2/3)F_{0}^{c}+(1/3)\gamma_{0}$ , where $\gamma_{0}$ is a discrete distribution with mass $1/3$ each at 0, $1/2$ , and $1$ . We set $\pi_{0}$ to $0.4$ .

With these definitions, we have $\theta_{0}(x)=1/2$ for $x=0$ , $\theta_{0}(x)=1$ for $x\in(0,1)$ , and $\theta_{0}(x)=3/2$ for $x=1$ . Hence, $\theta_{0}$ has jumps at the extremal mass points $x=0$ and $x=1$ , and is flat between these mass points. Therefore, our large-sample theory does not cover this case for two reasons: because $\theta_{0}$ is flat in the interior, and because it is discontinuous at the boundaries.
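These values can be verified with exact arithmetic. The sketch below takes the first atom of $\delta_{0}$ to be at $0$ , as the stated value $\theta_{0}(0)=1/2$ implies, and computes $dF_{0}/dG_{0}$ as a ratio of point masses at the atoms and as a ratio of continuous densities elsewhere on $[0,1]$ . The names are illustrative only.

```python
from fractions import Fraction as Fr

# Point masses of the discrete components delta_0 (for F_0) and gamma_0 (for G_0).
ATOMS_F = {Fr(0): Fr(1, 6), Fr(1, 2): Fr(1, 3), Fr(1): Fr(1, 2)}
ATOMS_G = {Fr(0): Fr(1, 3), Fr(1, 2): Fr(1, 3), Fr(1): Fr(1, 3)}
W_CONT, W_DISC = Fr(2, 3), Fr(1, 3)   # mixture weights of the two components

def theta0_mixed(x):
    """dF_0/dG_0 at x in [0, 1] for the mixed design."""
    x = Fr(x)
    if not 0 <= x <= 1:
        raise ValueError("x must lie in [0, 1]")
    if x in ATOMS_G:
        # At an atom, the ratio of the two point masses.
        return (W_DISC * ATOMS_F[x]) / (W_DISC * ATOMS_G[x])
    # Elsewhere, both continuous parts have uniform density 1 on [0, 1].
    return (W_CONT * 1) / (W_CONT * 1)
```

Note that the atom at $1/2$ gives $\theta_{0}(1/2)=1$ , so $\theta_{0}$ jumps only at the two extremal mass points.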

We simulated 1000 datasets each for $n\in\{500,1000,5000,10000\}$ and estimated the maximum likelihood estimator, the maximum smoothed likelihood estimator of Yu et al. (2017), the non-monotone estimator based on kernel density estimates for each $z\in\{0,0.1,\dotsc,1.9,2\}$ , and the sample splitting estimator with $m\in\{5,10\}$ (Banerjee et al., 2019). We constructed confidence intervals at each $z$ using the transformed plug-in and likelihood ratio-based methods described in Section 4.2 of the main text. We were unable to use the plug-in method of constructing confidence intervals because it failed in this case due to the difficulty of estimating the derivative of a flat function.

Figure 12 displays the distribution of $\theta_{n}^{*}(z)-\theta_{0}(z)$ for $z\in[0,1]$ . The pattern is quite interesting. For $z\in\{0,1\}$ , the estimator appears to be centered around the truth. This suggests that the estimator may be consistent at mass points even if the function is discontinuous at these points or these points lie on the boundary of the domain. However, for $z\in(0,0.25)$ , the distribution of $\theta_{n}^{*}(z)$ is biased downward, and for $z\in(0.75,1)$ , the distribution is biased upward. This is likely due to the discontinuity of $\theta_{0}$ at 0 and 1: although the estimator is consistent for any $z\in(0,1)$ , in any finite sample the estimator is flat in a region of the discontinuity, which biases the finite-sample distribution of the estimator near these discontinuities. We will see below that this also makes inference in these areas challenging.

Figure 13 displays the ratio of the mean squared errors of the maximum smoothed likelihood estimator, the kernel density estimator, and the sample splitting estimators to that of the maximum likelihood estimator. The maximum smoothed likelihood estimator is comparable to the maximum likelihood estimator for $n\in\{500,1000\}$ , but is less efficient for most $z$ in larger samples. This is especially true for $z\in\{0,1\}$ , where the maximum likelihood estimator appears to benefit from the mass points. The kernel density estimator is less efficient, and in large samples much less efficient, for all $z$ except those very close to $0$ and $1$ . Somewhat surprisingly, the sample splitting estimators are less efficient than the maximum likelihood estimator except for $z$ near the mass points. This is likely due to the fact that the sample splitting estimators inherit the bias of the maximum likelihood estimator at a smaller sample size, and the maximum likelihood estimator is biased near the points of discontinuity.

Finally, Figure 14 shows the empirical coverage of 95% confidence intervals for $\theta_{0}(z)$ constructed using the inverted likelihood ratio test approach of Banerjee and Wellner (2001) and the sample splitting approach of Banerjee et al. (2019) described in the main text. None of the methods do well near $z\in\{0,1\}$ due to the bias of the estimators in these regions. The likelihood ratio method provides conservative coverage at all sample sizes for $z$ near $1/2$ , which is because it relies on limit theory that only holds when $\theta_{0}$ is strictly increasing. The sample splitting method has good coverage for $z$ at the mass points $\{0,1/2,1\}$ , but poor coverage otherwise.

Additional data analysis results

Figure 15 displays the empirical and likelihood ratio order maximum likelihood cumulative distribution function estimates of C-reactive protein for patients with bacterial infections and those without. Figure 16 displays the empirical and likelihood ratio order maximum likelihood ordinal dominance curve estimates for C-reactive protein.

Bibliography

1. Arcones, M. A. and Samaniego, F. J. (2000). On the Asymptotic Distribution Theory of a Class of Consistent Estimators of a Distribution Satisfying a Uniform Stochastic Ordering Constraint. Ann. Statist., 28(1):116–150.
2. Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol., 12(4):387–415.
3. Banerjee, M. (2007). Likelihood based inference for monotone response models. Ann. Statist., 35(3):931–956.
4. Banerjee, M., Durot, C., and Sen, B. (2019). Divide and conquer in nonstandard problems and the super-efficiency phenomenon. Ann. Statist., 47(2):720–757.
5. Banerjee, M. and Wellner, J. A. (2001). Likelihood ratio tests for monotone functions. Ann. Statist., 29(6):1699–1731.
6. Beare, B. K. and Fang, Z. (2017). Weak convergence of the least concave majorant of estimators for a concave distribution function. Electron. J. Statist., 11(2):3841–3870.
7. Beare, B. K. and Moon, J.-M. (2015). Nonparametric tests of density ratio ordering. Econom. Theory, 31(3):471–492.
8. Brunk, H. D. (1970). Estimation of isotonic regression. In Nonparametric Techniques in Statistical Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969), pages 177–197, London. Cambridge Univ. Press.
