Nonparametric maximum likelihood estimation under a likelihood ratio order
Ted Westling, Kevin J. Downes, Dylan S. Small

TL;DR
This paper develops a nonparametric maximum likelihood estimator for two distributions under the likelihood ratio order, applicable to various distribution types, with theoretical convergence results and practical applications.
Contribution
It introduces a novel nonparametric estimator for distributions and their density ratio under likelihood ratio order, extending to discrete, continuous, and mixed cases.
Findings
Estimator converges in distribution in certain cases
Numerical experiments validate the estimator's performance
Application to biomarker data demonstrates practical utility
Abstract
Comparison of two univariate distributions based on independent samples from them is a fundamental problem in statistics, with applications in a wide variety of scientific disciplines. In many situations, we might hypothesize that the two distributions are stochastically ordered, meaning intuitively that samples from one distribution tend to be larger than those from the other. One type of stochastic order that arises in economics, biomedicine, and elsewhere is the likelihood ratio order, also known as the density ratio order, in which the ratio of the density functions of the two distributions is monotone non-decreasing. In this article, we derive and study the nonparametric maximum likelihood estimator of the individual distributions and the ratio of their densities under the likelihood ratio order. Our work applies to discrete distributions, continuous distributions, and mixed…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Distribution Estimation and Applications · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference
Nonparametric maximum likelihood estimation
under a likelihood ratio order
Ted Westling1
Kevin J. Downes2,3,4
Dylan S. Small5
(1Department of Mathematics and Statistics, University of Massachusetts Amherst
2Center for Pediatric Clinical Effectiveness, Children’s Hospital of Philadelphia
3Division of Infectious Diseases, Children’s Hospital of Philadelphia
4Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania
5Department of Statistics, The Wharton School, University of Pennsylvania)
Abstract
Comparison of two univariate distributions based on independent samples from them is a fundamental problem in statistics, with applications in a variety of scientific disciplines. In many situations, we might hypothesize that the two distributions are stochastically ordered, meaning that samples from one distribution tend to be larger than those from the other. One type of stochastic order is the likelihood ratio order, in which the ratio of the density functions of the two distributions is monotone non-decreasing. In this article, we derive and study the nonparametric maximum likelihood estimator of the individual distribution functions and the ratio of their densities under the likelihood ratio order. Our work applies to discrete distributions, continuous distributions, and mixed continuous-discrete distributions. We demonstrate convergence in distribution of the estimator in certain cases, and we illustrate our results using numerical experiments and an analysis of a biomarker for predicting bacterial infection in children with systemic inflammatory response syndrome.
1 Introduction
Comparing the distributions of two independent samples is a fundamental problem in statistics. Suppose that and are independent real-valued samples with distribution functions and , respectively. In many situations, we might hypothesize that and are stochastically ordered, meaning intuitively that samples from tend to be larger than those from . A particular type of stochastic order that arises in many applications is the likelihood ratio order. Specifically, and satisfy a likelihood ratio order if the density ratio is monotone non-decreasing over the support of , where and for some dominating measure . For this reason, the likelihood ratio order is also called a density ratio order.
A likelihood ratio order can arise for a variety of scientific reasons (Beare and Moon, 2015; Roosen and Hennessy, 2004; Dykstra et al., 1995; Yu et al., 2017). In the biomedical sciences and elsewhere, the ratio of two density functions is an object of interest for describing the relative likelihood of a binary status indicator conditional on a covariate. If is a binary random variable, is a scalar random variable, , , and , then
[TABLE]
Therefore, the density ratio in this context may be interpreted as the relative odds of given to the overall odds of . Since the transformation is strictly increasing, monotonicity of the density ratio is equivalent to monotonicity of . One specific situation in which the representation given in (1) is of scientific interest is biomarker evaluation, wherein represents infection status and represents the value of a biomarker. Equation (1) implies that the ratio of the densities of biomarker values among infected patients to the same among uninfected patients can be interpreted as the odds ratio of infection given biomarker level relative to overall odds of infection. Monotonicity of the density ratio corresponds to the assumption that the conditional probability of infection given biomarker level increases with biomarker level, which is reasonable if the biomarker is actually predictive of disease.
In this article, we derive the nonparametric maximum likelihood estimators of , , and under the likelihood ratio order restriction and derive certain asymptotic properties of these estimators, including consistency and convergence in distribution. In particular, we use a connection between estimation of and the classical isotonic regression problem with a binary outcome, which both simplifies the derivation of large-sample results and suggests that existing inference methods for the isotonic regression problem can be used to perform inference for as well. Our results generalize those of Dykstra et al. (1995), who derived the maximum likelihood estimator of and under a likelihood ratio order in the special case where and are discrete distributions. We will illustrate our results using numerical experiments and an analysis of a biomarker for predicting bacterial infection in children with systemic inflammatory response syndrome.
Recently, Yu et al. (2017) considered estimation of a monotone density ratio and the individual density functions by maximizing a smoothed likelihood function, and demonstrated certain asymptotic properties of their estimator. Yu et al. (2017) considered maximizing a smoothed likelihood rather than maximizing the likelihood directly because the maximum likelihood estimator of the individual densities does not exist. In contrast, we will show that, using a definition of the likelihood ratio ordered model based on convexity of the ordinal dominance curve, a well-defined nonparametric maximum likelihood estimator of the monotone density ratio function and the individual distribution functions (rather than the density functions) does exist. Furthermore, unlike the smoothed estimator, the derivation of the maximum likelihood estimator does not require selection of a bandwidth or any other tuning parameter, and does not rely on the existence of Lebesgue density functions.
Additional relevant references include: Lehmann and Rojo (1992) and Shaked and Shanthikumar (2007), which contain more examples and details regarding stochastic orders, Carolan and Tebbs (2005) and Beare and Moon (2015), which studied tests of the likelihood ratio order, and Rojo and Samaniego (1991), Rojo and Samaniego (1993), Mukerjee (1996), Arcones and Samaniego (2000), Davidov and Herman (2012), and Tang et al. (2017), which considered testing and estimation under other stochastic orders.
2 Likelihood ratio orders
We observe two independent real-valued samples and , with distribution functions and , respectively. We define as the support of and as the support of . We denote , and by and the empirical distribution functions of and , respectively. We define as the unique values of , as the unique values of , and as the unique values of . Throughout, we assume that is fixed, but that is drawn from a Binomial distribution for some .
We let be the space of distribution functions on ; i.e. all non-decreasing, cádlàg functions such that and . For any nondecreasing function , we define its generalized-inverse pointwise as . When , is called the quantile function of . For any interval and any function , we define the greatest convex minorant (GCM) of on , denoted , for the extended real line, as the pointwise supremum of all convex functions on bounded above by . The least concave majorant operator is defined analogously. We say a function is convex over a set if for every and such that , . We also define as the left derivative operator for a left differentiable function and as the image of a function defined on a domain .
The unrestricted nonparametric model for the pair of distribution functions of the observed data is . As mentioned in the introduction, the likelihood ratio order can be defined as the ratio of the density functions and of and with respect to some dominating measure being non-decreasing. By varying the dominating measure, both discrete and continuous distributions can be handled this way. However, as noted by Yu et al. (2017), this definition does not lend itself to the derivation of a maximum likelihood estimator, since the likelihood defined through the densities can be made arbitrarily large. Instead, other authors have defined the likelihood ratio order as convexity of the ordinal dominance curve, defined as for (Bamber, 1975; Hsieh and Turnbull, 1996). Lehmann and Rojo (1992) demonstrated the equivalence of this definition to that using the density functions in the special case that and are strictly increasing and continuous on their supports, which were assumed to be intervals. Alternatively, Shaked and Shanthikumar (2007) defined the likelihood ratio order as for all measurable sets with , where (with some abuse of notation) and means that for all and .
In Theorem 1 below, we consolidate and generalize existing results connecting these different definitions of the likelihood ratio order.
Theorem 1**.**
If and is continuous on the support of , then (1) the following are equivalent: is convex on , is non-decreasing on , and for all measurable sets ; and (2) if is non-decreasing on then for all .
To our knowledge, Theorem 1 is the most general result to-date connecting the three definitions of the likelihood ratio ordered model. We note that the three definitions may not be equivalent when is not dominated by or is not continuous. For instance, in the proof of Theorem 1 part (1), we only use the assumption that is continuous on to show that is convex on implies that is non-decreasing, but we show that if is non-decreasing, then for all (the definition used in Shaked and Shanthikumar, 2007) regardless of whether is continuous. Additionally, we show that for all implies that is convex on regardless of whether or is continuous. However, to show the converse, we use . For a simple counterexample when is not dominated by , consider and , where . Then for , which is convex on , but for . Finally, when but is not continuous on , whether being convex on implies that for all measurable sets , or even all such Borel sets, is unclear to us.
Throughout the remainder of the article, we say satisfy a likelihood ratio order, and write if is convex on . We then define the likelihood ratio ordered model as all such that . For any , we further define as , where is defined as the set of non-negative, non-decreasing functions on . We note that this definition allows for the possibility that is not dominated by , but by Theorem 1, for all such that and is continuous on , on . We define .
In the context of the likelihood ratio order, many existing works either assume that and are discrete (e.g. Dykstra et al., 1995) or that and are continuous (e.g. Lehmann and Rojo, 1992; Yu et al., 2017). In the discrete setting, if and are discrete distributions with common support and mass functions and such that , then on . Alternatively, if and both possess Lebesgue density functions and and , then on . However, for the purpose of deriving a maximum likelihood estimator, we will demonstrate that these two cases do not need to be treated separately. Furthermore, in some applied settings, and are neither discrete nor continuous, but rather a mixture of discrete and continuous components, and we will derive results that apply in these situations as well. For instance, exposures that are bounded below may have positive mass at their lower boundary, and be continuous thereafter. Many biomarkers exhibit this property. Similarly, some measurements are “clumpy”, exhibiting positive mass at integers or other “round” numbers due to the measurement process, but also possessing positive Lebesgue density between such points. In all cases, has a meaningful interpretation as the ratio of the conditional odds of a sample being from the distribution to the unconditional odds of a sample being from .
3 Estimation under a likelihood ratio order
3.1 Maximum likelihood estimator
The pair determines the joint distribution of the observed data. Defining the nonparametric likelihood of the observed data as , the nonparametric maximum likelihood estimator of , i.e. in the model , is for the empirical distribution function of , and the same of . This suggests taking as an estimator of the plug-in estimator . The function is known as the empirical ordinal dominance curve, and is properties were studied by Hsieh and Turnbull (1996).
In this section, we demonstrate, amongst other results, that is the maximum likelihood estimator of in the likelihood ratio ordered model . A maximum likelihood estimator of in is defined as , and a maximum likelihood estimator of is defined as .
We define as the empirical distribution of the combined sample , and for . Our first result characterizes .
Theorem 2**.**
Let be the value at of the GCM over of and be the value at of the LCM over of . Then is a right-continuous step function with jumps at with and is given by a right-continuous step function with jumps at , where , and for any such that , where , the mass of at is given by
[TABLE]
For any such that , the mass of at is given by
[TABLE]
We also note that and .
A proof of Theorem 2, and proofs of all other theorems, are provided in Supplementary Material. We note that necessarily has jumps at all and at all such that , and has jumps at all . We also note that if there are such that no but , then there are infinitely many maximizers because any that assigns mass to the interval yields the same likelihood and satisfies the constraints. In these cases, for the sake of uniqueness, we will put mass at the point .
Theorem 2 implies the following result characterizing .
Corollary 1**.**
The points lie on the GCM over of the empirical ordinal dominance curve
[TABLE]
where . Specifically, if are the vertices of the GCM of , then are the vertices of the GCM of the empirical ordinal dominance curve. Therefore, is equal to .
Theorem 2 bears resemblance to, but is different than, Theorem 2.1 of Dykstra et al. (1995), which characterized the maximum likelihood estimator under a likelihood ratio order in the discrete case. Here, we perform the maximization over all pairs of univariate distribution functions such that is convex on the support of , whereas Theorem 2.1 of Dykstra et al. (1995) performed the maximization over with support contained in and such that is nondecreasing. The first set is strictly larger than the second, which results in possibly different maximum likelihood estimators. In particular, our maximum likelihood estimator is only supported on , whereas the maximum likelihood estimator of derived by Dykstra et al. (1995) may have support on that are not equal to any . This difference makes sense in the context of our respective problem formulations: Dykstra et al. (1995) assumed that the supports of and are subsets of , while we do not assume the supports are known a priori. In Supplementary Material, we illustrate the use of Theorem 2 using hypothetical data in which our maximum likelihood estimators and are different from those of Dykstra et al. (1995).
3.2 Representation as a transformation of isotonic regression
Dykstra et al. (1995) and Carolan and Tebbs (2005) provided representations of the maximum likelihood estimators of and in terms of isotonic regression in the discrete and continuous cases, respectively. Here, we show that can also be represented as a transformation of an isotonic regression, which aids in deriving its asymptotic properties. We let be independent Bernoulli random variables with common probability and such that . Letting be the indices such that for each , we then define for each . Similarly, letting be the indices such that for each , we define for each . Defining the data unit , observing the independent samples from and from is then equivalent to observing independent observations from , where satisfies
[TABLE]
Thus, represent the pooled values of , and each represents an indicator that corresponds to a sample from . Furthermore, , , and . Estimating given the independent samples and is therefore equivalent to estimating given independent observations from , where .
The benefit to the above reframing of the problem is that , , and can then be written as transformations of . First, we have that , where and is the odds transformation, defined as . Since is strictly increasing, is monotone if and only if is. Since the maximum likelihood estimator of under the assumption that is non-decreasing is given by the isotonic regression of on , and the maximum likelihood estimator of is given by , the maximum likelihood estimator of is then given by . It is straightforward to see that this form of the maximum likelihood estimator is equivalent to the forms given above. In the next section, we will utilize this form of to derive its asymptotic properties and to construct asymptotic confidence intervals.
4 Asymptotic results
4.1 Discrete distributions
We first consider the situation where has finite support and is strictly increasing on . The next result demonstrates that in this case, and are asymptotically equivalent to and , respectively, and is asymptotically equivalent to the ratio of the empirical masses on the support of .
Theorem 3** (Discrete distributions).**
Suppose that the support of is a finite set and that for , where . Then and with probability tending to one, so that with probability tending to one is a left-continuous step function with jumps at and and for . As a result, for
[TABLE]
where .
We note that the above result does not require that be discrete as well, or be dominated by . If is dominated by , then corresponds to the ratio of the mass functions.
4.2 Continuous distributions
Now we address the situation where and are both absolutely continuous on and , which now corresponds to the ratio of the density functions, is strictly increasing. We first consider the large-sample behavior of and .
Theorem 4**.**
Suppose that is supported on a bounded interval , that and possess continuous density functions and on such that is strictly increasing on , and on . Then and .
Theorem 4 demonstrates that when is strictly increasing, the maximum likelihood estimators of the individual distribution functions are asymptotically equivalent to the empirical distribution functions at the rate , and hence possess the same limit distributions as the empirical distribution functions. This result is proved using the functional delta method and the results of Beare and Fang (2017), who demonstrated that the LCM operation is a directionally Hadamard differentiable mapping at any concave function.
We now turn to large-sample results for at points where both and possess Lebesgue density functions and , respectively. First, consistency of implies consistency of .
Theorem 5** (Consistency).**
If is continuous at , is continuous at , and , then . If and are uniformly continuous on , then for any strict sub-interval .
We recall that, at any such that is positive and continuous in a neighborhood of , , and is continuously differentiable in a neighborhood of , it holds that
[TABLE]
where follows Chernoff’s distribution, defined as the point of maximum of for a two-sided standard Brownian motion originating from zero (Brunk, 1970; Groeneboom and Jongbloed, 2014). We can then use the delta-method to see that
[TABLE]
The scale parameter in the above limit distribution is equal to for
[TABLE]
This yields the following result.
Theorem 6** (Pointwise convergence in distribution).**
Suppose that, in a neighborhood of , is continuously differentiable with , and and are positive and continuous. Then
Theorem 6 reflects certain common tradeoffs in the monotonicity-constrained literature. Theorem 6 indicates that the non-smoothed estimator converges pointwise at the rate. In contrast, the smoothed estimator proposed by Yu et al. (2017) converges at the faster rate, albeit under stronger smoothness assumptions. While Yu et al. (2017) did not propose a method for conducting inference, smoothed estimators typically possess an asymptotic bias that complicates the task of performing valid inference. In contrast, the limit distribution in Theorem 6 has mean zero, which we can use to construct asymptotically valid confidence intervals. Defining as an estimator of and the quantile of , a Wald-type confidence interval for is given by . If , then this interval has asymptotic coverage of . The quantiles of were computed by Groeneboom and Wellner (2001), and in particular .
In practice, we recommend an alternative method to constructing confidence intervals for . We recommend first constructing confidence intervals for using either of two existing methods, then transforming these intervals into intervals for . Specifically, if represents a confidence interval for , then we take as a confidence interval for . Two existing ways to construct are Wald-type intervals with plug-in estimation of nuisance parameters and intervals based on likelihood ratio tests. The former intervals are analogous to the Wald-type interval, but based on the limit distribution for given in (2). Alternatively, confidence intervals obtained by inverting likelihood ratio tests, proposed first by Banerjee and Wellner (2001) and studied further by, e.g. Banerjee (2007) and Groeneboom and Jongbloed (2015), can be formed based on the limiting distribution of twice the log of the ratio of the likelihoods of the maximum likelihood estimator and a suitably constrained maximum likelihood estimator. Since this limiting distribution is pivotal, meaning it does not depend on any unknown features of the true distribution, this approach does not require estimating any unknown nuisance parameters. We therefore expect this method to have better finite-sample properties than intervals based on plug-in estimation of nuisance parameters.
5 Numerical studies
In Supplementary Material, we present results of two simulation studies in the cases where and are fully discrete and fully continuous. In short, these studies confirm the validity of our large-sample theory and demonstrate that the maximum likelihood estimator and various proposed methods of conducting inference perform well in both cases. Here, we present the results of a numerical study illustrating the behavior of when and are mixed discrete-continuous distributions. We note that our asymptotic results did not address the behavior of at mass points in mixed discrete-continuous distributions; to the best of our knowledge, no such results yet exist for monotone estimators. We use this numerical study to explore this important case.
We simulated as a mixed discrete-continuous random variable with probability 1/9 each of being 0, 0.5 and 1, and probability 2/3 of being from the uniform distribution on , and simulated as a mixed discrete-continuous random variable with probabilities 1/18, 1/9, and 3/18 of being 0, 0.5, and 1, respectively and probability 2/3 of being from the density function . We then have that for . We set . For each combined sample size , we simulated datasets, and in each dataset we computed the maximum likelihood estimator, the maximum smoothed likelihood estimator of Yu et al. (2017), and the non-monotone estimator based on kernel density estimates for each . We constructed confidence intervals at each using the transformed plug-in and likelihood ratio-based methods described in Section 4.2. To estimate the scale parameter in the limit distribution of as defined in equation 2, we used the plug-in estimator for and estimated using the derivative of a local linear smoother of evaluated at .
In addition to the properties of the estimators listed above, we also investigated the properties of the general sample-splitting procedure proposed by Banerjee et al. (2019). Given a generic monotone estimator of a monotone function such that for a mean-zero distribution with finite variance, Banerjee et al. (2019) proposed randomly splitting the sample into subsets of roughly equal size, computing monotone estimates in each subset, then defining . They demonstrated that if is fixed, then under mild conditions has strictly better asymptotic mean squared error than , and that for moderate , forms an asymptotic confidence interval for , where and is the quantile of the -distribution with degrees of freedom. Therefore, is preferable to for two reasons: it has better asymptotic mean squared error, and asymptotically valid pointwise confidence intervals for based on can be formed without estimating any nuisance parameters. They also studied the asymptotic properties of when grows with . In our simulation study, we considered the estimator defined as , where is the maximum likelihood estimator in the th subset, and the corresponding confidence intervals defined above. We only considered the situation where is fixed with the sample size.
We now turn to the results of the simulation study. The left panel of Figure 1 displays the distribution of for and . These distributions are approximately centered around 0 for , but not for . Hence, despite the positive mass at the boundaries, the maximum likelihood estimator does not appear to be consistent at the boundaries. This is a common problem among monotonicity-constrained estimators, and various correction procedures have been proposed and could be considered in this context (see, e.g. Woodroofe and Sun, 1993; Kulikov and Lopuhaä, 2006).
The right panel of Figure 1 displays the ratio of the standard deviation of to the standard deviation of the asymptotic distributions derived in Section 4 for . For , and the asymptotic distribution is that of the fully discrete case presented in Section 4.1, though we note that the results presented in that section do not apply here due to the mixed discrete-continuous nature of and here. Otherwise, and the asymptotic distribution is that of the continuous case presented in Section 4.2. We see that, for , the empirical standard error approaches the asymptotic standard deviation as grows. However, for , the empirical standard error is converging to a limit that is strictly smaller than the asymptotic standard deviation. This suggests that, at points that have both positive mass and positive density in a neighborhood of the point, the maximum likelihood estimator gains efficiency from the positive density. In addition, points of continuity near the mass point also experience finite-sample efficiency gains.
Figure 2 shows the ratio of the mean squared errors of the maximum smoothed likelihood estimator, the kernel density-based estimator, and the sample splitting estimators to that of the maximum likelihood estimator. The maximum smoothed likelihood estimator is slightly more efficient than the maximum likelihood estimator at continuity points, but is less efficient around mass points. Furthermore, the relative performance of the maximum likelihood estimator at positive mass points increases as the sample size grows. The kernel density estimator is generally less efficient than the maximum likelihood estimator, especially near mass points, and the discrepancy also grows with the sample size.
For large enough , the sample splitting estimator is more efficient than the maximum likelihood estimator at all points at which the latter is consistent. The relative improvement of grows with the number of splits , as does the sample size required for to outperform .
Figure 3 shows the empirical coverage of 95% confidence intervals for constructed using the plug-in method described in Section 4.2, the inverted likelihood ratio test approach of Banerjee and Wellner (2001), and the sample splitting approach of Banerjee et al. (2019) described above. We note that the likelihood ratio approach does not provide intervals at the end points or . The plug-in method is conservative in large samples near mass points, but anti-conservative at some points of positive density. This is because the plug-in method is designed to work when the distributions are fully continuous, and estimation of the required nuisance parameters in the limit distribution fails in the presence of mass points. The likelihood ratio method is conservative in smaller samples, but approaches nominal coverage in large samples for points of absolute continuity. The sample splitting method with has adequate coverage for all sample sizes except for close to the boundaries. The sample splitting method with (and similarly for , which is not shown) appears to require very large sample sizes to attain adequate coverage over a large range of . We note that the sample splitting methods was able to achieve good coverage in large samples at both interior absolutely continuous points and interior mass points, without the user specifying which points are which.
6 Analysis of C-reactive protein for predicting bacterial infection
In this section, we use the methods presented herein to assess the use of the biomarker C-reactive protein (CRP) for determining the presence or absence of bacterial infection in children with systemic inflammatory response syndrome (SIRS). The Optimizing Antibiotic Strategies in Sepsis (OASIS) II study enrolled a prospective observational cohort of children under the age of nineteen at the pediatric intensive care unit at The Children’s Hospital of Philadelphia from August 2012 to June 2016 (Downes et al., 2018). Patients were enrolled in the study if they presented signs of SIRS, were started on a new broad-spectrum antibiotic for suspected bacterial infection, and had blood cultures taken within six hours of SIRS onset. A primary goal of the study was to assess whether CRP, which had previously been found to be predictive of bacterial infection (Downes et al., 2017), could be used to determine when antibiotic therapy could be safely ended. Additional details of the study design and results of the primary analysis may be found in Downes et al. (2018).
We analyzed all patients in the OASIS II cohort with measured biomarkers and bacterial infection status to assess the odds of bacterial infection as a function of CRP value. Some patients had measurements at multiple episodes; since all such episodes were at least 30 days apart, we treated these episodes as independent of one another. We analyzed a total of CRP measurements among 443 unique patients, with bacterial infections among 191 unique patients and non-infections among 266 unique patients.
Since CRP has previously been found to be predictive of bacterial infection in this patient population, there is scientific reason to believe that the density ratio order holds. We therefore computed the MLE of the density ratio function and corresponding 95% likelihood ratio-based pointwise confidence intervals and the sample splitting estimator of Banerjee et al. (2019) with splits and corresponding 95% pointwise confidence intervals.
Figure 4 displays the estimated odds of bacterial infection given CRP value relative to the population odds of bacterial infection and 95% pointwise confidence intervals. We find that values of CRP under 1 are indicative of roughly quartered odds of infection relative to the population odds of infection, and values of CRP greater than 20 are indicative of roughly doubled odds of infection relative to the population odds. Values of CRP between 1 and 20 do not clearly indicate that a patient’s odds of infection are larger or smaller than the population odds.
7 Discussion
In this article, we have considered nonparametric maximum likelihood inference for the density ratio function and the individual distribution functions under the assumption that the density ratio is nondecreasing. We applied these methods to the analysis of the biomarker C-reactive protein for predicting bacterial infection in children with systemic inflammatory response syndrome. The methods apply broadly to biomarker analysis, as well as other areas of biomedical research.
One of our important contributions is the ability to deal with discrete, continuous, and mixed discrete-continuous distributions. Such distributions arise frequently in applied settings, and in particular in the context of biomarker analysis. Furthermore, we have demonstrated via numerical studies that sample splitting provides good pointwise confidence interval coverage without knowing which values correspond to discrete mass points and which correspond to points of Lebesgue continuity of the underlying densities, which is important because in practice analysts may not have such knowledge a priori. However, a theoretical treatment of the precise asymptotic behavior of the estimator at mass points remains unknown, and would be an interesting topic of future research.
Acknowledgments
The authors thank Craig Boge for help compiling the OASIS II data. The authors also gratefully acknowledge the constructive comments of the editors and anonymous reviewers and the support of the CDC Epicenters program (KJD), NICHD grant K23HD091365 (KJD), the Center for Pediatric Clinical Effectiveness and the Pediatric IDEAS Research Group of the Children’s Hospital of Philadelphia (KJD, TW), and the Department of Pediatrics of the University of Pennsylvania Perelman School of Medicine.
Example of the use of Theorem 2
We first illustrate the use of Theorem 2 and Corollary 1 using hypothetical data. Suppose that and . We first derive . The points are given by , and its GCM is given by . This is displayed in the upper left panel of Figure 5. The values of the GCM imply that , , , and . We then have that and . The estimators and are compared in the bottom left panel of Figure 5.
We next derive . The points are given by , and its LCM is given by . This is displayed in the center left panel of Figure 5. The values of the LCM imply that , , , and . The estimators and are compared in the bottom left panel of Figure 5.
Finally, we derive . The empirical ordinal dominance curve is given by the points , , , , , and the vertices of its GCM are given by , , . This is displayed in the bottom left panel of Figure 5. The left-hand slopes of the GCM are on the interval and on the interval , which implies that for and for . This is displayed in the bottom right panel of Figure 5.
We note that the maximum likelihood estimators of and of derived in Dykstra et al. (1995) for the fully discrete case are different than and . In particular, both and have jumps at all the unique values of the data with values , , , , and , and ; and , , , , , and . However, the maximum likelihood estimator is equal to for each .
Proof of Theorems
Proof of Theorem 1.
We first suppose that and is non-decreasing on , and we show that for all measurable . Recall that when is a set, and means that for all and . Since , we have that for all . We then have by Fubini’s Theorem that
[TABLE]
Now since is non-decreasing and for all and , we have
[TABLE]
Finally, applying Fubini’s Theorem again yields
[TABLE]
Next, we suppose that for all measurable , and we show that is convex on . Let , where and for . We then let and , which are both Borel sets satisfying since is necessarily non-decreasing. We then have and similarly . In addition, since for any , we also have and similarly . We then have by assumption that
[TABLE]
Therefore, , which implies that , which shows that is convex on .
Finally, we suppose that , is convex on , and is continuous on , and we show that is nondecreasing on . This is the most difficult of the three implications. The basic argument amounts to using convexity of to compare the slopes of chords or sequences of chords, and to relate these slopes to values of . Let with . Suppose that we can find sequences and such that converges to , converges to , and for all large enough. Then, by convexity of , for all large enough, which implies that . The exact form of and depends on how looks near and . In particular, there are three cases for : (1) and there exists such that ; (2) but there is no such that ; and (3) . We begin by specifying in each case.
In case (1), we take for all . Since , we must have , so that for all . In case (2), it must be that . In this case, there exists increasing to such that for each , increases to and increases to . We then have that increases to , so that increases to . In case (3), we first note that since . Additionally, since , there exist in with for each that either (a) increases to and for each , or (b) decreases to and for each . In either case, we have
[TABLE]
For any , by continuity of over , we can find such that implies for all . If (a) holds and is bounded above, we then have for all , so that then . If is not bounded above then , so that trivially. If (b) holds then is bounded below by zero, so by a similar calculation .
The three cases for are similar: (1) and there exists such that ; (2) but there is no such ; and (3) . In case (1), we take for all . Since , we must have , so that for all . In case (2), it must be that , and again there exists an increasing sequence increasing to such that for each , increases to and increases to . We then have that increases to , so that increases to . In case (3), , and since , there exists in with for each that either (a) increases to and for each , or (b) decreases to and for each . If (a) holds and is bounded above, then converges to by continuity of as before. If is not bounded above then converges to . If (b) holds then is bounded below by zero, so again .
Of the nine pairings of cases for and cases for , the only situation in which it is not immediately clear that for all large enough is that decreases to (case 3b) and for all (case 1). However, we note that if and only if , which would imply that case (3b) cannot hold for . Therefore, if decreases to and , then , so that for all large enough. This completes the argument.
Finally, we address statement (2) of the result: we suppose that and is continuous and non-decreasing on , and we show that on . By (1), is convex on . First, we claim that , where takes the following form. For any , . If , then there exists and such that . We then define . Thus, is the linear interpolation of to . In order to show that indeed equals , we need to show that (a) is convex, (b) , and (c) for any other convex minorant of .
For (a), we let and for . There then exist which are all elements of and such that , , and , and furthermore , , and . The remainder of the argument is best seen with a diagram. Let be the point , be the point , and so on. By convexity of , the line segment lies below or on the line segment , which lies below or on , which lies below or on . Therefore, , which falls on , is no greater than , which falls on .
For (b), by definition, for any . If , then , and hence . As a result, .
We have now shown that is a convex minorant of . For (c), if is another convex minorant of , then clearly for all . If , then . If , then . If , then there must be an such that for all , so that for each , where and as . Taking the limit as , we have that .
We now have that , so it remains to show that for all . First, if , then for all for . Therefore, for all such , so that . If instead and then and it is straightforward to see from the definition of that . ∎
Proof of Theorem 2.
We first note that for any such that for any . As a result, we may restrict our attention to such that for all , which implies that has support at each . For any such , we define , where . We then have for each . Furthermore, the support of is is contained in the support of , for each , and is by assumption convex on the support of . Therefore, is convex on the support of , so that and . Hence, we may further restrict our attention to which are discrete with jumps at . By a similar argument, we can restrict our attention to which are discrete with jumps at or .
We define , and , so that the support of for any discrete with jumps at is , and . Defining and the number of such that , we have . We then define for each , and we note that if and only if . Suppose that the values are fixed in such a way as to satisfy these constraints. We denote by for , where , and by the number of such that . Noting that are disjoint with union , we then have
[TABLE]
Additionally, for each , we must have that . Therefore, maximizing with respect to with fixed amounts to maximizing subject to for each . This implies that a maximizer must satisfy
[TABLE]
for each . Therefore, is proportional to for , which is the number of in the interval .
We note that if there are such that no but , then there are infinitely many maximizers because any that assigns mass to the interval yields the same likelihood and satisfies the constraints. In these cases, for the sake of uniqueness we will put mass at the point .
We have at this point reduced the problem to maximizing
[TABLE]
subject to and . Letting for , this is equivalent to maximizing
[TABLE]
subject to and . The term involving is maximized for .
From this point we take a similar approach to that in Dykstra et al. (1995). We define , and , so that and . Optimizing with with respect to and such that and is equivalent to optimizing
[TABLE]
such that , , and , where and .
Now, such that is maximized for . Next, maximizing with respect to is equivalent to maximizing
[TABLE]
for and . By Theorem 2.1 and Exercise 2.21 of Groeneboom and Jongbloed (2014), the maximizer of this expression over all is given by the weighted isotonic regression of with weights . By Lemma 2.1 of Groeneboom and Jongbloed (2014), is equal to the left derivative of the GCM of the set of points
[TABLE]
evaluated at . We note that . Therefore, we have that for all such that and such that . Since and also satisfy , this implies that is an optimizer of over the set of stated constraints.
We now have that and . Since , this implies that and , where is the value of the GCM of the set of points defined above at . We note that , for the value of the GCM of evaluated at . Additionally, for the value of the LCM of at . ∎
Proof of Corollary 1.
From the proof of Theorem 2, we have that and . Let denote the indices of the vertices of the GCM of . Then for each and . It is also straightforward to see that is a convex minorant of if and only if is a convex minorant of . Therefore, form the vertices of the GCM of . ∎
Proof of Theorem 3.
We note that for each with probability tending to one. Then, since the support of is finite, with probability tending to one the empirical ODC is a left-continuous step function with vertices at , where we note that almost surley. We define
[TABLE]
which is positive by assumption. We then have
[TABLE]
Now since is uniformly consistent for and is uniformly consistent for , and for each , the second through fifth lines above are uniformly over . Therefore,
[TABLE]
which implies that
[TABLE]
for all with probability tending to one. Therefore, with probability tending to one, the diagram of points , is convex. Now by Corollary 1, the points lie on the GCM of , . But these points being convex means that they are equal to their GCM, so that with probability tending to one and for each . We can then see by Theorem 2 that with probability tending to one as well, so that with probability tending to one. We then have with probability tending to one that
[TABLE]
for each . since the GCM of is with probability tending to one piecewise linear with knots at the and , we then have that with probability tending to one that is a left-continuous step function with jumps at the . Also, since for and for all , for .
We now have
[TABLE]
Using the notation introduced in Section 3.2, this can be written as
[TABLE]
where for , , , and , and . By the Central Limit Theorem,
[TABLE]
where the element of the covariance matrix equals . Applying the delta method to the function yields (after some algebra) , where equals
[TABLE]
∎
Proof of Theorem 4.
We note that for all in implies that
[TABLE]
which implies that is non-decreasing on . Therefore,
[TABLE]
is non-increasing on . Hence, is concave on , so
[TABLE]
We now note that since for any , . Furthermore, since only jumps at , we have
[TABLE]
for any , where for .
Using the notation of Section 3.2, we can write
[TABLE]
for , and similarly for . We also have and . By standard empirical process theory, we therefore have that
[TABLE]
and
[TABLE]
converge weakly (jointly) as processes indexed by to
[TABLE]
for and independent Brownian bridge processes. The two processes are independent because the covariance between the processes is easily seen to be zero. Since the density of is bounded strictly away from zero on , converges weakly in to by Lemma 3.9.23 of van der Vaart and Wellner, 1996. Hence, by Hadamard differentiability of the composition map (see Lemma 3.9.27 of van der Vaart and Wellner, 1996), the functional delta method yields
[TABLE]
converges weakly in to
[TABLE]
so that . Hence, converges weakly to in , and so converges weakly to in . Since and are both continuously differentiable on , so is , and since the derivative of is bounded away from zero, so is the derivative of . Therefore, using Lemma 3.9.23 of van der Vaart and Wellner, 1996 again, converges weakly in to for any , where . Then, using the functional delta method for composition again, we have that converges weakly to
[TABLE]
which we define as . Now by Proposition 2.1 of Beare and Fang (2017),
[TABLE]
converges weakly to . Using Hadamard differentiability of composition once more, we have that
[TABLE]
converges weakly to
[TABLE]
If is strictly increasing on , then is strictly concave on , in which case is the identity operator by Proposition 2.2 of Beare and Fang (2017). Hence, in this case converges weakly to
[TABLE]
which, as noted above, is the same limit distribution as . Furthermore, we have
[TABLE]
When is strictly increasing so that is the identity, the functional delta method (e.g. Theorem 3.9.4 of van der Vaart and Wellner, 1996) implies that
[TABLE]
Similarly, since as shown above, converges weakly in to 0, . Therefore, if is strictly increasing on .
Now we turn attention to . By Theorem 2, for each , we know that . We can extend the GCM operation to entirety of , so that , because the slope of the secant of from to for any is , while the slope of the secant from any other in the support of is . Therefore, performing the GCM over rather than cannot change the value of the GCM for any .
We now define , where as above, so that is the right-continuous step function with jumps at and agreeing with at these points. We similarly define , where for . Since the ’s are unique with probability one, is a left-continuous version of that agrees at , and . Therefore, since any MLE is a proper CDF, we have . One can show that and when is strictly increasing using the same argument as that used above for showing that . We then have as well. ∎
Proof of Theorem 5.
The conditions of Theorem 1 of Westling and Carone (2020) are satisfied by the uniform consistency of empirical distribution functions. ∎
Proof of Theorem 6.
This result follows by the delta method, as discussed in the text. ∎
Additional simulations: discrete case
We now present results from a numerical study of the properties of the maximum likelihood estimator in the case where both and are fully discrete. We set and as the distribution functions of Poisson random variables with rates 6 and 4, respectively, and we set to . We simulated 1000 datasets each for and estimated the maximum likelihood estimator , the empirical mass ratio function, defined as the ratio of the empirical mass functions of and , and the sample splitting estimators with (Banerjee et al., 2019). We computed Wald-type confidence intervals (constructed around and exponentiated) using the asymptotic variance provided in Section 4.1 of the main text, likelihood ratio-based confidence intervals, and confidence intervals around the sample splitting estimators as outlined in Section 5 of the main text.
The left panel of Figure 6 displays the distribution of for , and demonstrates that is approximately unbiased in large samples. The right panel of Figure 6 displays the ratio of the empirical standard deviation of to the standard deviation based on the asymptotic theory, and demonstrates that the empirical standard deviation of approaches the standard deviation defined by the limit theory as the sample size grows, and that is more efficient than the limit theory suggests in smaller samples for small values of .
Figure 7 displays the ratio of the mean squared errors of the empirical and sample splitting estimators to that of the maximum likelihood estimator. For the empirical estimator, this ratio approaches one as sample size grows, which agrees with our theoretical result suggesting that the two estimators are asymptotically equivalent. However, in small samples, the maximum likelihood estimator has strictly smaller mean squared error than the empirical estimator. The mean squared errors of the sample splitting estimators also approach that of the maximum likelihood estimator as the sample size grows, which is concurrent with existing theory for -rate asymptotics.
Figure 8 shows the empirical coverage of 95% confidence intervals for constructed using Wald-type confidence intervals with a plug-in standard error according to the results presented in Section 4.1 of the main text, the inverted likelihood ratio test approach of Banerjee and Wellner (2001), and the sample splitting approach of Banerjee et al. (2019) described in the main text. We note that the likelihood ratio approach does not provide intervals at the end point . The plug-in method is conservative in small samples, but its coverage approaches 95% for as grows. The likelihood ratio method provides excellent coverage at all sample sizes. The sample splitting method has good coverage in large enough sample sizes.
Additional simulations: continuous case
We now present results from a numerical study of the properties of the maximum likelihood estimator in the case where both and are fully continuous. We set and as the distribution functions of exponential random variables with rates 1 and 2, respectively, and we set to . We simulated 1000 datasets each for and estimated the maximum likelihood estimator, the maximum smoothed likelihood estimator of Yu et al. (2017), the non-monotone estimator based on kernel density estimates for each , and the sample splitting estimator with (Banerjee et al., 2019). We constructed confidence intervals at each using the transformed plug-in and likelihood ratio-based methods described in Section 4.2 of the main text.
The left panel of Figure 9 displays the distribution of for , and demonstrates that the sampling distribution of is approximately centered around in large samples for . The right panel of Figure 9 displays the ratio of the empirical standard deviation of to the standard deviation based on the asymptotic theory, and demonstrates that the empirical standard deviation of approaches the standard deviation defined by the limit theory as the sample size grows.
Figure 10 displays the ratio of the mean squared errors of maximum smoothed likelihood estimator, the kernel density estimator, and the sample splitting estimators to the maximum likelihood estimator. The maximum smoothed likelihood estimator is more efficient than the maximum likelihood estimator. The kernel density estimator is more efficient for some values of , but less efficient for others. In large enough samples, the sample splitting estimators are more efficient than the maximum likelihood estimator, but in smaller samples, they are less efficient for some values of . The sample size required for improvement grows with , as does the gain in asymptotic efficiency.
Finally, Figure 11 shows the empirical coverage of 95% confidence intervals for constructed using Wald-type confidence intervals with a plug-in standard error according to the results presented in Section 4.2 of the main text, the inverted likelihood ratio test approach of Banerjee and Wellner (2001), and the sample splitting approach of Banerjee et al. (2019) described in the main text. The plug-in method is conservative in large enough samples due to the difficulty of accurately estimating the derivative of . The likelihood ratio method provides slightly conservative coverage at all sample sizes. The sample splitting method has excellent coverage for , but requires larger samples to have good coverage for .
Additional simulations: flat case with jumps
Here we present results from a numerical study of the properties of the various estimators in the case where and are mixed distributions, and is discontinuous. We set , where is the uniform distribution on and is a discrete distribution with mass at [math], at , and at . We set , where is a discrete distribution with mass each at 0, , and . We set to .
With these definitions, we have for , for , and for . Hence, has jumps at the extremal mass points and , and is flat between these mass points. Therefore, our large-sample theory does not cover this case for two reasons: because is flat in the interior, and because it is discontinuous at the boundaries.
We simulated 1000 datasets each for and estimated the maximum likelihood estimator, the maximum smoothed likelihood estimator of Yu et al. (2017), the non-monotone estimator based on kernel density estimates for each , and the sample splitting estimator with (Banerjee et al., 2019). We constructed confidence intervals at each using the transformed plug-in and likelihood ratio-based methods described in Section 4.2 of the main text. We were unable to use the plug-in method of constructing confidence intervals because it failed in this case due to the difficulty of estimating the derivative of a flat function.
Figure 12 displays the distribution of for . The pattern is quite interesting. For , the estimator appears to be centered around the truth. This suggests that the estimator may be consistent at mass points even if the function is discontinuous at these points or these points lie on the boundary of the domain. However, for , the distribution of is biased downward, and for , the distribution is biased upward. This is likely due to the discontinuity of at 0 and 1: although the estimator is consistent for any , in any finite sample the estimator is flat in a region of the discontinuity, which biases the finite-sample distribution of the estimator near these discontinuities. We will see below that this also makes inference in these areas challenging.
Figure 13 displays the ratio of the mean squared errors of maximum smoothed likelihood estimator, the kernel density estimator, and the sample splitting estimators to the maximum likelihood estimator. The maximum smoothed likelihood estimator is comparable to the maximum likelihood estimator for , but is less efficient for most in larger samples. This is especially true for , where the maximum likelihood estimator appears to benefit from the mass points. The kernel density estimator is less efficient, and in large samples much less efficient, for all except those very close to [math] and . Somewhat surprisingly, the sample splitting estimators are less efficient than the maximum likelihood estimator except for near the mass points. This is likely due to the fact that the sample splitting estimator inherit the bias of the maximum likelihood estimator at a smaller sample size, and the maximum likelihood estimator is biased near the points of discontinuity.
Finally, Figure 14 shows the empirical coverage of 95% confidence intervals for constructed using the inverted likelihood ratio test approach of Banerjee and Wellner (2001) and the sample splitting approach of Banerjee et al. (2019) described in the main text. None of the methods do well near due to the bias of the estimators in these regions. The likelihood ratio method provides conservative coverage at all sample sizes for near , which is because it relies on limit theory that only holds when is strictly increasing. The sample splitting method has good coverage for at the mass points , but poor coverage otherwise.
Additional data analysis results
Figure 15 displays the empirical and likelihood ratio order maximum likelihood cumulative distribution function estimates of C-reactive protein for patients with bacterial infections and those without. Figure 16 displays the empirical and likelihood ratio order maximum likelihood ordinal dominance curve estimates for C-reactive protein.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Arcones and Samaniego (2000) Arcones, M. A. and Samaniego, F. J. (2000). On the Asymptotic Distribution Theory of a Class of Consistent Estimators of a Distribution Satisfying a Uniform Stochastic Ordering Constraint. Ann. Stat. , 28(1):116–150.
- 2Bamber (1975) Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol. , 12(4):387 – 415.
- 3Banerjee (2007) Banerjee, M. (2007). Likelihood based inference for monotone response models. Ann. Statist. , 35(3):931–956.
- 4Banerjee et al. (2019) Banerjee, M., Durot, C., and Sen, B. (2019). Divide and conquer in nonstandard problems and the super-efficiency phenomenon. Ann. Statist. , 47(2):720–757.
- 5Banerjee and Wellner (2001) Banerjee, M. and Wellner, J. A. (2001). Likelihood ratio tests for monotone functions. Ann. Statist. , 29(6):1699–1731.
- 6Beare and Fang (2017) Beare, B. K. and Fang, Z. (2017). Weak convergence of the least concave majorant of estimators for a concave distribution function. Electron. J. Statist. , 11(2):3841–3870.
- 7Beare and Moon (2015) Beare, B. K. and Moon, J.-M. (2015). Nonparametric tests of density ratio ordering. Econom. Theory , 31(3):471–492.
- 8Brunk (1970) Brunk, H. D. (1970). Estimation of isotonic regression. In Nonparametric Techniques in Statistical Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969) , pages 177–197, London. Cambridge Univ. Press.
