Model-free posterior inference on the area under the receiver operating characteristic curve
Zhe Wang, Ryan Martin

TL;DR
This paper introduces a model-free Gibbs posterior approach for estimating the AUC of binary classifiers, avoiding restrictive distributional assumptions and providing reliable credible intervals, with strong empirical performance.
Contribution
It develops a novel model-free Gibbs posterior method for AUC inference, addressing limitations of traditional binormality-based approaches.
Findings
Gibbs posterior achieves accurate AUC estimation without distributional assumptions
Credible intervals from the Gibbs posterior have nominal frequentist coverage
Method outperforms existing rank likelihood-based approaches in simulations and real data
Abstract
The area under the receiver operating characteristic curve (AUC) serves as a summary of a binary classifier's performance. Methods for estimating the AUC have been developed under a binormality assumption which restricts the distribution of the score produced by the classifier. However, this assumption introduces an infinite-dimensional nuisance parameter and can be inappropriate, especially in the context of machine learning. This motivates us to adopt a model-free Gibbs posterior distribution for the AUC. We present the asymptotic Gibbs posterior concentration rate, and a strategy for tuning the learning rate so that the corresponding credible intervals achieve the nominal frequentist coverage probability. Simulation experiments and a real data analysis demonstrate the Gibbs posterior's strong performance compared to existing methods based on a rank likelihood.
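Table 1. Simulation results for Example 1.

| Sample size | Bias (Gibbs) | Bias (BRL) | Standard Error (Gibbs) | Standard Error (BRL) | Mean Length (Gibbs) | Mean Length (BRL) | Coverage Prob. (Gibbs) | Coverage Prob. (BRL) |
|---|---|---|---|---|---|---|---|---|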
| 25 | 0.002 | 0.016 | 0.035 | 0.043 | 0.134 | 0.165 | 0.902 | 0.972 |
| 50 | 0.000 | 0.007 | 0.026 | 0.026 | 0.103 | 0.102 | 0.922 | 0.931 |
| 75 | 0.000 | 0.003 | 0.021 | 0.020 | 0.084 | 0.076 | 0.939 | 0.894 |
| 100 | 0.000 | 0.003 | 0.018 | 0.016 | 0.070 | 0.063 | 0.935 | 0.879 |
| 125 | 0.001 | 0.010 | 0.017 | 0.014 | 0.067 | 0.055 | 0.940 | 0.857 |
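
Table 2. Simulation results for Example 2.

| Sample size | Bias (Gibbs) | Bias (BRL) | Standard Error (Gibbs) | Standard Error (BRL) | Mean Length (Gibbs) | Mean Length (BRL) | Coverage Prob. (Gibbs) | Coverage Prob. (BRL) |
|---|---|---|---|---|---|---|---|---|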
| 25 | 0.005 | 0.022 | 0.020 | 0.035 | 0.072 | 0.132 | 0.997 | 0.949 |
| 50 | 0.001 | 0.006 | 0.015 | 0.017 | 0.058 | 0.065 | 0.912 | 0.904 |
| 75 | 0.000 | 0.001 | 0.013 | 0.012 | 0.051 | 0.047 | 0.919 | 0.902 |
| 100 | 0.000 | 0.002 | 0.012 | 0.010 | 0.046 | 0.040 | 0.931 | 0.907 |
| 125 | 0.000 | 0.004 | 0.011 | 0.009 | 0.043 | 0.036 | 0.944 | 0.861 |
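
Table 3. Simulation results for Example 3.

| Sample size | Bias (Gibbs) | Bias (BRL) | Standard Error (Gibbs) | Standard Error (BRL) | Mean Length (Gibbs) | Mean Length (BRL) | Coverage Prob. (Gibbs) | Coverage Prob. (BRL) |
|---|---|---|---|---|---|---|---|---|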
| 25 | 0.002 | 0.020 | 0.065 | 0.064 | 0.255 | 0.246 | 0.919 | 0.922 |
| 50 | 0.002 | 0.016 | 0.046 | 0.044 | 0.180 | 0.173 | 0.933 | 0.900 |
| 75 | 0.000 | 0.011 | 0.037 | 0.035 | 0.145 | 0.138 | 0.921 | 0.887 |
| 100 | 0.000 | 0.008 | 0.032 | 0.030 | 0.126 | 0.117 | 0.936 | 0.897 |
| 125 | 0.001 | 0.003 | 0.029 | 0.027 | 0.113 | 0.104 | 0.934 | 0.890 |
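
Table 4. Simulation results for Example 4.

| Sample size | Bias (Gibbs) | Bias (BRL) | Standard Error (Gibbs) | Standard Error (BRL) | Mean Length (Gibbs) | Mean Length (BRL) | Coverage Prob. (Gibbs) | Coverage Prob. (BRL) |
|---|---|---|---|---|---|---|---|---|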
| 25 | 0.000 | 0.025 | 0.066 | 0.063 | 0.258 | 0.243 | 0.925 | 0.902 |
| 50 | 0.000 | 0.024 | 0.045 | 0.043 | 0.176 | 0.168 | 0.937 | 0.844 |
| 75 | 0.001 | 0.020 | 0.037 | 0.033 | 0.144 | 0.130 | 0.930 | 0.788 |
| 100 | 0.000 | 0.003 | 0.032 | 0.028 | 0.125 | 0.109 | 0.942 | 0.861 |
| 125 | 0.000 | 0.020 | 0.029 | 0.026 | 0.112 | 0.100 | 0.938 | 0.803 |

Table 5. Real data analysis: results from two Gibbs posteriors and two BRL posteriors.

| | Gibbs1 | Gibbs2 | BRL1 | BRL2 |
|---|---|---|---|---|
| Posterior mean | 0.705 | 0.705 | 0.691 | 0.697 |
| Standard error | 0.045 | 0.046 | 0.046 | 0.041 |
| Credible interval | (0.615, 0.795) | (0.615, 0.796) | (0.598, 0.774) | (0.612, 0.775) |
| Learning rate | 0.052 | 0.051 | — | — |
Zhe Wang and Ryan Martin (Department of Statistics, North Carolina State University)
Keywords and phrases: credible interval; Gibbs posterior; generalized Bayesian inference; model misspecification; robustness.
1 Introduction
First proposed during World War II to assess the performance of radar receiver operators (Calì and Longobardi 2015), the receiver operating characteristic (ROC) curve is now an essential tool for analyzing the performance of binary classifiers in areas such as signal detection (Green and Swets 1966), psychology examination (Swets 1973, 1986), radiology (Lusted 1960; Hanley and McNeil 1982), medical diagnosis (Swets and Pickett 1982; Hanley 1989), and data mining (Spackman 1989; Fawcett 2006). One informative summary of the ROC curve is the corresponding area under the curve (AUC). This measure provides an overall assessment of a classifier's performance, independent of the choice of threshold, and is, therefore, the preferred method for evaluating classification algorithms (Provost and Fawcett 1997; Provost et al. 1998; Bradley 1997; Huang and Ling 2005). The AUC is an unknown quantity, and our goal is to use the information contained in the data to make inference about the AUC.
The specific setup is as follows. Consider a binary classifier that produces a random score to indicate the propensity for, say, Group 1: individuals with scores higher than a threshold are classified to Group 1, and the rest are classified to Group 0. Let $U$ and $V$ be independent scores corresponding to Group 1 and Group 0, respectively. Given a threshold $t$, define the specificity and sensitivity as $\mathsf{spec}(t) = P(V \le t)$ and $\mathsf{sens}(t) = P(U > t)$. Then the ROC curve is a plot of the parametric curve $\bigl(1-\mathsf{spec}(t), \mathsf{sens}(t)\bigr)$ as $t$ takes all possible values for scores. While the ROC curve summarizes the classifier's tradeoff between sensitivity and specificity as the threshold varies, the AUC measures the probability of correctly ordering the scores of two individuals, one from each group, which equals $P(U > V)$ (Bamber 1975), and is independent of the choice of threshold. Consequently, the AUC is a functional of the joint distribution of $(U, V)$, denoted by $P$, so the ROC curve is actually not needed to identify the AUC.
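The identity between the area under the ROC curve and the probability that a Group 1 score exceeds a Group 0 score can be checked numerically. The following is a minimal sketch, not code from the paper: the function names, the toy scores, and the convention of counting ties as one half are all assumptions made for illustration.

```python
def empirical_auc(u, v):
    """Fraction of pairs (u_i, v_j) with u_i > v_j; ties count one half (an assumption)."""
    wins = sum((ui > vj) + 0.5 * (ui == vj) for ui in u for vj in v)
    return wins / (len(u) * len(v))


def roc_points(u, v):
    """Points (1 - spec(t), sens(t)) of the empirical ROC curve.

    Thresholds are placed at the observed scores, from largest to smallest,
    and a score at or above the threshold is assigned to Group 1.
    """
    points = [(0.0, 0.0)]
    for t in sorted(set(u) | set(v), reverse=True):
        false_positive_rate = sum(vj >= t for vj in v) / len(v)
        true_positive_rate = sum(ui >= t for ui in u) / len(u)
        points.append((false_positive_rate, true_positive_rate))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for Group 1 (u) and Group 0 (v).
u = [0.9, 0.8, 0.4]
v = [0.7, 0.3, 0.2]
print(empirical_auc(u, v), trapezoid_area(roc_points(u, v)))
```

Both numbers agree, which is why the ROC curve itself is not needed to estimate the AUC.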
In the context of inference on the AUC, when the scores are continuous, it is common to assume that $(U, V)$ satisfies a so-called binormality assumption, which states that there exists a monotone increasing transformation that maps both $U$ and $V$ to normal random variables (Hanley 1988). For most medical diagnostic tests, where the classifiers are simple and ready-to-use without training, such an assumption serves well (Hanley 1988; Metz et al. 1998; Cai and Moskowitz 2004), although it has been argued that other distributions can be more appropriate for some specific tests (e.g., Guignard and Salehi 1983; Goddard and Hinberg 1990). But for complicated classifiers which involve multiple predictors, as often arise in machine learning applications, binormality (or any other model assumption, for that matter) becomes a burden. This motivates our pursuit of a “model-free” approach to inference about the AUC.
Specifically, our goal is the construction of a type of posterior distribution for the AUC. The most familiar such construction is via Bayes’s formula, but this requires a likelihood function and, hence, a statistical model. The only way one can be effectively “model-free” within a Bayesian framework is to make the model extra flexible, which requires lots of parameters. In the extreme case, a so-called Bayesian nonparametric approach would take the distribution itself as the model parameter. When the model includes lots of parameters, then the analyst has the burden of specifying prior distributions for these, based on little or no genuine prior information, and also computation of a high-dimensional posterior. But since the AUC is just a one-dimensional feature of this complicated set of parameters, there is no obvious return on the investment into prior specification and posterior computation. A better approach would be to construct the posterior distribution for the AUC directly, using available prior information about the AUC only, without specifying a model and without the introduction of artificial model parameters. That way, the data analyst can avoid the burdens of prior specification and posterior computation, bias due to model misspecification, and issues that can arise as a result of non-linear marginalization (e.g., Martin 2019; Fraser 2011).
As an alternative to the traditional Bayesian approach, we consider here the construction of a so-called Gibbs posterior for the AUC. In general, the Gibbs posterior construction proceeds by defining the quantity of interest as the minimizer of a suitable risk function, treating an empirical version of that loss function like a negative log-likelihood, and then combining with a prior distribution very much like in Bayes’s formula. General discussion of Gibbs posteriors can be found in Zhang (2006a, b), Bissiri et al. (2016) and Alquier et al. (2016); statistical applications are discussed in Jiang and Tanner (2008) and Syring and Martin (2017, 2019a, 2019b). Again, the advantage is that the Gibbs posterior avoids model misspecification bias and the need to deal with unimportant nuisance parameters. Moreover, under suitable conditions, Gibbs posteriors can be shown to have desirable asymptotic concentration properties, with theory that parallels that of Bayesian posteriors under model misspecification (e.g., Kleijn and van der Vaart 2006, 2012).
A subtle point is that, while the risk minimization problem that defines the quantity of interest is independent of the scale of the loss function, the Gibbs posterior is not. This scale factor is often referred to as the learning rate (e.g., Grünwald 2012) and, because it controls the spread of the Gibbs posterior, its specification needs to be handled carefully. Various approaches to the specification of the learning rate parameter have been proposed (e.g., Grünwald 2012; Grünwald and Van Ommen 2017; Bissiri et al. 2016; Holmes and Walker 2017; Lyddon et al. 2019). Here we adopt the approach in Syring and Martin (2019a) that aims to set the learning rate so that, in addition to its robustness to model misspecification and asymptotic concentration properties, the Gibbs posterior credible sets have the nominal frequentist coverage probability. When the sample size is large, we recommend an (asymptotically) equivalent calibration method that is simpler to compute.
The present paper is organized as follows. In Section 2.1, we review some methods for making inference on the AUC based on the binormality assumption, in particular, the Bayesian approach in Gu and Ghosal (2009) that involves a suitable rank-based likelihood. In Section 2.2, we argue that the binormality assumption is generally inappropriate in machine learning applications, and provide one illustrative example involving a support vector machine. This difficulty with model specification leads us to the Gibbs posterior, a model-free alternative to a Bayesian posterior, which is reviewed in Section 2.3. We develop the Gibbs posterior for inference on the AUC, derive its asymptotic concentration properties, and investigate how to properly scale the risk function in Section 3. Simulation experiments are carried out in Section 4, where a Gibbs posterior estimator performs favorably compared with the Bayesian approach based on a rank-based likelihood. We also apply the Gibbs posterior on a real dataset for evaluating the performance of a biomarker for pancreatic cancer and compare our result with those based on the rank likelihood. Finally, we give some concluding remarks in Section 5.
2 Background
2.1 Binormality and related methods
Following Hanley (1988), the scores and satisfy the binormality assumption if their distribution functions are and respectively, where , , is a monotone increasing function, and denotes the distribution function, which implies that and can be transformed to and via . If denotes the distribution of under this assumption, then the ROC curve and the AUC, respectively, are given by and
[TABLE]
Even though the monotone transformation is not needed to define the AUC (only the normal parameters are), the joint distribution of the scores does depend on it, so any likelihood-based method would have to deal with this infinite-dimensional nuisance parameter. Several strategies have been used to avoid dealing with the transformation directly. The semi-parametric approach in Cai and Moskowitz (2004) works with an equivalent density-ratio formulation and introduces a cumulative hazard function as a nuisance parameter; a profile likelihood is then obtained based on a discrete estimate of the cumulative hazard function. In the approach of Metz et al. (1998), data are suitably grouped and a multinomial pseudo-likelihood is constructed. Alternatively, since data ranks are invariant to monotone transformations, one can construct a rank-based likelihood, as in Zou and Hall (2000), which can be maximized over the model parameters to estimate the AUC. But it turns out that a Bayesian approach that uses Monte Carlo sampling from a rank-based posterior distribution, as in Gu and Ghosal (2009), is computationally more efficient than maximizing the rank likelihood. Since this is our proposed method’s primary competitor, we give some details about Gu and Ghosal’s Bayesian rank-based likelihood approach here.
Consider the transformed scores and , according to the binormality assumption, its joint distribution can be written as , no more dependence on . Elimination of the nuisance parameter is desirable, but are unavailable to us without knowledge of . That is, unless we consider a function of that is invariant to transformations by . A good candidate function is the ranks. That is, let denote the ranks of the vector , where and are independent and identically distributed (iid) copies of and , respectively. Then
[TABLE]
where is the ranks of , with and . The key is that the observed ranks based on the sample can be plugged in for on the right-hand side of (2) and that gives a likelihood function for , without requiring knowledge of . Of course, this is not a proper likelihood function, i.e., there is loss of information caused by throwing away the values of , but eliminating the infinite-dimensional nuisance parameter might be worth the price, especially when the goal is inference on the ROC curve or AUC, neither of which depend directly on . The approach outlined in Gu and Ghosal (2009) proceeds by treating the values as latent variables and defining a full posterior for , given , and then marginalizing out to get a posterior distribution for alone. If we take the Jeffreys prior for , which is proportional to , then the full conditional distribution presented in Gu and Ghosal (2009) are
[TABLE]
where, e.g., , denotes the inverse gamma distribution with density , and denotes the indicator function. With these full conditionals, it is straightforward to develop a Monte Carlo strategy that produces samples from the posterior distribution. These samples can then be used to get a posterior distribution for AUC using the expression in (1).
2.2 Validity of binormality in machine learning applications
Before ROC and AUC analysis was introduced to the machine learning area, the binormality assumption had been proposed and used in the context of medical diagnosis for simple classifiers, where the scores $U$ and $V$ are determined by a single predictor variable. When assuming binormality for classifiers in machine learning (e.g., Brodersen et al. 2010; Macskassy and Provost 2004), the situation differs because multiple predictor variables are usually involved.
Suppose that a binary indicates the group, is the predictors, and denotes the joint distribution of . As described in Section 1, a binary classifier provides a parametric form of the predictor, namely the score , to indicate the propensity for taking value 1. A training process is generally needed for estimating the unknown based on a set of observations . Let the estimator be denoted as . It follows that the random score for Group 1 is defined as where the predictor follows the conditioned distribution . Similarly, the random score for Group 0 is defined as where . By assuming that converges to a non-random quantity when the size of the training set goes to infinity, and are asymptotically independent.
The binormality assumption for simple classifiers, which are special cases where and , only requires the and to be normals (after the transformation ). In the general case, where , even if every one of the predictors obeys the binormality assumption, the scores for two groups are still not guaranteed to satisfy the binormality, since can take virtually any form.
For example, consider two independent and identically distributed predictors and , given different groups ( or [math]), the predictors are distributed as or , respectively. For training data (Figure 1(a)), copies of are generated for each group. A support vector machine with radial basis function kernel is applied to this non-linearly separable dataset and correspondingly the predicted scores for another new data copies under the same data generating scheme are recorded as and . Then the unique monotone increasing transformation which transforms to be standard normal is approximated by , where is the empirical distribution. The histogram of in Figure 1(b) does not agree with the fitted normal density. And a Q-Q plot in Figure 1(c) for samples also suggest there is no such which transforms and to a model that satisfies the binormality assumption.
2.3 Gibbs posterior distributions
A Gibbs posterior distribution resembles a Bayesian posterior, but is constructed using different ingredients. In particular, the Gibbs posterior does not start with a statistical model and likelihood; it starts with a more general connection between data and quantities of interest, through a loss function. Suppose that data $T_1, \ldots, T_n$ are identically distributed $\mathbb{T}$-valued observations from distribution $P$, and that there is some functional $\theta = \theta(P)$, taking values in $\Theta$, about which inference is desired. Instead of introducing a statistical model for $P$ (that is, assuming $P$ takes a particular distributional form $P_\gamma$ for some model parameter $\gamma$, and then expressing $\theta$ as a function of $\gamma$), we construct a posterior for $\theta$ directly as follows. Assume that there exists a loss function $\ell_\theta(t)$, mapping $\Theta \times \mathbb{T}$ to $\mathbb{R}$, such that the true value, $\theta^\star$, of $\theta$ solves the optimization problem
$$\theta^\star = \arg\min_{\theta \in \Theta} R(\theta), \tag{3}$$
where the risk function $R(\theta) = P\ell_\theta = \int \ell_\theta(t)\,P(dt)$ is just the expected loss with respect to $P$. When the quantity of interest is defined as the solution to an optimization problem, it makes sense to estimate that quantity by solving an empirical version of the optimization problem,
$$\hat\theta_n = \arg\min_{\theta \in \Theta} R_n(\theta),$$
where the empirical risk $R_n(\theta) = \mathbb{P}_n\ell_\theta = n^{-1}\sum_{i=1}^n \ell_\theta(T_i)$ is the expected loss with respect to the empirical distribution $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{T_i}$, with $\delta_t$ the point-mass distribution concentrated at $t$. From this empirical risk function, the Gibbs posterior distribution is defined as
$$\Pi_n(d\theta) \propto \exp\{-\omega\, n\, R_n(\theta)\}\, \Pi(d\theta), \tag{4}$$
where $\Pi$ is a prior distribution on $\Theta$ and $\omega > 0$ is a scale parameter to be determined; see Bissiri et al. (2016) for the decision-theoretic underpinnings of this approach.
For us, the motivation behind the use of a Gibbs posterior is that it gives us direct, model-free posterior inference about the quantity of interest. This is beneficial because, for one thing, a statistical model could be misspecified and that would generally bias the results. But even if the model is correctly specified, it is unlikely that an appropriate statistical model could be described in terms of $\theta$ alone, so the model index would include a number of nuisance parameters that require prior distribution specification and posterior computation, efforts that are effectively wasted if marginal inference on $\theta$ is the goal. The Gibbs posterior, by targeting $\theta$ directly, avoids the possible misspecification bias, allows for prior beliefs about $\theta$ to be readily accommodated, and does not require dealing with nuisance parameters. And the applications presented in Syring and Martin (2017, 2019a, 2019b), along with the one presented here, suggest that this direct approach has a number of important advantages over the more traditional Bayesian counterpart.
Of course, the magnitude of the loss function does not affect the solution to the optimization problem in (3), nor that in the empirical version thereof. But the magnitude does affect the Gibbs posterior in (4), which is why we include the scaling factor $\omega$. Data-driven strategies for specifying this tuning parameter are discussed in Section 3.3 below.
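To make the general recipe concrete, here is a minimal sketch of a Gibbs posterior evaluated on a grid: empirical risk, times the sample size and a learning rate, exponentiated and combined with a prior. The function name `gibbs_posterior_grid`, the absolute-error loss (which targets a median), the toy data, and the grid are all hypothetical choices for illustration and are not taken from the paper.

```python
import math


def gibbs_posterior_grid(data, loss, grid, omega, prior=lambda theta: 1.0):
    """Normalized Gibbs posterior weights on a grid of candidate values.

    weight(theta) is proportional to exp(-omega * n * R_n(theta)) * prior(theta),
    where R_n is the average loss over the data.
    """
    n = len(data)
    risks = [sum(loss(theta, t) for t in data) / n for theta in grid]
    risk_min = min(risks)  # subtracted for numerical stability only
    weights = [math.exp(-omega * n * (r - risk_min)) * prior(theta)
               for r, theta in zip(risks, grid)]
    total = sum(weights)
    return [w / total for w in weights]


def absolute_loss(theta, t):
    """Absolute-error loss; its risk is minimized at a median."""
    return abs(theta - t)


data = [1.0, 2.0, 3.0, 10.0]
grid = [0.5 * k for k in range(25)]  # 0.0, 0.5, ..., 12.0

for omega in (0.2, 2.0):
    post = gibbs_posterior_grid(data, absolute_loss, grid, omega)
    mass = sum(p for p, theta in zip(post, grid) if 2.0 <= theta <= 3.0)
    print(f"omega={omega}: posterior mass on [2, 3] = {mass:.3f}")
```

The minimizer of the empirical risk does not move when the learning rate changes, but the posterior mass near it does, which is the scale dependence described above.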
3 Gibbs posterior for the AUC
3.1 Definition
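As mentioned, the AUC is a functional of the joint distribution $P$ of $(U, V)$, i.e., $\theta = \theta(P)$, given by $\theta(P) = P(U > V)$. Recall that the data consists of independent copies $U_1, \ldots, U_m$ and $V_1, \ldots, V_n$ of $U$ and $V$, respectively. To construct a Gibbs posterior distribution for $\theta$ as discussed above, we need an appropriate loss function. That is, we need a function $\ell_\theta(u, v)$ such that the corresponding risk function, $R(\theta) = P\ell_\theta$, is minimized at the true AUC, $\theta^\star = P(U > V)$. If we define
$$\ell_\theta(u, v) = \{\theta - 1(u > v)\}^2, \quad \theta \in [0, 1],$$
then it is easy to check that
$$R(\theta) = \theta^2 - 2\theta^\star\theta + \theta^\star = (\theta - \theta^\star)^2 + \theta^\star(1 - \theta^\star)$$
and, moreover, that this risk function is uniquely minimized at $\theta = \theta^\star$. Then the empirical risk function is
$$R_{m,n}(\theta) = \mathbb{P}_{m,n}\ell_\theta = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n \{\theta - 1(U_i > V_j)\}^2,$$
where $\mathbb{P}_{m,n}$ is the empirical distribution of the score pairs $(U_i, V_j)$. Note that the minimizer of the empirical risk function, namely,
$$\hat\theta_{m,n} = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n 1(U_i > V_j),$$
is the familiar statistic suggested by Mann and Whitney (1947) for testing if one of two independent random variables is stochastically larger than the other.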
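The link between the empirical risk and the Mann–Whitney statistic can be verified directly. The sketch below assumes the squared-error loss on the pairwise indicator, (θ − 1{u > v})², for which the empirical risk equals (θ − θ̂)² + θ̂(1 − θ̂), with θ̂ the proportion of pairs in which the Group 1 score exceeds the Group 0 score. Function names and the toy scores are hypothetical.

```python
def mann_whitney(u, v):
    """Proportion of pairs (u_i, v_j) with u_i > v_j."""
    return sum(ui > vj for ui in u for vj in v) / (len(u) * len(v))


def empirical_risk(theta, u, v):
    """Average of (theta - 1{u_i > v_j})^2 over all m * n score pairs."""
    total = sum((theta - (ui > vj)) ** 2 for ui in u for vj in v)
    return total / (len(u) * len(v))


# Hypothetical scores for Group 1 (u) and Group 0 (v).
u = [2.1, 0.4, 1.7, 3.0]
v = [0.9, 1.8, -0.2]

theta_hat = mann_whitney(u, v)

# Brute-force minimization of the empirical risk over a fine grid on [0, 1].
grid = [k / 1000 for k in range(1001)]
grid_minimizer = min(grid, key=lambda theta: empirical_risk(theta, u, v))

print(theta_hat, grid_minimizer)
```

Because the empirical risk is a quadratic in θ, no numerical optimization is needed in practice; the grid search is only there to confirm the closed form.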
Following the general approach described in Section 2.3, we can construct a Gibbs posterior distribution for the AUC, with density
$$\pi_{m,n}(\theta) \propto \exp\{-\omega\, mn\, R_{m,n}(\theta)\}\, \pi(\theta), \quad \theta \in [0, 1], \tag{6}$$
where $\pi$ is some prior density for the AUC, and $\omega > 0$ is the learning rate to be specified in Section 3.3. This Gibbs posterior does not require any model assumptions, does not require marginalization over nuisance parameters, and can directly incorporate available prior information about $\theta$. Moreover, the Gibbs posterior is approximately centered around $\hat\theta_{m,n}$, which is a quality estimator of the AUC, regardless of what form the underlying distribution $P$ takes, so we can expect the Gibbs posterior, for suitable $\omega$, to provide quality model-free inference. Details on the asymptotic concentration properties of the Gibbs posterior are presented in the next section.
After some simple algebra, the Gibbs posterior above can be re-expressed as
$$\pi_{m,n}(\theta) \propto \exp\{-\omega\, mn\, (\theta - \hat\theta_{m,n})^2\}\, \pi(\theta), \quad \theta \in [0, 1],$$
which shows some resemblance to a truncated normal distribution. A very reasonable choice of prior is a normal distribution truncated to $[0, 1]$, with informative choices of prior location $a$ and scale $b$. With this choice, the Gibbs posterior is a truncated normal distribution too, with corresponding location and scale, respectively,
$$\frac{2\omega mn\,\hat\theta_{m,n} + a\,b^{-2}}{2\omega mn + b^{-2}} \quad \text{and} \quad (2\omega mn + b^{-2})^{-1/2}.$$
In the absence of prior information about the AUC, one can take a flat uniform prior, $\pi(\theta) \equiv 1$ on $[0, 1]$, in which case the Gibbs posterior is still a truncated normal distribution but with location and scale, respectively,
$$\hat\theta_{m,n} \quad \text{and} \quad (2\omega mn)^{-1/2}.$$
In practice, we recommend the use of any available prior information about the AUC whenever possible, but, for the rest of this paper, we will work with the Gibbs posterior based on the default uniform prior.
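Since the Gibbs posterior here is a normal distribution truncated to [0, 1], summaries can be computed in closed form. The sketch below assumes the posterior is proportional to exp{−ω·m·n·(θ − θ̂)²} times the prior, which gives scale (2ωmn)^(−1/2) under the flat prior. It reports an equal-tailed interval, a simplification relative to a highest posterior density interval; the two coincide when truncation is negligible. The inputs 0.705 and 0.052 are the Gibbs1 posterior mean and learning rate reported in the real-data table; the group sizes 90 and 51 are assumed here as the commonly reported sizes for that dataset. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gibbs_location_scale(theta_hat, omega, m, n, prior_loc=None, prior_scale=None):
    """Location and scale of the truncated normal Gibbs posterior.

    With no prior arguments a flat prior on [0, 1] is used; otherwise the
    prior is a truncated normal with the given location and scale.
    """
    data_precision = 2.0 * omega * m * n
    if prior_loc is None:
        return theta_hat, data_precision ** -0.5
    prior_precision = prior_scale ** -2
    precision = data_precision + prior_precision
    location = (data_precision * theta_hat + prior_precision * prior_loc) / precision
    return location, precision ** -0.5


def truncated_normal_quantile(p, location, scale, lower=0.0, upper=1.0):
    """Quantile of a normal(location, scale) distribution truncated to [lower, upper]."""
    a = STD_NORMAL.cdf((lower - location) / scale)
    b = STD_NORMAL.cdf((upper - location) / scale)
    return location + scale * STD_NORMAL.inv_cdf(a + p * (b - a))


def equal_tailed_interval(location, scale, alpha=0.05):
    """Equal-tailed 100(1 - alpha)% credible interval on [0, 1]."""
    return (truncated_normal_quantile(alpha / 2, location, scale),
            truncated_normal_quantile(1 - alpha / 2, location, scale))


location, scale = gibbs_location_scale(theta_hat=0.705, omega=0.052, m=90, n=51)
lower, upper = equal_tailed_interval(location, scale)
print(round(scale, 3), round(lower, 3), round(upper, 3))
```

Under these assumptions the scale and interval come out close to the Gibbs1 column of the real-data table, which is a useful sanity check on the `m * n` scaling in the exponent.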
Here we are concerned with inference on AUC for a given classifier, and consequently the posterior is constructed directly for the AUC. Ridgway et al. (2014) also construct a Gibbs posterior using AUC, but their goal is to find a classifier that maximizes AUC.
3.2 Asymptotic concentration properties
It is natural to ask what kind of asymptotic concentration properties the Gibbs posterior distribution enjoys. An advantage of our approach’s simplicity is the ease with which the convergence properties can be deduced, but some care is needed in formulating the asymptotic regime precisely. Indeed, since the two groups may have different sample sizes, it is clear that what we need is for the smaller of the two sample sizes to go to infinity. Therefore, the rate is determined by $\min(m, n)$, and the following theorem states that, under no conditions on the joint distribution of $(U, V)$, the Gibbs posterior distribution concentrates asymptotically around the true AUC at the rate $\min(m, n)^{-1/2}$.
Theorem 1.
Let $\theta^\star$ be the true AUC corresponding to the joint distribution $P$, and assume, without loss of generality, that $m \le n$. If $\Pi_{m,n}$ is the Gibbs posterior defined in (6) based on a fixed learning rate $\omega > 0$ and a prior density $\pi$ that is positive and continuous in an interval containing $\theta^\star$, then for any sequence $M_m \to \infty$,
$$\Pi_{m,n}\bigl(\{\theta : |\theta - \theta^\star| > M_m\, m^{-1/2}\}\bigr) \to 0 \quad \text{in $P$-probability as } m \to \infty.$$
Proof.
See Appendix A. ∎
Several remarks on the concentration rate theorem, its consequences, and some related results are in order.
- The convergence in $P$-probability conclusion in Theorem 1 can be strengthened to convergence with $P$-probability 1 by assuming that the sample sizes for the two groups increase at the same rate. Under this condition, Korolyuk and Borovskich (2013, Chap. 3.2) give the almost-sure result for the Mann–Whitney statistic that is needed and, with this, the stronger Gibbs posterior concentration rate result can be proved along lines similar to those in Appendix A below.
- As shown in (4), the Gibbs posterior resembles a Bayesian posterior based on a suitably misspecified model, one whose “likelihood function” equals $\exp\{-\omega\, n\, R_n(\theta)\}$. Even in misspecified cases, Bernstein–von Mises-style distributional approximations are possible; see, e.g., Kleijn and van der Vaart (2012). In our case, we immediately see a truncated normal form of the Gibbs posterior, so as long as $\theta^\star$ is in the interior of $[0, 1]$, the asymptotic normality of the Gibbs posterior is automatic.
- We note that the loss scale $\omega$ controls the proportion of information in the Gibbs posterior that is learned from the data. Consequently, it is reasonable to adjust $\omega$ so that a larger set of observations is given more trust. In fact, if we replace the fixed $\omega$ in Theorem 1 with a sequence that vanishes sufficiently slowly as the sample sizes grow, then the Gibbs posterior concentration rate result still holds.
3.3 Tuning the learning rate
The good behavior of a Bayesian posterior is guaranteed only when the model is correctly specified. Under misspecification, even if the posterior concentrates around an efficient estimator, the asymptotic variance of the posterior could be drastically different from that of the efficient estimator; see Kleijn and van der Vaart (2012). Consequently, $100(1-\alpha)$% credible regions from a misspecified Bayes model may not achieve the nominal $100(1-\alpha)$% confidence, even asymptotically. Fortunately, the Gibbs posterior learning rate parameter, $\omega$, which controls the spread, can be tuned in such a way that this undesirable discrepancy between credibility and confidence is avoided. Various tuning strategies are available in the literature (e.g., Bissiri et al. 2016; Fasiolo et al. 2017; Lyddon et al. 2019; Grünwald 2012), but only the approach presented in Syring and Martin (2019a) focuses directly on coverage probability, so that is the approach we will adopt here.
Algorithm 1 describes the calibrating procedure from Syring and Martin (2019a) in the context of inference on the AUC. The rationale behind this algorithm is as follows. Take a $100(1-\alpha)$% credible interval based on the Gibbs posterior (6) with learning rate $\omega$, in particular, the highest posterior density credible interval. Then the frequentist coverage probability of that credible interval, call it $c_\alpha(\omega)$, depends on $\omega$, $P$, and other things. If we could evaluate $c_\alpha(\omega)$, that is, if we knew $P$ and could directly simulate from $P$, then we could just solve the equation $c_\alpha(\omega) = 1 - \alpha$. For future reference, in this ideal case, we call the solution to this equation the oracle learning rate. In real applications, however, $P$ is unknown, so we cannot evaluate $c_\alpha(\omega)$ exactly, but we can get an estimate using the bootstrap, and then solve that equation using stochastic approximation (Robbins and Monro 1951) with step size sequence $(\kappa_t)$ that satisfies
$$\sum_{t} \kappa_t = \infty \quad \text{and} \quad \sum_{t} \kappa_t^2 < \infty. \tag{7}$$
Details are discussed in Syring and Martin (2019a).
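The calibration idea described above (estimate coverage by the bootstrap, then adjust the learning rate by stochastic approximation) can be sketched as follows. This is an illustrative sketch only, not the authors' Algorithm 1, and it rests on several assumptions: the flat-prior Gibbs posterior is treated as normal with scale (2ωmn)^(−1/2) and its interval is simply clipped to [0, 1]; the update is applied to log ω to keep the learning rate positive; and the step sizes (t + 1)^(−0.51), the starting value, and the bootstrap and iteration counts are arbitrary choices. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def mann_whitney(u, v):
    """Proportion of pairs (u_i, v_j) with u_i > v_j."""
    return sum(ui > vj for ui in u for vj in v) / (len(u) * len(v))


def gibbs_interval(theta_hat, omega, m, n, alpha=0.05):
    """Approximate credible interval: normal interval clipped to [0, 1]."""
    scale = 1.0 / math.sqrt(2.0 * omega * m * n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return max(0.0, theta_hat - z * scale), min(1.0, theta_hat + z * scale)


def bootstrap_coverage(u, v, omega, alpha, n_boot, rng):
    """Share of bootstrap intervals that contain the full-sample estimate."""
    theta_hat = mann_whitney(u, v)
    hits = 0
    for _ in range(n_boot):
        u_star = rng.choices(u, k=len(u))  # resample each group separately
        v_star = rng.choices(v, k=len(v))
        lower, upper = gibbs_interval(mann_whitney(u_star, v_star),
                                      omega, len(u), len(v), alpha)
        hits += lower <= theta_hat <= upper
    return hits / n_boot


def calibrate_learning_rate(u, v, alpha=0.05, n_boot=100, n_iter=20,
                            omega0=0.1, seed=0):
    """Stochastic approximation on log(omega) toward coverage 1 - alpha."""
    rng = random.Random(seed)
    log_omega = math.log(omega0)
    for t in range(n_iter):
        kappa = (t + 1) ** -0.51  # sum diverges, sum of squares converges
        coverage = bootstrap_coverage(u, v, math.exp(log_omega), alpha, n_boot, rng)
        # Over-coverage means the interval is too wide, so omega is increased.
        log_omega += kappa * (coverage - (1.0 - alpha))
    return math.exp(log_omega)


data_rng = random.Random(1)
u = [data_rng.gauss(1.0, 1.0) for _ in range(30)]
v = [data_rng.gauss(0.0, 1.0) for _ in range(30)]
print(calibrate_learning_rate(u, v))
```

A larger learning rate narrows the interval and lowers coverage, so the update moves the learning rate up when the estimated coverage exceeds the target and down otherwise.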
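The method implemented in Algorithm 1 requires the repeated processing of bootstrap samples and, therefore, can be computationally expensive when the sample sizes are large. For such cases, however, there is an alternative strategy, based on ideas in Lyddon et al. (2019), that is both easier and faster, while still providing approximate calibration in the sense above. The idea is that we want the Gibbs posterior variance to be roughly equal to the variance of its center/mode, which is the Mann–Whitney estimator $\hat\theta_{m,n}$. Under the additional assumption that
$$\frac{m}{m+n} \to \lambda \in (0, 1) \quad \text{as } m, n \to \infty,$$
Hoeffding (1948, Theorem 7.3) showed that the variance of $\hat\theta_{m,n}$ is asymptotically equivalent to
$$\frac{\tau_{10}}{m} + \frac{\tau_{01}}{n},$$
where
$$\tau_{10} = \mathrm{Cov}\{1(U > V),\, 1(U > V')\}, \quad \tau_{01} = \mathrm{Cov}\{1(U > V),\, 1(U' > V)\},$$
with $U'$ and $V'$ independent copies of $U$ and $V$, and $\mathrm{Cov}$ the covariance operator under joint distribution $P$. If we take the flat prior in our Gibbs posterior construction, then choosing
$$\omega = \Bigl\{2mn\Bigl(\frac{\hat\tau_{10}}{m} + \frac{\hat\tau_{01}}{n}\Bigr)\Bigr\}^{-1} = \{2(n\hat\tau_{10} + m\hat\tau_{01})\}^{-1},$$
with the obvious estimates
$$\hat\tau_{10} = \frac{1}{m}\sum_{i=1}^m \Bigl\{\frac{1}{n}\sum_{j=1}^n 1(U_i > V_j) - \hat\theta_{m,n}\Bigr\}^2, \quad \hat\tau_{01} = \frac{1}{n}\sum_{j=1}^n \Bigl\{\frac{1}{m}\sum_{i=1}^m 1(U_i > V_j) - \hat\theta_{m,n}\Bigr\}^2,$$
will make the Gibbs posterior variance approximately match the Mann–Whitney estimator variance, thus, approximate calibration. But note that our numerical results in Section 4 below are all based on the calibration strategy in Algorithm 1.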
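The variance-matching shortcut can be sketched in a few lines. The sketch assumes that the flat-prior Gibbs posterior has variance 1/(2ωmn) and that the variance of the Mann–Whitney estimator is approximated by τ10/m + τ01/n, with the two covariance components estimated by simple plug-in averages. These plug-in estimators and the function name are one natural choice made for illustration and are not necessarily the ones used in the paper.

```python
def plug_in_learning_rate(u, v):
    """Learning rate that matches posterior variance to the estimator variance.

    Returns (omega, variance), where variance approximates Var(theta_hat) by
    tau10 / m + tau01 / n and omega = 1 / (2 * m * n * variance).
    """
    m, n = len(u), len(v)
    theta_hat = sum(ui > vj for ui in u for vj in v) / (m * n)

    # For each Group 1 score, the share of Group 0 scores it exceeds.
    row_means = [sum(ui > vj for vj in v) / n for ui in u]
    # For each Group 0 score, the share of Group 1 scores that exceed it.
    col_means = [sum(ui > vj for ui in u) / m for vj in v]

    tau10 = sum((r - theta_hat) ** 2 for r in row_means) / m
    tau01 = sum((c - theta_hat) ** 2 for c in col_means) / n

    variance = tau10 / m + tau01 / n
    if variance == 0.0:
        raise ValueError("estimated variance is zero (e.g., perfectly separated scores)")
    return 1.0 / (2.0 * m * n * variance), variance


# Hypothetical scores for Group 1 (u) and Group 0 (v).
u = [2.1, 0.4, 1.7, 3.0]
v = [0.9, 1.8, -0.2]
omega, variance = plug_in_learning_rate(u, v)
print(omega, variance)
```

Unlike the bootstrap-based calibration, this requires a single pass over the score pairs, which is why it is attractive for large samples.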
4 Numerical examples
4.1 Simulation studies
Since the AUC is invariant when the two scores undergo the same monotone increasing transformation, we fix the distribution of one score to be standard normal and consider the following four examples:
Example 1.
and ;
Example 2.
—skew normal—and ;
Example 3.
and ;
Example 4.
and .
Figure 2 provides a visualization of the two densities in each of the four examples. Note that these four examples capture binormality, a slight violation of binormality, a bimodal case, and one where the two score distributions have different supports.
Here we compare the performance of the Gibbs posterior with the misspecified Bayesian model based on the rank-likelihood (BRL). For the Gibbs posterior, we use the flat prior and follow Algorithm 1, with bootstrap resampling and a step size sequence that satisfies (7). For the BRL, 50000 MCMC posterior samples are drawn, with burn-in of 10000. Tables 1–4 present the (absolute) bias of the posterior estimator, the average posterior standard deviation, the average length of the credible interval, and the corresponding coverage probability over the simulation replications, for sample sizes increasing from 25 to 125, for the four examples, respectively.
As can be seen from the bias and standard error columns, both the Gibbs and BRL posteriors concentrate around the true AUC, but the former, thanks to its built-in robustness, tends to have a smaller bias than the latter. The averaged credible interval length for BRL is slightly smaller than that for the Gibbs posterior, at least when the sample size is large, but at the cost of having unacceptably low coverage probability. Specifically, for large sample sizes, the credible intervals from the Gibbs posterior have coverage near the target level 0.95, while the corresponding BRL credible intervals tend to under-cover, sometimes severely. Such a result is also demonstrated in Gu and Ghosal (2009). A possible explanation is that the posterior mean of BRL converges to $\theta^\star$ but at a slower speed than the vanishing posterior spread.
Finally, we investigate the learning rate estimates under the Gibbs setting. Figure 3 shows, for each of the four simulation examples, the oracle learning rate (red) compared to those obtained from Algorithm 1. Recall, from Section 3.3, that the oracle learning rate corresponds to exact credibility–coverage matching, so the fact that the estimates based on Algorithm 1 closely follow the oracle is further indication that our Gibbs posterior is properly calibrated to achieve the desired coverage probability. Note, also, that the slope of the red line is roughly $-1$ which, on the log scale, agrees with the tolerable decay rate suggested by the general theory in Section 3.2.
4.2 Real data analysis
Data consisting of serum measurements of two biomarkers for pancreatic cancer were published by Wieand et al. (1989); see also the R package logcondens. This was a case-control study including subjects from the diseased group and subjects from the non-diseased group. Specifically, we consider one biomarker, a cancer antigen (CA-125), and evaluate its performance as a classifier to distinguish the case group from the control group. Table 5 presents results from two Gibbs posteriors and two BRLs. Gibbs1 and Gibbs2 employ Algorithm 1 with a flat prior and a truncated normal prior (), respectively. The two BRLs start the MCMC sampling with different initial values for , namely, for BRL1 and for BRL2, respectively, and use posterior samples with burn-in. The two Gibbs posteriors have estimates slightly larger than those from the two BRLs, with comparable standard errors. The two BRL credible intervals are slightly shorter than the Gibbs intervals but, in light of the simulation results presented above, especially for relatively large samples like the one considered here, it is likely that the BRL intervals are “too short,” while the Gibbs intervals are not.
5 Conclusion
In certain applications, the parameters of interest can be defined as minimizers of an appropriate risk function, separate from any statistical model. In such cases, one can avoid potential model misspecification biases by working with some kind of “model-free” approach. The present paper considered one such example, namely, inference on the AUC, where the state-of-the-art statistical model is one that depends on an infinite-dimensional nuisance parameter. As an alternative to switching to rank-based methods that ignore relevant features of the observed data, we propose to construct a Gibbs posterior distribution for direct inference on the AUC, without specifying a model or introducing any nuisance parameters. This simplifies our computations and prior specifications, while allowing us to avoid potential model misspecification biases without sacrificing the desirable asymptotic convergence properties. Moreover, we recommend a strategy for tuning the Gibbs posterior’s learning rate that leads to credible intervals having the nominal frequentist coverage probability.
A direct extension of our work here is inference on the analog of the AUC in settings that involve three-group classifiers, namely, the volume under the ROC surface, or VUS (e.g., Mossman 1999). Similar to the setup here for the AUC, the VUS is defined as , where is the score for the third group. Then much of the work presented here can be immediately generalized to the VUS case.
It would also be worthwhile to explore applications of the Gibbs posterior in other multivariate settings. One example is inference on multivariate quantiles, which are typically defined as minimizers of some expected loss (e.g., Chaudhuri 1996), so the construction of a Gibbs posterior is both appealing and relatively simple.
Acknowledgments
This work is partially supported by the U.S. National Science Foundation, DMS–1811802.
Appendix A Proof of Theorem 1
First, recall that, without loss of generality, we assume and , which implies that too. Next, when (and, hence, ) is large, will blow up around and, since the prior is fixed—and positive in an interval containing and, hence, —the Gibbs posterior will be dominated by the empirical risk term. Therefore, the prior does not affect the asymptotics so, for simplicity, we present the proof only for the case of a flat prior, .
By Chebyshev’s inequality and the bias–variance decomposition of mean square error,
[TABLE]
where and are the mean and variance of the Gibbs posterior distribution, respectively, and are given by
[TABLE]
with and the density and distribution functions, respectively, and
[TABLE]
Since is a consistent estimator of (see below), we clearly have that and , so those ratios involving and above are all . Then we can immediately conclude that which takes care of the variance term. For the bias term, we first have that , the Mann–Whitney statistic, is an unbiased estimator of and its variance is upper-bounded by
[TABLE]
Therefore, for any , there exists a number such that
[TABLE]
To see this, use Chebyshev’s inequality and the bound on the variance of to get that the left-hand side above is upper-bounded by
[TABLE]
Since and , we can take sufficiently large that the previous display is less than . This implies that and, hence, is . Putting everything together, we have that the right-hand side of (10) is
[TABLE]
But since , we have that the upper-bound in (10) converges to 0 in -probability as , proving the claim.
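The unbiasedness of the Mann–Whitney statistic used in the bias step can be illustrated by simulation. The sketch below is not part of the proof and rests on assumptions: it takes the AUC to be the probability that a score from the first group exceeds one from the second, uses a binormal setting with unit variances where that probability has the closed form $\Phi(\mu/\sqrt{2})$, and picks hypothetical sample sizes and a hypothetical mean shift `mu`.

```python
import math
import numpy as np

def mann_whitney(x, y):
    """Share of pairs (x_i, y_j) with x_i > y_j (assumed AUC convention)."""
    return float(np.mean(x[:, None] > y[None, :]))

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

rng = np.random.default_rng(1)
m, n, reps = 30, 40, 4000  # hypothetical group sizes and replication count
mu = 1.0                   # hypothetical mean shift between the groups

# For X ~ N(mu, 1) and Y ~ N(0, 1) independent, P(X > Y) = Phi(mu / sqrt(2)).
theta_true = normal_cdf(mu / math.sqrt(2.0))

estimates = np.array([
    mann_whitney(rng.normal(mu, 1.0, m), rng.normal(0.0, 1.0, n))
    for _ in range(reps)
])
mc_mean = estimates.mean()                        # Monte Carlo mean of the statistic
mc_se = estimates.std(ddof=1) / math.sqrt(reps)   # its Monte Carlo standard error

print(theta_true, mc_mean, mc_se)
```

If the statistic is unbiased, the Monte Carlo mean should differ from the closed-form value by no more than a few Monte Carlo standard errors.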
References
- Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414.
- Bamber, D. (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12(4):387–415.
- Bissiri, P. G., Holmes, C. C., and Walker, S. G. (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130.
- Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159.
- Brodersen, K. H., Ong, C. S., Stephan, K. E., and Buhmann, J. M. (2010). The binormal assumption on precision-recall curves. In 2010 20th International Conference on Pattern Recognition, pages 4263–4266. IEEE.
- Cai, T. and Moskowitz, C. S. (2004). Semi-parametric estimation of the binormal ROC curve for a continuous diagnostic test. Biostatistics, 5(4):573–586.
- Calì, C. and Longobardi, M. (2015). Some mathematical properties of the ROC curve and their applications. Ricerche di Matematica, 64(2):391–402.
- Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association, 91(434):862–872.
