Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies
Zhenke Wu, Irena Chen

TL;DR
This paper extends a statistical model to incorporate covariate effects in disease cause estimation from case-control studies, improving accuracy and inference of disease etiology.
Contribution
It introduces a regression framework for nested partially-latent class models that accounts for covariates and control data, enhancing disease etiology estimation.
Findings
Less biased estimation of population etiologic fractions (PEFs).
More valid inference compared to models ignoring covariates.
Application reveals disease etiology dependence on season, age, severity, and HIV status.
Abstract
In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as "population etiologic fractions" (PEFs). Despite recent methodological advances, the scientific need of incorporating control data to estimate the effect of explanatory variables upon the PEFs, however, remains unmet. In this paper, we build on and extend nested partially-latent class model (npLCMs, Wu et al., 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls' diagnostic measures…
| age | very severe (VS) | HIV positive | cases () | controls () |
| (case-only) | total: 524 (100) | total: 964 (100) | ||
| \rowcolor[HTML]C0C0C0 0 | 0 | 0 | 208 (39.7) | 545 (56.5) |
| \rowcolor[HTML]C0C0C0 1 | 0 | 0 | 72 (13.7) | 278 (28.8) |
| 0 | 1 | 0 | 116 (22.1) | 0 |
| 1 | 1 | 0 | 33 (6.3) | 0 |
| 0 | 0 | 1 | 37 (7.1) | 85 (8.8) |
| 1 | 0 | 1 | 24 (4.5) | 51 (5.3) |
| 0 | 1 | 1 | 25 (4.8) | 0 |
| 1 | 1 | 1 | 3 (0.6) | 0 |
| case: | ||||
| control: | - |
| site\cause | A | B | C | D | E | F |
| 1 | 0.5 | 0.2 | 0.15 | 0.05 | 0.05 | 0.05 |
| 2 | 0.2 | 0.5 | 0.15 | 0.05 | 0.05 | 0.05 |
| 3 | 0.2 | 0.15 | 0.5 | 0.05 | 0.05 | 0.05 |
| 4 | 0.2 | 0.15 | 0.05 | 0.5 | 0.05 | 0.05 |
| 5 | 0.2 | 0.15 | 0.05 | 0.05 | 0.5 | 0.05 |
| 6 | 0.2 | 0.15 | 0.05 | 0.05 | 0.05 | 0.5 |
| 7 | 0.05 | 0.2 | 0.15 | 0.5 | 0.05 | 0.05 |
| site\cause | A | B | C | D | E | F |
| 1 | 73 | 100 | 100 | 99 | 100 | 100 |
| 2 | 100 | 79 | 100 | 100 | 100 | 99 |
| 3 | 100 | 100 | 83 | 98 | 100 | 100 |
| 4 | 100 | 100 | 100 | 73 | 100 | 99 |
| 5 | 99 | 100 | 100 | 100 | 85 | 100 |
| 6 | 100 | 100 | 100 | 99 | 100 | 88 |
| 7 | 100 | 100 | 100 | 81 | 100 | 99 |
| site \cause | A | B | C | D | E | F | ||
| I | coverage | 1 | 99 | 93 | 97 | 94 | 96 | 90 |
| 2 | 97 | 90 | 96 | 97 | 95 | 94 | ||
| 3 | 100 | 95 | 98 | 98 | 95 | 96 | ||
| 4 | 93 | 94 | 96 | 95 | 92 | 99 | ||
| 5 | 96 | 94 | 96 | 97 | 95 | 98 | ||
| 6 | 96 | 97 | 98 | 99 | 95 | 96 | ||
| 7 | 96 | 97 | 91 | 100 | 95 | 96 | ||
| posterior summary | truth (Site 1) | 0.5 | 0.2 | 0.15 | 0.05 | 0.05 | 0.05 | |
| average of post. mean | 0.495 | 0.197 | 0.152 | 0.053 | 0.053 | 0.051 | ||
| average of post. s.d. | 0.023 | 0.018 | 0.016 | 0.01 | 0.01 | 0.01 | ||
| average PMSE | 0.0010 | 0.0007 | 0.0005 | 0.0002 | 0.0002 | 0.0002 | ||
| II∗ | coverage | 1 | 98 | 89 | 98 | 99 | 100 | 100 |
| 2 | 97 | 95 | 96 | 100 | 100 | 99 | ||
| 3 | 93 | 98 | 91 | 99 | 99 | 100 | ||
| 4 | 95 | 98 | 100 | 95 | 99 | 100 | ||
| 5 | 94 | 94 | 99 | 99 | 91 | 100 | ||
| 6 | 95 | 97 | 100 | 99 | 99 | 90 | ||
| 7 | 100 | 95 | 94 | 96 | 100 | 99 | ||
| posterior summary | truth (Site 1) | 0.5 | 0.2 | 0.15 | 0.05 | 0.05 | 0.05 | |
| average post. mean | 0.417 | 0.163 | 0.138 | 0.091 | 0.086 | 0.106 | ||
| average post. s.d. | 0.27 | 0.174 | 0.162 | 0.135 | 0.13 | 0.141 | ||
| average PMSE | 0.131 | 0.067 | 0.056 | 0.034 | 0.031 | 0.042 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Bayesian Inference · Pneumonia and Respiratory Infections · Data-Driven Disease Surveillance
Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies
Zhenke Wu
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; E-mail: [email protected].
Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI 48109, USA.
Irena Chen
Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; E-mail: [email protected].
Abstract
In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as “population etiologic fractions” (PEFs). Despite recent methodological advances, the scientific need of incorporating control data to estimate the effect of explanatory variables upon the PEFs, however, remains unmet. In this paper, we build on and extend nested partially-latent class model (npLCMs, Wu et al.,, 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls’ diagnostic measures given the covariates via a separate regression model and a priori encourage simpler conditional dependence structures. We use Markov chain Monte Carlo for posterior inference of the PEF functions, cases’ latent classes and the overall PEFs of policy interest. We illustrate the regression analysis with simulations and show less biased estimation and more valid inference of the overall PEFs than an npLCM analysis omitting covariates. An regression analysis of data from a childhood pneumonia study site reveals the dependence of pneumonia etiology upon season, age, disease severity and HIV status.
Keywords: Bayesian methods; Case-control studies; Disease etiology; Latent class regression analysis; Measurement errors; Pneumonia; Semi-supervised learning.
1 Introduction
In epidemiologic studies of disease etiology, one important scientific goal is to assess the effect of explanatory variables upon disease etiology. Based on multiple binary non-gold-standard diagnostic measurements made on a list of putative causes with different error rates, this paper develops and demonstrates a regression analytic approach for drawing inference about the cause-specific fractions among the case population that depend on covariates. We illustrate the analytic needs raised by a study of pediatric pneumonia etiology.
Pneumonia is a clinical condition associated with infection of the lung tissue, which can be caused by more than 30 different species of microorganisms, including bacteria, viruses, mycobacteria and fungi (Scott et al.,, 2008). The Pneumonia Etiology Research for Child Health (PERCH) study is a seven-country case-control study of the etiology of severe and very severe pneumonia and has enrolled more than hospitalized children under five years of age and more than healthy controls (PERCH Study Group,, 2019). The goal of the PERCH study is to estimate the population fractions of cases due to the pathogen causes, referred to as “population etiologic fractions” (PEFs) and to assign cause-specific probabilities for each pneumonia child given her measurements, termed as “individual etiologic fractions” (IEFs). The PERCH study also aims to understand the variation of the PEFs as a function of factors such as region, season, a child’s age, disease severity, nutrition status and human immunodeficiency virus (HIV) status.
The cause of lung infection cannot, except in rare cases, be directly observed (Hammitt et al.,, 2017). The PERCH study tests the presence or absence of a list of pathogens using specimens in peripheral compartments including the blood, sputum, pleural fluid and nasopharyngeal (NP) cavity (Crawley et al.,, 2017). In this paper, we focus on two sources of imperfect measurements: (a) NP Polymerase Chain Reaction (NPPCR) results from cases and controls that are not perfectly sensitive or specific, referred to as “bronze-standard” (BrS) data; and (b) blood culture (BCX) results from cases only that are perfectly specific but lack sensitivity, referred to as “silver-standard” (SS) data.
Valid inference about the population and individual etiologic fractions must address three salient characteristics of the measurements. First, tests lacking sensitivity such as NPPCR and BCX may miss true causative agent(s) which if unadjusted may underestimate the PEFs. Second, imperfect diagnostic specificities may result in the detection of multiple pathogens in NPPCR that may indicate asymptomatic carriage but not causes of pneumonia. Determining the primary causative agent(s) must use statistical controls. Third, multiple specimens are tested among the cases with only a subset available from the controls. Other large-scale disease etiology studies have raised similar analytic needs and challenges of integrating multiple sources of imperfect measurements of multiple pathogens to produce an accurate understanding of etiology (e.g., Saha et al.,, 2018; Kotloff et al.,, 2013).
To address the analytic needs, Wu et al., (2016) introduced a partially-latent class model (pLCM) as an extension to classical latent class models (LCMs Lazarsfeld,, 1950; Goodman,, 1974) that uses case-control data to estimate the PEFs. This prior work shows the capacity of the multivariate specimen measurements to inform the distribution of unobserved, or “latent” health status for an individual and the population. PLCM is a finite mixture model with components for multivariate binary data where a case observation is drawn from a mixture of components each representing a cause of disease, or “disease class”; Controls have no infection in the lung hence are assumed drawn from an observed class. The pLCM is a semi-supervised method for learning the unobserved classes, where the “label” (cause of disease) is observed for only a subset of subjects. Let represent case ’s disease class which is categorically distributed with probabilities equal to the PEFs in the -dimensional simplex . A case class can represent a single- or multiple-pathogen cause of pneumonia, or pathogen causes not targeted by the assays which we refer to as “Not Specified (NoS)”. PLCM uses a vector of response probabilities to specify the conditional distribution of the measurements in each class. PLCM is an example of restricted LCMs (RLCMs, Wu et al.,, 2019) which restrict how the response probabilities differ by class to reflect the scientific knowledge that causative pathogens are more likely to appear in the upper respiratory tract in a pneumonia child than a healthy control. In particular, each causative pathogen is assumed to be observed with a higher probability in case class (sensitivity or true positive rate, TPR) than among the controls; A non-causative pathogen is observed with the same probability as in the controls (1 - specificity or false positive rate). Under the pLCM, a higher observed marginal positive rate of pathogen among cases than controls indicates etiologic significance.
In a Bayesian framework, measurements of differing precisions can be optimally combined under a pLCM to generate stronger evidence about . The pLCM is partially-identified (Jones et al.,, 2010; Gu and Xu, 2019b, ). There exist two sets of values of a subset of model parameters (here the TPRs) that the likelihood function alone cannot distinguish even with infinite samples; Bounds on the parameters however are available (e.g., Wu et al.,, 2016, Equation 6). Informative prior distributions for the TPRs elicited from laboratory experts or estimated from vaccine probe studies for a subset of pathogens (Feikin et al.,, 2014) can be readily incorporated to improve inference (Gustafson,, 2015).
The pLCM assumes “local independence” (LI) which means the BrS measurements are mutually independent given the class membership. This classical assumption is central to mixture models for multivariate data, because the estimation procedures essentially find the optimal partition of observations into subgroups so that the LI approximately holds in each subgroup. Deviations from LI, or “local dependence” (LD) are testable using the control BrS data, which can be accounted for by an extension of pLCM, called nested partially-latent class model (npLCM, Wu et al.,, 2017). In each class, the npLCM uses the classical LCM formulation that has the capacity to describe complex multivariate dependence among discrete data (Dunson and Xing,, 2009). For example, it assumes the within-class correlations among NPPCR tests are induced by unobserved heterogeneity in subjects’ propensities for pathogens colonizing the nasal cavities. In particular, LD is induced in an npLCM by nesting latent subclasses within each class , where subclasses respond with distinct vectors of probabilities. In a Bayesian framework with a prior that encourages few important subclasses, the npLCM reduces the bias in estimating , retains estimation efficiency and offers more valid inference under substantial deviation from LI.
Extensions to incorporate covariates in an npLCM are critical for two reasons. Firstly, covariates such as season, age, disease severity and HIV status may directly influence . Secondly, in an npLCM without covariates, the relative probability of assigning a case subject to class versus class depends on the FPRs (Wu et al.,, 2016) which are estimable using the control data. However, the FPRs may vary by covariates which if not modeled will bias the assignment of cause-specific probabilities for each case subject. For example, pathogen A found in a case’s nasal cavity less likely indicates etiologic significance than a colonization during seasons with high asymptomatic carriage rates, and much more so when the same pathogen rarely appears in healthy subjects.
Adapting existing no-covariate methods to account for discrete covariates, one may perform a fully-stratified analysis by fitting an npLCM to the case-control data in each covariate stratum. Like pLCM, the npLCM is partially-identified in each stratum (Wu et al.,, 2017), necessitating multiple sets of independent informative priors across multiple strata. There are two primary issues with this approach. First, sparsely-populated strata defined by many discrete covariates may lead to unstable PEF estimates. Second, it is often of policy interest to quantify the overall cause-specific disease burdens in a population. Let the overall PEFs be the empirical average of the stratum-specific PEFs. Since the informative TPR priors are often elicited for a case population and rarely for each stratum, reusing independent prior distributions of the TPRs across all the strata will lead to overly-optimistic posterior uncertainty in , hampering policy decisions.
Estimating disease etiology across discrete and continuous epidemiologic factors needs new methods in a general modeling framework. In this paper, we extend the npLCM to perform regression analysis in case-control disease etiology studies that (a) incorporates controls to estimate the PEFs, (b) specifies parsimonious functional dependence of upon covariates such as additivity, and (c) correctly assesses the posterior uncertainty of the PEF functions and the overall PEFs by applying the TPR priors just once.
The rest of the paper is organized as follows. Section 2 overviews the npLCM without covariates. Section 3 builds on the npLCM and makes the regression extension. We demonstrate the estimation of disease etiology regression functions through simulations in Section 4; We also show superior inferential performance of the regression model in estimating the overall PEFs relative to an analysis omitting the covariates. In Section 5, we characterize the effect of seasonality, age, HIV status upon the PEFs by applying the proposed npLCM regression model to the PERCH data. The paper concludes with a discussion.
2 Overview of npLCMs without Covariates
Let binary BrS measurements indicate the presence or absence of pathogens for subject . Let indicate a case () or a control ([math]) subject. If , let represent case ’s unobserved disease class; Otherwise, let because a control subject’s class is known (in PERCH, no lung infection). In this paper, we simplify the presentation of models by focusing on single-pathogen causes (hence ). The npLCM readily extends to for including additional pre-specified multi-pathogen and/or “Not Specified” (NoS) causes (Wu et al.,, 2017).
The likelihood function for an npLCM has three components: (a) PEFs or cause-specific case fractions: ; (b) : a table of probabilities of making binary observations in a case class ; (c) : the same probability table but for controls. Since cases’ disease classes are unobserved, the distribution of cases’ measurements is a finite-mixture model with weights for the disease classes: .
Models in this section differ by how and {} are specified; Regression models in Section 3 further incorporate covariate into the specifications ( as well). More specifically, the likelihood of an npLCM (Wu et al.,, 2017) is a product of case () and control () likelihood functions
[TABLE]
where and are sensitivity and specificity parameters necessary for modeling the imperfect measurements; The rest of parameters , . Existing methods for estimating in the framework of npLCM can be classified by whether or not and assumes local independence (LI) which means measurements are independent of one another given the class (). In Equation (1), LI results if and only if ; Otherwise, and account for deviations from LI given a control or disease class.
PLCM. under the original pLCM (Wu et al.,, 2016) satisfies LI and equals a product of probabilities: , where is the probability mass function for a product Bernoulli distribution given the success probabilities , and the parameters represent the positive rates absent disease, referred to as “false positive rates” (FPRs). For example, in the PERCH data, Respiratory Syncytial Virus (RSV) has a low observed FPR because of its rare appearance in controls’ NPs; Other pathogens such as Rhinovirus (RHINO) have higher observed FPRs.
For , the pLCM makes a key “non-interference” assumption that disease-causing pathogen(s) are more frequently detected among cases than controls and the non-causative pathogens are observed with the same rates among cases as in controls (Wu et al.,, 2017). The “non-interference” assumption says that in a case class is a product of the probabilities of measurements made (a) on the causative pathogen , , where and (b) on the non-causative pathogens , where represents all but the -th element in a vector . The parameter is termed “true positive rate” (TPR) and may be larger than the FPR ; Under the single-pathogen-cause assumption, pLCM uses TPRs for causes and FPRs .
NPLCM. To reduce estimation bias in under deviations from LI, the “nested pLCM” or npLCM extends the original pLCM to describe residual correlations among binary pathogen measurements in the controls () and in each case class (, ) (Wu et al.,, 2017). The extension is motivated by the ability of the classical LCM formulation (Lazarsfeld,, 1950) to approximate any joint multivariate discrete distribution (Dunson and Xing,, 2009).
For in the controls, the npLCM introduces subclasses; The original pLCM results if . Given a subclass , the probability of observing binary measurements among controls is , where is the -th column of a by FPR matrix . Since we do not observe controls’ subclasses, is a weighted average of according to the subclass probabilities : .
For in case class , the npLCM again introduces unobserved subclasses and assumes is a weighted average of according to the case subclass weights : . In particular, the npLCM assumes the probability of observing in subclass in disease class , , is a product of the probabilities of making an observation (a) on the causative pathogen : and (b) on non-causative pathogens , where is the -th column of excluding the -th row. We collect the TPRs in a by TPR matrix . We summarize the preceding specification by , where the vector represents the positive rates for measurements in subclass of disease class : which equals the TPR for a causative pathogen and the FPR otherwise; Here is an indicator function that equals if the statement A is true and [math] otherwise.
The likelihood for npLCM results upon substituting and above into Equation (1): . Setting and , the special case of pLCM results.
Similar to the pLCM, the FPRs in the npLCM are shared among controls and case classes over non-causative pathogens (via ). Different from the pLCM, the subclass mixing weights may differ between cases () and controls (). The special case of , means the covariation patterns among the non-causative pathogens in a disease class is no different from the controls. However, relative to controls, diseased individuals may have different strength and direction of measurement dependence in each disease class. By allowing the subclass weights to differ between the cases and the controls, npLCM is more flexible than pLCM in referencing cases’ measurements against controls.
3 Regression Analysis via npLCM
We extend npLCM to perform regression analysis of data , where are covariates that may influence case ’s etiologic fractions and is a possibly different vector of covariates that may influence the subclass weights among the controls and the cases; Let the continuous covariates comprise the first and elements of and , respectively. A subset of may be available from the cases only. We let if so that all the covariates for a control subject are included in ; Let for a case subject. For example, healthy controls have no disease severity information. We let three sets of parameters in an npLCM (1) depend on the observed covariates: (a) the etiology regression function among cases, , which is of primary scientific interest, (b) the conditional probability of measurements given covariates in case classes: , , (c) and in the controls ; We keep the specifications for the TPRs and FPRs (, ) as in the original npLCM.
3.1 Disease Etiology Regression
is the primary target of inference. Recall that represents case ’s disease being caused by pathogen . We assume this event occurs with probability that depends upon covariates. In our model, we use a multinomial logistic regression model , , where is the log odds of case in disease class relative to : . Without specifying a baseline category, we treat all the disease classes symmetrically which simplifies prior specification. We further assume additive models for , where is the subvector of the predictors that enters the model for all disease classes as linear predictors and collects all the parameters. For covariates such as enrollment date that serves as a proxy for factors driven by seasonality, nonlinear functional dependence is expected. We use B-spline basis expansion to approximate and use P-spline for estimating smooth functions (Lang and Brezger,, 2004). Finally, we specify the distribution of case measurements given disease class , covariates and . We extend the case likelihood in an npLCM (1) to let the subclass weights depend on covariates : . Integrating over unobserved disease classes, we obtain the likelihood function for the cases that incorporates covariates :
[TABLE]
where and are the regression parameters; The form of is introduced in the model for controls.
3.2 Covariate-dependent reference distribution
Data from controls provide requisite information about the specificities and covariations at distinct covariate values, necessitating adjustment in an npLCM analysis. For example, factors such as enrollment date is a proxy for season and may influence the background colonization rates and interactions of some pathogens that circulate more during winter (Obando-Pacheco et al.,, 2018; Nair et al.,, 2011). We propose a novel approach to estimating the reference distribution of measurements that may depend on covariates using control data.
The regression model for a control subject is a mixture model with covariate-dependent mixing weights : where FPRs do not depend on covariates and the vector lies in a -simplex . We discuss the FPRs and the subclass weight functions in order.
Firstly, constant FPR profiles enable coherent interpretation across individuals with different covariate values (Erosheva et al.,, 2007). FPR profile receives a weight of for a control subject with covariates . The marginal FPRs in the controls , , also depend on . Consequently, observed marginal control positive curve for a pathogen informs how different the FPRs are across the subclasses. For example, if the NPPCR measure of pathogen A shows strong seasonal trends among the controls, the estimated FPRs will be more variable across the subclasses. And the subclass with a high FPR will receive a larger weight during seasons with higher carriage rates in controls. The control model reduces to special cases, with covariate-independent , , resulting in the in a -subclass npLCM without covariates; A further single-subclass constraint () gives the in the original pLCM.
Secondly, we parameterize the case and control subclass weight regressions and using the same regression form but different parameters.
Control subclass weight regression. We rewrite the subclass weights , using a stick-breaking parameterization. Let be a link function. Let be subject ’s linear predictor at stick-breaking step . Using the stick-breaking analogy, we begin with a unit-length stick, break a segment of length and continue breaking a fraction from the remaining and so on; At step , we break a fraction from what is left in the preceding steps resulting in the -th stick segment of length ; We stop until sticks of variable lengths result. In this paper, we use the logistic function which is consistent with the multinomial logit regression for so that the priors of the coefficients and can be similar (Supplementary Materials A.2). Generalization to other link functions such as the probit function is straightforward (e.g., Rodriguez and Dunson,, 2011). We use this parameterization to introduce a novel shrinkage prior on a simplex for the subclass weights (see Supplementary Materials A.1) which encourages fewer than effective subclasses, or “-sparse” shrinkage prior on the simplex. This provides parsimonious approximation to the conditional distribution of control measurements using a few subclasses.
In our analysis, we use generalized additive models (Hastie and Tibshirani,, 1986) for the -th linear predictor for . We have parameterized the possibly nonlinear using B-spline basis expansions with coefficients ; are the linear effects of a subset of predictors which can include an intercept and is a subvector of predictors ; Let collect all the regression parameters. Following Lang and Brezger, (2004), we constrain to have zero means for statistical identifiability. Supplementary Materials A.2 provides the technical details about the parameterization of .
The subclass-specific intercepts globally control the magnitudes of the linear predictors. We hence propose priors on to a priori encourage few subclasses (see Supplementary Materials A.1). In particular, a large positive intercept makes and hence breaks nearly the entire remaining stick after the -th stick-breaking. Since the stick-breaking parameterization one-to-one maps to a classical latent class regression model formulation for the control data, the linear predictor and the sum are identifiable except in a Lebesgue zero set of parameter values, or “generic identifiability” (Huang and Bandeen-Roche,, 2004). Consequently, even if the intercept is not statistically identified if includes an intercept , the MCMC samples of the statistically identifiable functions can provide valid posterior inferences (Carlin and Louis,, 2009). We write the control likelihood with covariates as . Supplementary Materials B provides further remarks on the assumption for introducing covariates into the control model.
Case subclass weight regression. The subclass weight regression functions for cases are also specified via a logistic stick-breaking regression as in the controls but with different parameters: , . Since given the TPRs and the FPRs, the subclass weights fully determine the joint distribution hence the measurement dependence in each class, we let and be different between cases and controls for any .
Let the -th linear predictor , where are the regression parameters that differ from the control counterpart (). In particular, we approximate , here using the same set of B-spline basis functions as in the controls but estimate a different set of basis coefficients . In addition, we have directly used the intercepts from the control model to ensure only important subclasses in the controls are used in the cases. For example, absent covariates , a large and positive effectively halts the stick breaking procedure at step for the controls (); Applying the same intercept to the cases makes .
Combining the case () and control likelihood () with covariates, we obtain the joint likelihood for the regression model .
Remark 1**.**
Under an assumption (A1): the case subclass weights are constant over covariates: , , the regression model reduces to an npLCM model without covariates upon integration over a distribution of covariates . To see this, the case and control likelihood functions and integrate to and , respectively; Here and where and are probability or empirical distributions of and , respectively. The mathematical equivalence enables valid inference about the overall PEFs omitting and (see Supplementary Materials E.2 for an example). The no-covariate analysis becomes deficient under deviations from (A1); Section 4 provides examples.
3.2.1 Priors and Posterior Inference
The unknown parameters include the coefficients in the etiology regression (, the subclass mixing weight regression for the cases () and the controls (), the true and false positive rates . With typical samples sizes about controls and cases in each study site, the number of parameters in controls likelihood () easily exceeds the number of distinct binary measurement patterns observed. To overcome potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) Few non-trivial subclasses via a novel additive half-Cauchy prior for the intercepts , and (b) for a continuous variable, smooth regression curves , and by Bayesian Penalized-splines (P-splines) (Lang and Brezger,, 2004) combined with shrinkage priors on the spline coefficients (Ni et al.,, 2015) to encourage towards constant values, , which reduces to the original npLCM. Supplementary Materials A details the prior specifications.
We use the Markov chain Monte Carlo (MCMC) algorithm to draw samples of the unknowns to approximate their joint posterior distribution (Gelfand and Smith,, 1990). Flexible posterior inferences about any functions of the model parameters and individual latent variables are available by plugging in the posterior samples of the unknowns. For example, the posterior samples of the case positive rate curve for pathogen help evaluate model fit. The red bands in Row 1 of Figure 1 are posterior credible bands obtained by substituting relevant parameters with their sampled values across MCMC iterations in . The npLCMs with or without covariates are fitted using a free and publicly available R package baker (https://github.com/zhenkewu/baker). Baker calls an external automatic Bayesian model fitting software JAGS 4.2.0 (Plummer et al.,, 2003) from within R and provides functions to visualize the posterior distributions of the unknowns (e.g., the PEFs and cases’ latent disease class indicators) and perform posterior predictive model checking (Gelman et al.,, 1996). Supplementary Materials C details the convergence diagnostics.
4 Simulations
We simulate case-control bronze-standard (BrS) measurements along with observed continuous and/or discrete covariates under multiple combinations of true model parameter values and sample sizes that mimic the motivating PERCH study. In Simulation I, we illustrate flexible statistical inferences about the PEF functions . In Simulation II, we focus on the overall PEFs that quantify the overall cause-specific disease burdens in a population which are of policy interest. Let be an empirical average of , . We compare the frequentist properties of the posterior mean obtained from analyses with or without covariate (Little et al.,, 2011). Regression analyses reduce estimation bias, retain efficiency and provide more valid frequentist coverage of the CrIs. The relative advantage varies by the true data generating mechanism and sample sizes.
In all analyses here, we use a working number of subclasses, with independent Beta(7.13,1.32) TPR prior distributions that match 0.55 and 0.99 with the lower and upper quantiles, respectively; We specify Beta(1,1) for the identifiable FPRs. The priors for the regression coefficients follow the specifications in Supplementary Materials A.
Simulation I. We demonstrate that the inferential algorithm recovers the true PEF functions . We simulate cases and controls for each of two levels of (a discrete covariate) and uniformly sample the subjects’ enrollment dates over a period of days. Supplementary Materials D specifies the true data generating mechanism and the regression specifications. Based on the simulated data, pathogen A has a bimodal positive rate curve mimicking the trends observed of RSV in one PERCH site; other pathogens have overall increasing positive rate curves over enrollment dates. We set the simulation parameters in a way that the marginal control rate may be higher than cases for small ’s (impossible under the more restrictive pLCM). Row 2 of Figure 1 visualizes for the causes (by column), the posterior means (thin black line) and CrIs (gray bands) for the etiology regression curves are close to the simulation truths . Supplementary Materials E provides additional simulation results to assess the recovery of the true for a discrete covariate .
Simulation II. We show the regression model accounts for population stratification by covariates hence reduces the bias of the posterior mean in estimating the overall PEFs () and produces more valid CrIs. We illustrate the advantage of the regression approach under simple scenarios with a single two-level covariate ; We let . We perform npLCM regression analysis with for each of replication data sets simulated under each of scenarios detailed in Supplementary Materials D that correspond to distinct numbers of causes, sample sizes, relative sizes of PEF functions (rare versus popular etiologies), signal strengths (more discrepant TPRs and FPRs indicate stronger signals, Wu et al., (2016)), and effects of on and .
In estimating , we evaluate the bias , where is the true overall PEF, and is an empirical average of the posterior mean PEFs at . We also evaluate the empirical coverage rates of the CrIs.
The regression model incorporates covariates and performs better in estimating than a model omitting covariates. For example, Figure 2(a) shows for that, relative to no-covariate npLCM analyses, regression analyses produce posterior means that on average have negligible relative biases (percent difference between the posterior mean and the truth relative to the truth) for each pathogen across simulation scenarios. As expected, we observe slight relative biases from the regression model in the bottom two rows of Figure 2(a), because the informative TPR prior Beta(7.13,1.32) has a mean value lower than the true TPR ; A more informative prior further reduces the relative bias; See additional simulations in Supplementary Materials E on the role of informative TPR priors. Figure 2(b) regression analyses also produce CrIs for that have more valid empirical coverage rates in all scenarios. Misspecified models without covariates concentrate the posterior distribution away from the true overall PEFs, resulting in large biases that dominate the posterior uncertainty of which is evident from the more severe undercoverages with higher TPRs and lower FPRs (row 3 and 4 versus row 1 and 2, Figure 2).
5 Regression Analysis of PERCH Data
We restrict attention in this regression analysis to cases and controls from one of the PERCH study sites in the Southern Hemisphere that collected information on enrollment date (, August 2011 to September 2013; standardized), age (dichotomized to younger or older than one year), disease severity for cases (severe or very severe), HIV status (positive or negative) and presence or absence of seven species of pathogens (five viruses and two bacteria, representing a subset of pathogens evaluated) in nasopharyngeal (NP) specimens tested with polymerase chain reaction (PCR), or NPPCR (bronze-standard, BrS); We also include in the analysis the blood culture (BCX, silver-standard, SS) results for two bacteria from cases only. Detailed analyses of the entire data are reported in PERCH Study Group, (2019).
Table 1 shows the observed case and control frequencies by age, disease severity and HIV status. The two strata with the most subjects are severe pneumonia children who were HIV negative and under or above one year of age. Some low or zero cell counts preclude fitting npLCMs by stratum. Regression models with additive assumptions among the covariates can borrow information across strata and stabilize the PEF estimates. Supplemental Figure S5 shows summary statistics for the NPPCR (BrS) and BCX (SS) data including the positive rates in the cases and the controls and the conditional odds ratio (COR) contrasting the case and control rates adjusting for the presence or absence of other pathogens (NPPCR only).
For NPPCR, pathogens RSV and Haemophilus influenzae (HINF) are detected with the highest positive rates among cases: and , respectively, which are higher than the corresponding control rates ( and ). The CORs are large, for RSV and for HINF, indicating etiologic importance. Adenovirus (ADENO) also has a statistically significant COR of . Human metapneumovirus type A or B (HMPV_A_B) and Parainfluenza type 1 virus (PARA_1) have larger positive and statistically significant CORs of and . However, the two pathogens rarely appear in cases’ nasal cavities (HMPV_A_B: , PARA_1: ), which in light of high sensitivities means non-primary etiologic roles. For the rest of pathogens, we observed similar case and control positive rates as shown by the statistically non-significant CORs (RHINO (case: ; control: ) and Streptococcus pneumoniae (PNEU) (case: ; control: ). Similar to Wu et al., (2017), we integrate case-only SS measurements for HINF and PNEU by using informative priors of the sensitivities (e.g., from vaccine probe studies e.g., Feikin et al., (2014)) to adjust the PEF estimates in a coherent Bayesian framework. It is expected that the rare detection of the two bacteria, for HINF and for PNEU from SS data, will lower their PEF estimates relative to the ones obtained from an NPPCR-only analysis.
We include in the regression analysis a cause “Not Specified (NoS)” to account for true pathogen causes other than the seven pathogens. We incorporate the prior knowledge about the TPRs of the NPPCR measures from laboratory experts. We set the Beta priors for sensitivities by and , so that the and quantiles match the lower and upper ranges of plausible sensitivity values of and , respectively. We specify the Beta(7.59,58.97) prior for the two TPRs of SS measurements similarly but with a lower range of . We use a working number of subclasses . In the etiology regression model , we use 7 d.f. for B-spline expansion of the additive function for the standardized enrollment date at uniform knots along with three binary indicators for age older than one, very severe pneumonia, HIV positive; In the subclass weight regression model , we use 5 d.f. for the standardized enrollment date with uniform knots and two indicators for age older than one and HIV positive. The prior distributions for the etiology and subclass weight regression parameters follow the specification in Supplementary Materials A.
The regression analysis produces seasonal estimates of the PEF function for each cause that varies in trend and magnitude among the eight strata defined by age, disease severity and HIV status. Figure 3 shows among two age-HIV-severity strata the posterior mean curve and pointwise credible bands of the etiology regression functions as a function of . For example, among the younger, HIV negative and severe pneumonia children (Figure 3(a)), the PEF curve of RSV is estimated to have a prominent bimodal temporal pattern that peaked at two consecutive winters in the Southern Hemisphere (June 2012 and 2013). Other single-pathogen causes HINF, PNEU, ADENO, HMPV_A_B and PARA_1 have overall low and stable PEF curves across seasons. The estimated PEF curve of NoS shows a trend with a higher level of uncertainty that is complementary to RSV because given any enrollment date the PEFs of all the causes sum to one. In contrast, Figure 3(b) shows a lower degree of seasonal variation of RSV PEF curve among the older, HIV negative and severe pneumonia children.
The regression model accounts for stratification of etiology by the observed covariates and assigns cause-specific probabilities for two cases who have identical measurements but different covariate values. Supplemental Figure S6 shows for two cases with all negative NPPCR results (the most frequent pattern among cases), the older case has a lower posterior probability of her disease caused by RSV and higher probability of being caused by NoS. Indeed, contrasting older and younger children while holding the enrollment date, HIV, severity constant, the estimated difference in the log odds (i.e., log odds ratio) of a child being caused by RSV versus NoS is negative: .
Given age, severity and HIV status, we quantify the overall cause-specific disease burdens by averaging the PEF function estimates by the empirical distribution of the enrollment dates. Contrasting the results in the two age-severity-HIV strata in Figure 3(a) and 3(b), since the case positive rate of RSV among the older children reduces from to but the control positive rates remain similar (from to ), the overall PEF of RSV () decreases from to and attributing a higher total fraction of cases to NoS () from to ; The overall PEFs for other causes remain similar.
6 Discussion
In disease etiology studies where gold-standard data are infeasible to obtain, epidemiologists need to integrate multiple sources of data of distinct quality to draw inference about the population and individual etiologic fractions. While the existing methods based on npLCM account for imperfect diagnostic sensitivities and specificities, complex measurement dependence and missingness, they do not describe the relationship between covariates and the PEFs. This paper addresses this analytic need by extending npLCM to a general regression modeling framework using case-control multivariate binary data to estimate disease etiology.
The proposed approach has three distinguishing features: 1) It allows analysts to specify a model for the functional dependence of the PEFs upon important covariates. And with assumptions such as additivity, we can improve estimation stability for sparsely populated strata defined by many discrete covariates. 2) The model incorporate control data for the inference of PEF curve. The posterior inferential algorithm estimates a parsimonious covariate-dependent reference distribution of the diagnostic measurements from controls. Finally, 3) the model uses informative priors of the sensitivities (TPRs) only once in a population for which these priors were elicited. Relative to a fully-stratified npLCM analysis that reuses these priors, the proposed regression analysis avoids overly-optimistic etiology uncertainty estimates.
We have shown by simulations that the regression approach accounts for population stratification by important covariates and as expected reduces estimation biases and produces credible intervals that have more valid empirical coverage rates than an npLCM analysis omitting covariates. In addition, the proposed regression analysis can readily integrate multiple sources of diagnostic measurements of distinct levels of diagnostic sensitivities and specificities, a subset of which are only available from cases (SS data), to further reduce the posterior uncertainty of the etiology estimates. Our regression analysis integrates the BrS and SS data from one PERCH site and reveals prominent dependence of the PEFs upon seasonality and a pneumonia child’s age, HIV status and disease severity.
Future work may improve the proposed methods. First, flexible and parsimonious alternatives to the additive models may capture important interaction effects (e.g., Linero,, 2018). Second, in the presence of many covariates, class-specific predictor selection methods for may provide further regularization and improve interpretability (Gustafson et al.,, 2008). Third, when the subsets of pathogens that have caused the diseases in the population is unknown, the proposed method can be combined with subset selection procedures (Wu et al.,, 2019; Gu and Xu, 2019a, ). Finally, scalable posterior inference for multinomial regression parameters (e.g., Zhang and Zhou,, 2017) will likely improve the computational speed in the presence of a large number of disease classes and covariates.
Supplementary Materials
The supplementary materials contain the technical details on prior specifications, a remark, additional simulation results and supplemental figures referenced in Main Paper.
Acknowledgment
We thank the PERCH study team led by Kathernine O’Brien for providing the data and scientific advice, Scott Zeger, Maria Deloria-Knoll, Christine Prosperi and Qiyuan Shi for insightful comments and valuable feedback about baker and Jing Chu for preliminary simulations. The research was partly supported by the Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1408-20318, ZW), NIH grants P30CA046592 (National Cancer Institute Cancer Center Support Grant Development Funds, Rogel Cancer Center; ZW and IC), U01CA229437 (ZW) and an Investigator Award from Precision Health Initiative and an MCubed Award from University of Michigan (ZW).
Appendix A Prior distributions
The unknown parameters include the regression coefficients in the etiology regression (, the parameters in the subclass weight regression for the cases () and the controls (), the true and false positive rates . To mitigate potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) Few non-trivial subclasses via a novel additive half-Cauchy prior for the intercepts , and (b) for a continuous variable, smooth regression curves , and by Bayesian Penalized-splines (P-splines, Lang and Brezger,, 2004) combined with shrinkage priors on the spline basis coefficients (Ni et al.,, 2015) to encourage towards constant values.
A.1 Subclass Weight Regression: Encourage Few Subclasses
We propose a novel prior to encourage a small number of subclasses of non-trivial weights in finite samples, or “simplex regression shrinkage prior”. We parameterize the intercepts so that a priori the higher-order subclasses are less likely to receive non-trivial weights. We let where is a pre-specified triangular array of positive values. Upon heavy-tailed priors on with positive supports, we will a priori make higher-order subclasses increasingly less likely to receive substantial weights. In this paper, we use ; Other choices such as or may be useful in other settings. We specify the prior distributions of to be heavy-tailed. In this paper we use Cauchy distribution with scale . Since our control model take a classical latent class regression model form (Bandeen-Roche et al.,, 1997) (the generic term “class” here corresponds to control “subclass” in an npLCM), the proposed prior for the subclass weight is also useful for a classical LCM regression analysis where the number of classes is unknown. Unlike a logistic stick-breaking specification without the intercepts , the proposed priors on the intercepts encourage few subclasses and well recovers the true subclass weights. Using the same data simulated in Simulation I, Section 3 of Main Paper, Figure 5 shows the proposed prior propagates into the posterior distribution and estimates 2 non-trivial subclasses from a working number of 7 subclasses.
At stick-breaking step , the prior allows taking away nearly the entire stick segment currently left. Our basic idea is to have one of close to one a posteriori by making the posterior mean of one of large. We accomplish this by designing a novel prior on the intercept where
[TABLE]
The first level has a mean-zero Gaussian distribution truncated to the positive half. At the second-level, the precision (inverse variance) is Gamma distributed with shape , and rate ; it has the interpretation of prior independent sample(s) with a mean sample variance of . Large values of help to stop stick-breaking at subclass forcing weights for ensuing subclasses , , while small values let the stick-breaking scheme continue to step . This type of prior sparsity, which we call “selective stopping” or shrinkage over a simplex uniformly over covariates, effectively encourages using a small number of subclasses to approximate the observed probability contingency table for the control measurements in finite samples.
We accomplish selective stopping by the heavy right tail of ’s marginal prior. It has a truncated scaled- distribution with degree of freedom and scale , and consequently peaks at zero and admits large positive values. Given other parameters in , a near-zero intercept takes the stick-breaking procedure to the next step, while a large positive intercept effectively halts it. The tendency to stop at step is a priori modulated by the scale parameter . Because, given the degree-of-freedom , the prior probability approaches as the scale parameter increases.
In our simulations and applications, we choose hyperparameters and for the intercept, and for the first B-spline coefficients in the prior (Equation 3, Section A.2). We have chosen our hyperparameters based on the interpretations on the probability (inverse-link) scale; see similar prior elicitations for regression coefficients in other applications (e.g., Bedrick et al.,, 1996; Witte et al.,, 1998) and for automatic, stabilized and weakly-informative fitting of generalized linear models (Gelman et al.,, 2008). We choose the hyperparameters for the intercepts that put most prior mass of within , because is sufficiently close to which means the stick-breaking is stopped at Step . In contrast, we choose the first B-spline coefficient’s hyperparameter that puts most prior mass of within , a range for the weight of a non-trivial subclass to break from the rest of the stick at Step . Figure 8 shows a sharp separation between the priors for and . The shapes of the priors again highlight the different roles played by the intercept and the B-spline coefficients: the former decides whether to continue the stick-breaking procedure to induce complex conditional dependence given covariates, and if so, the latter computes the fraction to break from the remaining length of the stick. The intercepts in the controls are shared with the case subclass weight regression ; We set the same prior distributions for other elements of , .
A.2 Encourage Smooth and
We use penalized B-splines to model the additive functions of a continuous variable in etiology regression (), subclass weight regression for the cases and the controls () (Lang and Brezger,, 2004). We expand , with being the shared cubic B-spline bases. We let , in the case subclass weight regression and in the control subclass weight regression have distinct coefficients: , and , respectively. With interior equally-spaced knots : , there are basis functions. It readily extends to let and have different numbers of basis functions.
Since the specification below applies to , and for any centered and standardized continuous variable, for simplicity we omit the superscripts and subscript .
The Penalized-splines in our formulation bypass the choice of the number and placement of knots by using a large number of knots deemed sufficient to capture the curves and imposing smoothing penalty on the coefficients for basis functions to prevent overfitting. The Gaussian random walk priors on basis coefficients are good choices for fitting Bayesian P-splines (Lang and Brezger,, 2004):
[TABLE]
where the symmetric penalty matrix is constructed from the first-order difference matrix of dimension that maps adjacent B-spline coefficients to , (diff(diag(C),differences = 1) in R language), and is the smoothing parameters with large values leading to smoother fit of (constant when ) and interpolation when near zero. This first-order random walk prior above uses a precision matrix of rank to model the adjacent differences. This leaves the prior of unspecified, for which we further assign an independent prior . We discuss the hyperparameter in the next subsection.
We use a mixture prior with two well-separated component distributions with one favoring small and the other large smoothing parameters :
[TABLE]
where the Gamma-distributed component (, ) concentrates near smaller values while the inverse-Pareto component prefers larger values (, ). This bimodal mixture distribution creates a sharp separation between flexible and smooth fits (Morrissey et al.,, 2011; Ni et al.,, 2015). Because we use the first-order random walk prior, the most smooth fit is of degree [math], i.e., constant functions. The random smoothness indicator represents a flexible () or constant ([math]) shape of . We let with success probability and then put a hyperprior to let data inform the degree of smoothness.
In this paper we use , for each set of the B-spline basis coefficients for the cases () and the controls () to a priori give slight preference for constant curves, , ; We use , for the set of basis coefficients () to a priori give slight preference for flexible etiology regression functions, , . In the presence of high-dimensional covariates, the Beta prior with other hyperparameters can also allow a prior spread that lets the fraction of constant functions to approach [math] as .
A.3 Informative Prior Distributions for TPRs and FPRs
The npLCM regression model is partially-identified (Jones et al.,, 2010). We assume independent informative priors for the TPRs in the BrS data likelihood: , , where (,) are chosen so that the and quantiles match a prior range elicited from laboratory scientists (Deloria Knoll et al.,, 2017). In the presence of SS data for a subset of pathogens (e.g., culturing bacteria from blood), we similarly set the hyperparameters for the Beta distribution of the TPRs of the SS measures where ranges can be computed from existing vaccine probe trials (e.g., Feikin et al.,, 2014). Since the control data provide direct estimates of the FPRs, we specify independent priors for .
Appendix B Remark on the Control Model with Covariates
The proposed model for the control data with covariates is a generative model where we first draw a subclass indicator , and generate measurements according to a Bernoulli distribution with positive rate , independently for . By assuming mutually independent measurements given subclass and , we let the covariates influence the dependence structure of the measurement only through the unobserved . As a result, upon integrating over , the proposed model does not assume marginal independence in contrast to a kernel-based extension of the pLCM that makes this assumption (Saha et al.,, 2018, Supplementary appendix). Our approach to incorporating covariates to model control data follows Bandeen-Roche et al., (1997); For other approaches, see examples in the study of particulate matter (Gryparis et al.,, 2007), HIV population size estimation (Bartolucci and Forcina,, 2006), and alcoholic and drug addiction (Chung et al.,, 2006).
Appendix C Convergence Checks
In simulations and data analysis, we ran three MCMC chains each with a burn-in period of iterations followed by iterations stored for posterior inference. We look for potential non-convergence in terms of Gelman-Rubin statistic (Brooks and Gelman,, 1998) that compares between-chain and within-chain variances for each model parameter where a large difference () indicates non-convergence; We also used Geweke’s diagnostic (Geweke and Zhou,, 1996) that compare the observed mean for each unknown variable using the first and the last of the stored samples where a large -score indicates non-convergence (). In our simulations and data analyses, we observed fast convergence (many satisfied convergence criteria within iterations) that led to well recovered regression curves, TPRs and FPRs.
Appendix D Additional Information about Simulations of Main Paper
Simulation I.* we let , and depend on the two covariates , and enrollment date (), so that regression adjustments are necessary (see Remark 1, Main Paper). We simulate BrS measurements on pathogens and assume the number of potential single-pathogen causes . To specify etiology regression functions that satisfy the constraint , we use stick-breaking parameterization with segments. In particular, we let , , for ; Let the PEF functions , where . The true control distribution depend on covariates with subclass weight functions: and . We specify , highlighting the need for using different subclass weights among cases and controls in an npLCM analysis. We set the true TPRs and the FPRs and .*
In the regression analyses, we set to be an additive model of a indicator and a B-spline expansion with degrees of freedom (d.f.) for standardized enrollment date . We use and specify the regression formula for subclass weights and by additive models of the indicator and a B-spline expansion with 5 d.f. for standardized enrollment date.
Simulation II.* We consider causes, under single-pathogen-cause assumption, BrS measurements made on cases and controls for each level of where or . The functions take two sets of values to reflect how variable the PEFs are across the two levels: i) and where causes have uniform PEFs when and causes B and E dominate when , or ii) and to mimic the scenario where pathogens B and E have lower PEFs when and occupy more fractions when . We further let the measurement error parameters take distinct values of the TPRs or and the FPRs , for . Finally, we set the truth where and .*
Simulation II: a randomly chosen replication.* Here we illustrate the inferences about the stratum-specific and overall PEFs that are available to an analyst by considering a two-level covariate with measurements. Under the single-pathogen cause assumption, we can estimate PEFs, six per level of as well as six overall PEFs. For example, based on a single data set simulated under the scenario , , , , , (, Supplemental Figure 6 shows the posterior distribution of the stratum-specific etiology fractions for () by row and causes by column with the true values indicated by the blue vertical dashed lines; The bottom row shows the posterior distribution of for causes with empirical weights , . The true stratum-specific and overall PEFs are covered by their respective CrIs.*
Appendix E Additional Simulation Results
E.1 Estimating
We use simulation studies to show the frequentist performance of the npLCM regression model in recovering stratum-specific PEFs; The results below are based on a single discrete covariate that influence the PEFs but not the subclass weights in the cases or controls.
In this simulation study, we simulate cases and controls for each of sites. Every subject is measured on pathogens A to F; The causes of disease are single-pathogen causes A-F. First, we let the PEFs vary by site which are shown in Table 2. Second, we simulate the data using subclass.
We simulate data under two TPR scenarios (I) strong signal with and where data are expected to provide strong information about the PEFs, and (II) weak signal with and where it is easy to confuse true and false positive results and the data do not provide strong information about the PEFs. In both scenario (I) and (II) , we used a Beta(6,2) distribution as a prior for the TPRs of the BrS measurements. We set the true TPRs and FPRs to be the same across sites and pathogens. In fitting the regression models, we use the etiology regression formulation by specifying sets of regression parameters with site dummy variables as the predictors in . Since our goal is to infer sets of PEFs, we can also specify sets of symmetric Dirichlet priors with hyperparameter (Dir()); We use here. The package baker (https://github.com/zhenkewu/baker) provides an option to use Dirichlet priors when the PEFs depend on discrete covariates only.
E.1.1 Scenario I: Strong Signal
Over replications, the top half of Table 4 summarizes the coverage rates of the credible intervals (CrIs) for the PEFs across all the sites. We observed excellent recovery of the true values across all causes and sites with the CrIs covered the true values between to of the time. Panel I of Table 4 also shows for site 1 the posterior mean PEFs, posterior standard deviations (sd’s) of the PEFs, and posterior mean squared errors (PMSEs, estimated by with retained posterior samples ) averaged over replications. The posterior means provide excellent estimation of the PEFs with small average PMSEs.
E.1.2 Scenario II: Weak Signal
Using data simulated under less discrepant TPRs and FPRs than those in Scenario I, the CrIs cover the truths well for most site-cause pairs, but undercover the truths for causes with the highest PEF in each site (see Table 3). This is expected because when the signal from the data is weak, the model relies more heavily on the uniform prior distribution for the PEFs (symmetric Dirichlet prior with hyper-parameter ).
More Informative TPR Priors (II∗). We further investigate the model performance when we change the TPR prior distributions from the Beta(6,2) to a Beta distribution that has 95% of its mass between 0.525 and 0.575 and is around the true TPRs (Beta(835.95, 683.79); beta_parms_from_quantiles(c(0.525,0.575)) using baker). Panel of Table 4 shows dramatic improvements in the coverage rates. These results suggest that changing the prior distributions of the TPRs so that it is more tightly concentrated around plausible values can improve inferences of the stratum-specific PEFs in the presence of high levels of noises. Relative to Scenario I, the average PMSEs are larger across sites and pathogens reflecting the weaker signal in this setting.
In summary, in the simulation study where the PEFs are influenced by a discrete covariate, the regression model recovers the true values well under high signals (high sensitivities and low FPRs). Under lower sensitivities and higher FPRs, the noisier simulated data are less informative about the PEFs which are then more influenced by the prior distributions of the TPRs and PEFs. In practice, we recommend eliciting quality informative TPR priors from domain scientists as in the PERCH study and perform sensitivity analyses to understand the robustness of the results with respect to the prior distributions.
E.2 Valid inference of omitting covariates
Under assumption (A1) in Remark 1 of Main Paper, the case subclass weights , , we conduct a simulation study to show that an npLCM analysis omitting covariates is able to provide valid inference about the overall PEFs (). The simulation settings are exactly the same as in Simulation II, Section 4 of Main Paper, except that we set to satisfy assumption (A1). Figure 7 shows the percent relative biases are similarly negligible in all the scenarios with disease classes; Figure 7 shows excellent empirical coverage rates of the CrIs for .
Appendix F Supplemental Figures
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bandeen-Roche et al., (1997) Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association , 92(440):1375–1386.
- 2Bartolucci and Forcina, (2006) Bartolucci, F. and Forcina, A. (2006). A class of latent marginal models for capture–recapture data with continuous covariates. Journal of the American Statistical Association , 101(474):786–794.
- 3Bedrick et al., (1996) Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association , 91(436):1450–1460.
- 4Brooks and Gelman, (1998) Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics , 7(4):434–455.
- 5Carlin and Louis, (2009) Carlin, B. and Louis, T. (2009). Bayesian methods for data analysis , volume 78. Chapman & Hall/CRC.
- 6Chung et al., (2006) Chung, H., Flaherty, B. P., and Schafer, J. L. (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society) , 169(4):723–743.
- 7Crawley et al., (2017) Crawley, J., Prosperi, C., Baggett, H. C., Brooks, W. A., Deloria Knoll, M., Hammitt, L. L., Howie, S. R., Kotloff, K. L., Levine, O. S., Madhi, S. A., et al. (2017). Standardization of clinical assessment and sample collection across all perch study sites. Clinical infectious diseases , 64(suppl_3):S 228–S 237.
- 8Deloria Knoll et al., (2017) Deloria Knoll, M., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R., Scott, J. A. G., et al. (2017). Bayesian estimation of pneumonia etiology: epidemiologic considerations and applications to the pneumonia etiology research for child health study. Clinical infectious diseases , 64(suppl_3):S 213–S 227.
