Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies

Zhenke Wu; Irena Chen

arXiv:1906.08436·stat.ME·June 21, 2019

TL;DR

This paper extends a statistical model to incorporate covariate effects in disease cause estimation from case-control studies, improving accuracy and inference of disease etiology.

Contribution

It introduces a regression framework for nested partially-latent class models that accounts for covariates and control data, enhancing disease etiology estimation.

Findings

1. Less biased estimation of population etiologic fractions (PEFs).

2. More valid inference compared to models ignoring covariates.

3. Application reveals disease etiology dependence on season, age, severity, and HIV status.

Abstract

In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as "population etiologic fractions" (PEFs). Despite recent methodological advances, the scientific need to incorporate control data when estimating the effect of explanatory variables upon the PEFs remains unmet. In this paper, we build on and extend nested partially-latent class models (npLCMs; Wu et al., 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls' diagnostic measures…

Tables

Table 1. The observed count (frequency) of cases and controls by age, disease severity and HIV status (1: yes; 0: no). The marginal fractions among cases and controls for each covariate are shown at the bottom. Results from the regression analyses are shown in Figure 3 for the first two strata.

age $\geq 1$	very severe (VS; case-only)	HIV positive	# cases (%)	# controls (%)
total			524 (100)	964 (100)
0	0	0	208 (39.7)	545 (56.5)
1	0	0	72 (13.7)	278 (28.8)
0	1	0	116 (22.1)	0
1	1	0	33 (6.3)	0
0	0	1	37 (7.1)	85 (8.8)
1	0	1	24 (4.5)	51 (5.3)
0	1	1	25 (4.8)	0
1	1	1	3 (0.6)	0
case: 25.2%	34.5%	17.0%
control: 34.3%	-	14.1%

Table 2. True PEFs for seven sites (boldfaced numbers indicate the highest PEFs within each stratum).

site\cause	A	B	C	D	E	F
1	0.5	0.2	0.15	0.05	0.05	0.05
2	0.2	0.5	0.15	0.05	0.05	0.05
3	0.2	0.15	0.5	0.05	0.05	0.05
4	0.2	0.15	0.05	0.5	0.05	0.05
5	0.2	0.15	0.05	0.05	0.5	0.05
6	0.2	0.15	0.05	0.05	0.05	0.5
7	0.05	0.2	0.15	0.5	0.05	0.05

Table 3. Number of times (out of 100 replications) that the true value is covered by the 95% CrIs (Scenario II, Beta(6,2) prior for the TPRs). Boldfaced numbers indicate the highest PEFs (0.5) within each stratum.

site\cause	A	B	C	D	E	F
1	73	100	100	99	100	100
2	100	79	100	100	100	99
3	100	100	83	98	100	100
4	100	100	100	73	100	99
5	99	100	100	100	85	100
6	100	100	100	99	100	88
7	100	100	100	81	100	99

Table 4. Scenarios I and II$^{\ast}$: coverage rates of the 95% CrIs. For Site 1, the posterior means, standard deviations (s.d.’s) and PMSE of the stratum-specific PEFs averaged over $R=100$ replications are also shown. Boldfaced numbers indicate the highest PEFs (0.5) within each stratum.

		site \cause	A	B	C	D	E	F
I	coverage	1	99	93	97	94	96	90
		2	97	90	96	97	95	94
		3	100	95	98	98	95	96
		4	93	94	96	95	92	99
		5	96	94	96	97	95	98
		6	96	97	98	99	95	96
		7	96	97	91	100	95	96
	posterior summary	truth (Site 1)	0.5	0.2	0.15	0.05	0.05	0.05
		average of post. mean	0.495	0.197	0.152	0.053	0.053	0.051
		average of post. s.d.	0.023	0.018	0.016	0.01	0.01	0.01
		average PMSE	0.0010	0.0007	0.0005	0.0002	0.0002	0.0002
II$^{\ast}$	coverage	1	98	89	98	99	100	100
		2	97	95	96	100	100	99
		3	93	98	91	99	99	100
		4	95	98	100	95	99	100
		5	94	94	99	99	91	100
		6	95	97	100	99	99	90
		7	100	95	94	96	100	99
	posterior summary	truth (Site 1)	0.5	0.2	0.15	0.05	0.05	0.05
		average post. mean	0.417	0.163	0.138	0.091	0.086	0.106
		average post. s.d.	0.27	0.174	0.162	0.135	0.13	0.141
		average PMSE	0.131	0.067	0.056	0.034	0.031	0.042

Equations

$L=L_{1}\cdot L_{0}=\left\{\prod_{i:Y_{i}=1}\sum_{\ell=1}^{L}\pi_{\ell}\cdot\bm{P}_{1\ell}(\bm{M}_{i};\bm{\Theta},\bm{\Psi},\bm{\eta})\right\}\times\prod_{i^{\prime}:Y_{i^{\prime}}=0}\bm{P}_{0}(\bm{M}_{i^{\prime}};\bm{\Psi},\bm{\nu})$

$L_{1}^{reg}$

$\mu_{k0}^{\ast}\sim N^{+}(0,\tau_{0k}^{-1}),\ \tau_{0k}\sim\text{Gamma}(a_{0},b_{0}),\ k=1,\ldots,K-1$

$\beta_{k}\mid\tau_{k},\lambda_{k}\sim N(0_{C\times 1},(\tau_{k}K)^{-1})$

$\tau_{kj}$

$\text{InvPareto}(\tau;a,b)=\frac{a}{b}\left(\frac{\tau}{b}\right)^{a-1},\ a>0,\ 0<\tau<b$

Taxonomy

Topics: Statistical Methods and Bayesian Inference · Pneumonia and Respiratory Infections · Data-Driven Disease Surveillance

Full text

Zhenke Wu

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; E-mail: [email protected].

Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI 48109, USA.

Irena Chen

Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA; E-mail: [email protected].

Abstract

In large-scale disease etiology studies, epidemiologists often need to use multiple binary measures of unobserved causes of disease that are not perfectly sensitive or specific to estimate cause-specific case fractions, referred to as “population etiologic fractions” (PEFs). Despite recent methodological advances, the scientific need to incorporate control data when estimating the effect of explanatory variables upon the PEFs remains unmet. In this paper, we build on and extend nested partially-latent class models (npLCMs; Wu et al., 2017) to a general framework for etiology regression analysis in case-control studies. Data from controls provide requisite information about measurement specificities and covariations, which is used to correctly assign cause-specific probabilities for each case given her measurements. We estimate the distribution of the controls’ diagnostic measures given the covariates via a separate regression model and a priori encourage simpler conditional dependence structures. We use Markov chain Monte Carlo for posterior inference of the PEF functions, cases’ latent classes and the overall PEFs of policy interest. We illustrate the regression analysis with simulations and show less biased estimation and more valid inference of the overall PEFs than an npLCM analysis omitting covariates. A regression analysis of data from a childhood pneumonia study site reveals the dependence of pneumonia etiology upon season, age, disease severity and HIV status.

Keywords: Bayesian methods; Case-control studies; Disease etiology; Latent class regression analysis; Measurement errors; Pneumonia; Semi-supervised learning.

1 Introduction

In epidemiologic studies of disease etiology, one important scientific goal is to assess the effect of explanatory variables upon disease etiology. Based on multiple binary non-gold-standard diagnostic measurements made on a list of putative causes with different error rates, this paper develops and demonstrates a regression analytic approach for drawing inference about the cause-specific fractions among the case population that depend on covariates. We illustrate the analytic needs raised by a study of pediatric pneumonia etiology.

Pneumonia is a clinical condition associated with infection of the lung tissue, which can be caused by more than 30 different species of microorganisms, including bacteria, viruses, mycobacteria and fungi (Scott et al., 2008). The Pneumonia Etiology Research for Child Health (PERCH) study is a seven-country case-control study of the etiology of severe and very severe pneumonia and has enrolled more than $4,000$ hospitalized children under five years of age and more than $5,000$ healthy controls (PERCH Study Group, 2019). The goal of the PERCH study is to estimate the population fractions of cases due to the pathogen causes, referred to as “population etiologic fractions” (PEFs), and to assign cause-specific probabilities to each child with pneumonia given her measurements, termed “individual etiologic fractions” (IEFs). The PERCH study also aims to understand the variation of the PEFs as a function of factors such as region, season, a child’s age, disease severity, nutrition status and human immunodeficiency virus (HIV) status.

The cause of lung infection cannot, except in rare cases, be directly observed (Hammitt et al., 2017). The PERCH study tests for the presence or absence of a list of pathogens using specimens from peripheral compartments including the blood, sputum, pleural fluid and nasopharyngeal (NP) cavity (Crawley et al., 2017). In this paper, we focus on two sources of imperfect measurements: (a) NP Polymerase Chain Reaction (NPPCR) results from cases and controls that are not perfectly sensitive or specific, referred to as “bronze-standard” (BrS) data; and (b) blood culture (BCX) results from cases only that are perfectly specific but lack sensitivity, referred to as “silver-standard” (SS) data.

Valid inference about the population and individual etiologic fractions must address three salient characteristics of the measurements. First, tests lacking sensitivity such as NPPCR and BCX may miss the true causative agent(s); without adjustment, this may lead to underestimated PEFs. Second, imperfect diagnostic specificities may result in the detection of multiple pathogens in NPPCR, which may indicate asymptomatic carriage rather than causes of pneumonia. Determining the primary causative agent(s) must therefore use statistical controls. Third, multiple specimens are tested among the cases with only a subset available from the controls. Other large-scale disease etiology studies have raised similar analytic needs and challenges of integrating multiple sources of imperfect measurements of multiple pathogens to produce an accurate understanding of etiology (e.g., Saha et al., 2018; Kotloff et al., 2013).

To address the analytic needs, Wu et al. (2016) introduced a partially-latent class model (pLCM) as an extension to classical latent class models (LCMs; Lazarsfeld, 1950; Goodman, 1974) that uses case-control data to estimate the PEFs. This prior work shows the capacity of the multivariate specimen measurements to inform the distribution of unobserved, or “latent”, health status for an individual and the population. The pLCM is a finite mixture model with $L+1$ components for multivariate binary data where a case observation is drawn from a mixture of $L$ components each representing a cause of disease, or “disease class”; controls have no infection in the lung and hence are assumed to be drawn from an observed class. The pLCM is a semi-supervised method for learning the unobserved classes, where the “label” (cause of disease) is observed for only a subset of subjects. Let $I_{i}\in\{1,\ldots,L\}$ represent case $i$’s disease class, which is categorically distributed with probabilities equal to the PEFs $\bm{\pi}=(\pi_{1},\ldots,\pi_{L})^{\top}$ in the $(L-1)$-dimensional simplex ${\mathcal{S}}_{L-1}=\{\bm{\pi}:\sum_{\ell=1}^{L}\pi_{\ell}=1,0\leq\pi_{\ell}\leq 1\}$. A case class can represent a single- or multiple-pathogen cause of pneumonia, or pathogen causes not targeted by the assays, which we refer to as “Not Specified (NoS)”. The pLCM uses a vector of $J$ response probabilities to specify the conditional distribution of the measurements in each class. The pLCM is an example of restricted LCMs (RLCMs; Wu et al., 2019), which restrict how the response probabilities differ by class to reflect the scientific knowledge that causative pathogens are more likely to appear in the upper respiratory tract of a child with pneumonia than of a healthy control. In particular, each causative pathogen is assumed to be observed with a higher probability in case class $\ell$ (sensitivity or true positive rate, TPR) than among the controls; a non-causative pathogen is observed with the same probability as in the controls (1 - specificity, or false positive rate, FPR). Under the pLCM, a higher observed marginal positive rate of pathogen $j$ among cases than controls indicates etiologic significance.

In a Bayesian framework, measurements of differing precisions can be optimally combined under a pLCM to generate stronger evidence about $\bm{\pi}$. The pLCM is partially identified (Jones et al., 2010; Gu and Xu, 2019b): there exist two sets of values of a subset of model parameters (here the TPRs) that the likelihood function alone cannot distinguish even with infinite samples. Bounds on the parameters, however, are available (e.g., Wu et al., 2016, Equation 6). Informative prior distributions for the TPRs elicited from laboratory experts or estimated from vaccine probe studies for a subset of pathogens (Feikin et al., 2014) can be readily incorporated to improve inference (Gustafson, 2015).

The pLCM assumes “local independence” (LI), which means the BrS measurements are mutually independent given the class membership. This classical assumption is central to mixture models for multivariate data, because the estimation procedures essentially find the optimal partition of observations into subgroups so that LI approximately holds in each subgroup. Deviations from LI, or “local dependence” (LD), are testable using the control BrS data and can be accounted for by an extension of the pLCM called the nested partially-latent class model (npLCM; Wu et al., 2017). In each class, the npLCM uses the classical LCM formulation that has the capacity to describe complex multivariate dependence among discrete data (Dunson and Xing, 2009). For example, it assumes the within-class correlations among NPPCR tests are induced by unobserved heterogeneity in subjects’ propensities for pathogens colonizing the nasal cavities. In particular, LD is induced in an npLCM by nesting $K$ latent subclasses within each class $\ell=0,1,\ldots,L$, where subclasses respond with distinct vectors of probabilities. In a Bayesian framework with a prior that encourages few important subclasses, the npLCM reduces the bias in estimating $\bm{\pi}$, retains estimation efficiency and offers more valid inference under substantial deviation from LI.

Extensions to incorporate covariates in an npLCM are critical for two reasons. First, covariates such as season, age, disease severity and HIV status may directly influence $\bm{\pi}$. Second, in an npLCM without covariates, the relative probability of assigning a case subject to class $\ell$ versus class $\ell^{\prime}$ depends on the FPRs (Wu et al., 2016), which are estimable using the control data. However, the FPRs may vary by covariates, which if not modeled will bias the assignment of cause-specific probabilities for each case subject. For example, pathogen A found in a case’s nasal cavity is more likely to reflect colonization than etiologic significance during seasons with high asymptomatic carriage rates, and is much more likely to indicate etiologic significance when the same pathogen rarely appears in healthy subjects.

To adapt existing no-covariate methods to discrete covariates, one may perform a fully-stratified analysis by fitting an npLCM to the case-control data in each covariate stratum. Like the pLCM, the npLCM is partially identified in each stratum (Wu et al., 2017), necessitating multiple sets of independent informative priors across multiple strata. There are two primary issues with this approach. First, sparsely-populated strata defined by many discrete covariates may lead to unstable PEF estimates. Second, it is often of policy interest to quantify the overall cause-specific disease burdens in a population. Let the overall PEFs $\bm{\pi}^{\ast}=(\pi^{\ast}_{1},\ldots,\pi^{\ast}_{L})^{\top}$ be the empirical average of the stratum-specific PEFs. Since the informative TPR priors are often elicited for a case population and rarely for each stratum, reusing independent prior distributions of the TPRs across all the strata will lead to overly-optimistic posterior uncertainty in $\bm{\pi}^{\ast}$, hampering policy decisions.

Estimating disease etiology across discrete and continuous epidemiologic factors needs new methods in a general modeling framework. In this paper, we extend the npLCM to perform regression analysis in case-control disease etiology studies that (a) incorporates controls to estimate the PEFs, (b) specifies parsimonious functional dependence of $\bm{\pi}$ upon covariates such as additivity, and (c) correctly assesses the posterior uncertainty of the PEF functions and the overall PEFs $\bm{\pi}^{\ast}$ by applying the TPR priors just once.

The rest of the paper is organized as follows. Section 2 overviews the npLCM without covariates. Section 3 builds on the npLCM and makes the regression extension. We demonstrate the estimation of disease etiology regression functions $\pi_{\ell}(\cdot)$ through simulations in Section 4; we also show superior inferential performance of the regression model in estimating the overall PEFs $\bm{\pi}^{\ast}$ relative to an analysis omitting the covariates. In Section 5, we characterize the effect of seasonality, age and HIV status upon the PEFs by applying the proposed npLCM regression model to the PERCH data. The paper concludes with a discussion.

2 Overview of npLCMs without Covariates

Let binary BrS measurements $\bm{M}_{i}=(M_{i1},\ldots,M_{iJ})^{\top}$ indicate the presence or absence of $J$ pathogens for subject $i=1,\ldots,N$. Let $Y_{i}$ indicate a case ($1$) or a control ($0$) subject. If $Y_{i}=1$, let $I_{i}\in\{1,\ldots,L\}$ represent case $i$’s unobserved disease class; otherwise, let $I_{i}=0$ because a control subject’s class is known (in PERCH, no lung infection). In this paper, we simplify the presentation of models by focusing on single-pathogen causes (hence $L=J$). The npLCM readily extends to $L>J$ for including additional pre-specified multi-pathogen and/or “Not Specified” (NoS) causes (Wu et al., 2017).

The likelihood function for an npLCM has three components: (a) PEFs or cause-specific case fractions: $\bm{\pi}=(\pi_{1},\ldots,\pi_{L})^{\top}=\{\pi_{\ell}=\mathbb{P}(I=\ell\mid Y=1),\ell=1,\ldots,L\}\in{\mathcal{S}}_{L-1}$ ; (b) $\bm{P}_{1\ell}=\{\bm{P}_{1\ell}(\bm{m})\}=\{\mathbb{P}(\bm{M}=\bm{m}\mid I=\ell,Y=1)\}$ : a table of probabilities of making $J$ binary observations $\bm{M}=\bm{m}$ in a case class $\ell\neq 0$ ; (c) $\bm{P}_{0}=\{\bm{P}_{0}(\bm{m})\}=\{\mathbb{P}(\bm{M}=\bm{m}\mid I=0,Y=0)\}$ : the same probability table but for controls. Since cases’ disease classes are unobserved, the distribution of cases’ measurements $\bm{P}_{1}=\mathbb{P}(\bm{M}\mid Y=1)$ is a finite-mixture model with weights $\bm{\pi}$ for the $L$ disease classes: $\bm{P}_{1}=\sum_{\ell=1}^{L}\pi_{\ell}\bm{P}_{1\ell}$ .

Models in this section differ by how $\bm{P}_{0}$ and $\{\bm{P}_{1\ell}\}$ are specified; regression models in Section 3 further incorporate covariates into these specifications (and into $\bm{\pi}$ as well). More specifically, the likelihood of an npLCM (Wu et al., 2017) is a product of case ($L_{1}$) and control ($L_{0}$) likelihood functions

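$L=L_{1}\cdot L_{0}=\left\{\prod_{i:Y_{i}=1}\sum_{\ell=1}^{L}\pi_{\ell}\cdot\bm{P}_{1\ell}(\bm{M}_{i};\bm{\Theta},\bm{\Psi},\bm{\eta})\right\}\times\prod_{i^{\prime}:Y_{i^{\prime}}=0}\bm{P}_{0}(\bm{M}_{i^{\prime}};\bm{\Psi},\bm{\nu}),\qquad(1)$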

where $\bm{\Theta}$ and $\bm{\Psi}$ are sensitivity and specificity parameters necessary for modeling the imperfect measurements; the remaining parameters $\bm{\nu}=(\nu_{1},\ldots,\nu_{K})^{\top}$ and $\bm{\eta}=(\eta_{1},\ldots,\eta_{K})^{\top}$, both in ${\mathcal{S}}_{K-1}$, are subclass mixing weights for the controls and the cases, respectively. Existing methods for estimating $\bm{\pi}$ in the framework of npLCM can be classified by whether or not $\bm{P}_{0}$ and $\bm{P}_{1\ell}$ assume local independence (LI), which means measurements are independent of one another given the class ($I_{i}=\ell$, $\ell=0,1,\ldots,L$). In Equation (1), LI results if and only if $\nu_{1}=\eta_{1}=1$; otherwise, $\bm{\nu}$ and $\bm{\eta}$ account for deviations from LI given a control or disease class.

PLCM. $\bm{P}_{0}(\bm{m})$ under the original pLCM (Wu et al., 2016) satisfies LI and equals a product of $J$ probabilities: $\bm{P}_{0}(\bm{m})=\prod_{j=1}^{J}\{\psi_{j}\}^{m_{j}}\{1-\psi_{j}\}^{1-m_{j}}=\Pi(\bm{m};\bm{\psi})$, where $\Pi(\bm{m};\bm{s})=\prod_{j=1}^{J}\{s_{j}\}^{m_{j}}\{1-s_{j}\}^{1-m_{j}}$ is the probability mass function for a product Bernoulli distribution given the success probabilities $\bm{s}=(s_{1},\ldots,s_{J})^{\top}$, $0\leq s_{j}\leq 1$, and the parameters $\bm{\psi}=(\psi_{1},\ldots,\psi_{J})^{\top}$ represent the positive rates absent disease, referred to as “false positive rates” (FPRs). For example, in the PERCH data, Respiratory Syncytial Virus (RSV) has a low observed FPR because of its rare appearance in controls’ NPs; other pathogens such as Rhinovirus (RHINO) have higher observed FPRs.
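As a minimal sketch of the product Bernoulli mass function $\Pi(\bm{m};\bm{s})$, the following Python snippet evaluates the probability of one control measurement pattern under LI. The function name `product_bernoulli_pmf` and the FPR values are illustrative assumptions and are not PERCH estimates.

```python
import numpy as np

def product_bernoulli_pmf(m, s):
    """Pi(m; s) = prod_j s_j^{m_j} (1 - s_j)^{1 - m_j} for a binary vector m."""
    m = np.asarray(m, dtype=float)
    s = np.asarray(s, dtype=float)
    return float(np.prod(s**m * (1.0 - s)**(1.0 - m)))

# Hypothetical FPRs for J = 3 pathogens (illustrative values only).
psi = np.array([0.05, 0.30, 0.60])

# P_0(m) for a control who tests positive on pathogens 2 and 3 only.
p_control = product_bernoulli_pmf([0, 1, 1], psi)
```

Here `p_control` is the product $(1-0.05)\times 0.30\times 0.60$; summing the function over all $2^{J}$ binary patterns returns one, as expected of a probability mass function.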

For $\bm{P}_{1\ell}(\bm{m})$, the pLCM makes a key “non-interference” assumption that disease-causing pathogen(s) are more frequently detected among cases than controls and the non-causative pathogens are observed with the same rates among cases as in controls (Wu et al., 2017). The “non-interference” assumption says that $\bm{P}_{1\ell}(\bm{m})$ in a case class $\ell\neq 0$ is a product of the probabilities of measurements made (a) on the causative pathogen $\ell$, $\mathbb{P}(M_{\ell}\mid I=\ell,Y=1,\bm{\theta})=\{\theta_{\ell}\}^{M_{\ell}}\{1-\theta_{\ell}\}^{1-M_{\ell}}$, where $\bm{\theta}=(\theta_{1},\ldots,\theta_{L})^{\top}$, and (b) on the non-causative pathogens, $\mathbb{P}(\bm{M}_{[-\ell]}\mid I=\ell,Y=1,\bm{\psi}_{[-\ell]})=\Pi(\bm{M}_{[-\ell]};\bm{\psi}_{[-\ell]})$, where $\bm{a}_{[-\ell]}$ represents all but the $\ell$-th element in a vector $\bm{a}$. The parameter $\theta_{\ell}$ is termed the “true positive rate” (TPR) and may be larger than the FPR $\psi_{\ell}$. Under the single-pathogen-cause assumption, the pLCM uses $J$ TPRs $\bm{\theta}$ for $L=J$ causes and $J$ FPRs $\bm{\psi}$.
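To illustrate how the TPRs and FPRs combine under the non-interference assumption, the sketch below computes $\bm{P}_{1\ell}(\bm{m})$ for each class and then applies Bayes rule, $\mathbb{P}(I=\ell\mid\bm{M}=\bm{m},Y=1)\propto\pi_{\ell}\bm{P}_{1\ell}(\bm{m})$, with all parameters treated as known. The function names and parameter values are hypothetical; a Bayesian analysis would infer these quantities from the posterior rather than plug them in.

```python
import numpy as np

def plcm_case_class_prob(m, ell, theta, psi):
    """P_{1 ell}(m): TPR for the causative pathogen ell, FPRs for the others.

    Classes are 0-indexed here, so ell = 0 is the first pathogen.
    """
    m = np.asarray(m, dtype=float)
    p = np.array(psi, dtype=float)   # copy of the FPRs
    p[ell] = theta[ell]              # swap in the TPR at the causative position
    return float(np.prod(p**m * (1.0 - p)**(1.0 - m)))

def individual_etiologic_fractions(m, pi, theta, psi):
    """Bayes rule with known parameters: proportional to pi_ell * P_{1 ell}(m)."""
    joint = np.array([pi[l] * plcm_case_class_prob(m, l, theta, psi)
                      for l in range(len(pi))])
    return joint / joint.sum()

# Hypothetical PEFs, TPRs and FPRs for L = J = 3 single-pathogen causes.
pi = np.array([0.5, 0.3, 0.2])
theta = np.array([0.9, 0.8, 0.7])
psi = np.array([0.05, 0.30, 0.60])

# A case positive for pathogens 1 and 3.
ief = individual_etiologic_fractions([1, 0, 1], pi, theta, psi)
```

In this example most of the probability goes to the first pathogen: it is rarely seen among controls (FPR 0.05), so its detection is strong evidence, whereas the third pathogen is detected often even absent disease (FPR 0.60).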

NPLCM. To reduce estimation bias in $\bm{\pi}$ under deviations from LI, the “nested pLCM” or npLCM extends the original pLCM to describe residual correlations among $J$ binary pathogen measurements in the controls ($I_{i}=0$) and in each case class ($I_{i}=\ell$, $\ell\neq 0$) (Wu et al., 2017). The extension is motivated by the ability of the classical LCM formulation (Lazarsfeld, 1950) to approximate any joint multivariate discrete distribution (Dunson and Xing, 2009).

For $\bm{P}_{0}(\bm{m})$ in the controls, the npLCM introduces $K$ subclasses; the original pLCM results if $K=1$. Given a subclass $k$, the probability of observing $J$ binary measurements $\bm{M}=\bm{m}$ among controls is $\bm{P}_{0}^{(k)}(\bm{m})=\mathbb{P}(\bm{M}=\bm{m}\mid Z=k,I=0,Y=0,\{\psi_{k}^{(j)}\})=\Pi(\bm{m};\bm{\Psi}_{k})$, where $Z$ denotes the subclass indicator and $\bm{\Psi}_{k}$ is the $k$-th column of a $J$ by $K$ FPR matrix $\bm{\Psi}=\{\psi_{k}^{(j)}\}$. Since we do not observe controls’ subclasses, $\bm{P}_{0}$ is a weighted average of $\bm{P}_{0}^{(k)}$ according to the subclass probabilities $\{\nu_{k}\}$: $\bm{P}_{0}=\sum_{k=1}^{K}\nu_{k}\bm{P}_{0}^{(k)}$.

For $\bm{P}_{1\ell}(\bm{m})$ in case class $\ell\neq 0$, the npLCM again introduces $K$ unobserved subclasses and assumes $\bm{P}_{1\ell}$ is a weighted average of $\bm{P}_{1\ell}^{(k)}$ according to the case subclass weights $\{\eta_{k}\}$: $\bm{P}_{1\ell}=\sum_{k=1}^{K}\eta_{k}\bm{P}_{1\ell}^{(k)}$. In particular, the npLCM assumes the probability of observing $\bm{M}$ in subclass $k$ in disease class $\ell\neq 0$, $\bm{P}_{1\ell}^{(k)}=\mathbb{P}(\bm{M}\mid Z=k,I=\ell,Y=1)$, is a product of the probabilities of making an observation (a) on the causative pathogen $\ell$: $\mathbb{P}(M_{\ell}\mid Y=1,Z=k,I=\ell,\theta_{k}^{(\ell)})=\{\theta_{k}^{(\ell)}\}^{M_{\ell}}\{1-\theta_{k}^{(\ell)}\}^{1-M_{\ell}}$ and (b) on non-causative pathogens: $\mathbb{P}(\bm{M}_{[-\ell]}\mid Y=1,Z=k,I=\ell,\bm{\Psi}_{k}^{([-\ell])})=\Pi(\bm{M}_{[-\ell]};\bm{\Psi}_{k}^{([-\ell])})=\prod_{j\neq\ell}\{\psi_{k}^{(j)}\}^{M_{j}}\{1-\psi_{k}^{(j)}\}^{1-M_{j}}$, where $\bm{\Psi}_{k}^{([-\ell])}$ is the $k$-th column of $\bm{\Psi}$ excluding the $\ell$-th row. We collect the TPRs in a $J$ by $K$ TPR matrix $\bm{\Theta}=\{\theta_{k}^{(j)}\}$. We summarize the preceding specification by $\bm{P}_{1\ell}^{(k)}=\Pi(\bm{M};\bm{p}_{k\ell}),\ell\neq 0$, where the vector $\bm{p}_{k\ell}=\{p_{k\ell}^{(j)},j=1,\ldots,J\}$ represents the positive rates for $J$ measurements in subclass $k$ of disease class $\ell$: $p_{k\ell}^{(j)}=\left\{\theta_{k}^{(j)}\right\}^{\operatorname*{\mathbb{I}}\{j=\ell\}}\cdot\left\{\psi_{k}^{(j)}\right\}^{1-\operatorname*{\mathbb{I}}\{j=\ell\}}$, which equals the TPR $\theta_{k}^{(j)}$ for a causative pathogen and the FPR $\psi_{k}^{(j)}$ otherwise. Here $\operatorname*{\mathbb{I}}\{A\}$ is an indicator function that equals $1$ if the statement $A$ is true and $0$ otherwise.

The likelihood for the npLCM results upon substituting $\bm{P}_{0}$ and $\bm{P}_{1\ell}$ above into Equation (1): $L=L_{1}\cdot L_{0}=\left(\prod_{i:Y_{i}=1}\sum_{\ell=1}^{L}\pi_{\ell}\cdot\left[\sum_{k=1}^{K}\{\eta_{k}\cdot\Pi(\bm{M}_{i};\bm{p}_{k\ell})\}\right]\right)\times\prod_{i^{\prime}:Y_{i^{\prime}}=0}\sum_{k=1}^{K}\nu_{k}\cdot\Pi(\bm{M}_{i^{\prime}};\bm{\Psi}_{k})$. Setting $\nu_{1}=\eta_{1}=1$ and $\nu_{k}=\eta_{k}=0,k\geq 2$, yields the pLCM as a special case.
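The npLCM likelihood can be evaluated directly from its definition. The sketch below, assuming single-pathogen causes ($L=J$) and using hypothetical function names and parameter values, computes $\log L=\log L_{1}+\log L_{0}$ with NumPy. It is meant only to make the double mixture over classes and subclasses concrete, not to reproduce any particular software implementation.

```python
import numpy as np

def product_bernoulli(M, s):
    """Row-wise Pi(M_i; s) for an (n, J) binary matrix M and rates s of length J."""
    return np.prod(s**M * (1.0 - s)**(1.0 - M), axis=1)

def nplcm_loglik(M_case, M_ctrl, pi, Theta, Psi, eta, nu):
    """log L = log L_1 + log L_0 for single-pathogen causes (L = J).

    Theta, Psi: (J, K) TPR and FPR matrices; eta, nu: length-K subclass weights.
    """
    J, K = Psi.shape
    # Controls: sum_k nu_k * Pi(M_i'; Psi_k)
    ctrl = sum(nu[k] * product_bernoulli(M_ctrl, Psi[:, k]) for k in range(K))
    # Cases: sum_l pi_l * sum_k eta_k * Pi(M_i; p_kl)
    case = np.zeros(M_case.shape[0])
    for l in range(J):
        for k in range(K):
            p_kl = Psi[:, k].copy()
            p_kl[l] = Theta[l, k]   # TPR for the causative pathogen, FPR otherwise
            case += pi[l] * eta[k] * product_bernoulli(M_case, p_kl)
    return float(np.log(case).sum() + np.log(ctrl).sum())

# Hypothetical data and parameters with J = 2 pathogens and K = 2 subclasses.
M_case = np.array([[1, 0], [0, 1], [1, 1]])
M_ctrl = np.array([[0, 0], [1, 0]])
pi = np.array([0.6, 0.4])
Theta = np.array([[0.9, 0.5], [0.8, 0.4]])
Psi = np.array([[0.1, 0.3], [0.2, 0.6]])

# Putting all subclass weight on the first subclass recovers the pLCM likelihood.
ll_nested = nplcm_loglik(M_case, M_ctrl, pi, Theta, Psi,
                         np.array([1.0, 0.0]), np.array([1.0, 0.0]))
ll_plcm = nplcm_loglik(M_case, M_ctrl, pi, Theta[:, :1], Psi[:, :1],
                       np.array([1.0]), np.array([1.0]))
```

The two calls at the end agree because zero-weight subclasses contribute nothing to either mixture, which is the reduction to the pLCM stated above.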

Similar to the pLCM, the FPRs $\bm{\Psi}$ in the npLCM are shared among controls and case classes over non-causative pathogens (via $\bm{p}_{k\ell}$). Different from the pLCM, the subclass mixing weights may differ between cases ($\bm{\eta}$) and controls ($\bm{\nu}$). The special case of $\eta_{k}=\nu_{k},k=1,\ldots,K$, means the covariation patterns among the non-causative pathogens in a disease class are no different from those in the controls. However, relative to controls, diseased individuals may have different strength and direction of measurement dependence in each disease class. By allowing the subclass weights to differ between the cases and the controls, the npLCM is more flexible than the pLCM in referencing cases’ measurements against controls.

3 Regression Analysis via npLCM

We extend the npLCM to perform regression analysis of data $\mathcal{D}=\{(\bm{M}_{i},Y_{i},\bm{X}_{i}Y_{i},\bm{W}_{i}),i=1,\ldots,N\}$, where $\bm{X}_{i}=(X_{i1},\ldots,X_{ip})^{\top}$ are covariates that may influence case $i$’s etiologic fractions and $\bm{W}_{i}=(W_{i1},\ldots,W_{iq})^{\top}$ is a possibly different vector of covariates that may influence the subclass weights among the controls and the cases. Let the continuous covariates comprise the first $p_{1}$ and $q_{1}$ elements of $\bm{X}_{i}$ and $\bm{W}_{i}$, respectively. A subset of $\bm{X}_{i}$ may be available from the cases only; for example, healthy controls have no disease severity information. We let $\bm{X}_{i}Y_{i}=\bm{0}_{p\times 1}$ if $Y_{i}=0$ so that all the covariates for a control subject are included in $\bm{W}_{i}$, and $\bm{X}_{i}Y_{i}=\bm{X}_{i}$ for a case subject. We let three sets of parameters in the npLCM of Equation (1) depend on the observed covariates: (a) the etiology regression function among cases, $\{\pi_{\ell}(\bm{x}),\ell\neq 0\}$, which is of primary scientific interest; (b) the conditional probability of measurements $\bm{m}$ given covariates $\bm{w}$ in case classes, $\bm{P}_{1\ell}(\bm{m};\bm{w})=[\bm{M}=\bm{m}\mid\bm{W}=\bm{w},I=\ell]$, $\ell=1,\ldots,L$; and (c) the corresponding conditional probability in the controls, $\bm{P}_{0}(\bm{m};\bm{w})=[\bm{M}=\bm{m}\mid\bm{W}=\bm{w},I=0]$. We keep the specifications for the TPRs and FPRs ($\bm{\Theta}$, $\bm{\Psi}$) as in the original npLCM.

3.1 Disease Etiology Regression

$\pi_{\ell}(\bm{X})$ is the primary target of inference. Recall that $I_{i}=\ell$ represents case $i$’s disease being caused by pathogen $\ell$. We assume this event occurs with probability $\pi_{i\ell}$ that depends upon covariates. We use a multinomial logistic regression model $\pi_{i\ell}=\pi_{\ell}(\bm{X}_{i})=\exp\{\phi_{\ell}(\bm{X}_{i})\}/\sum_{\ell^{\prime}=1}^{L}\exp\{\phi_{\ell^{\prime}}(\bm{X}_{i})\}$, $\ell=1,\ldots,L$, where $\phi_{\ell}(\bm{X}_{i})-\phi_{L}(\bm{X}_{i})$ is the log odds of case $i$ being in disease class $\ell$ relative to $L$: $\log(\pi_{i\ell}/\pi_{iL})$. Without specifying a baseline category, we treat all the disease classes symmetrically, which simplifies prior specification. We further assume additive models $\phi_{\ell}(\bm{x};\bm{\Gamma}_{\ell}^{\pi})=\sum_{j=1}^{p_{1}}f^{\pi}_{\ell j}(x_{j};\bm{\beta}^{\pi}_{\ell j})+\widetilde{\bm{x}}^{\top}\bm{\gamma}^{\pi}_{\ell}$, where $\widetilde{\bm{x}}$ is the subvector of the predictors $\bm{x}$ that enters the model for all disease classes as linear predictors and $\bm{\Gamma}_{\ell}^{\pi}=(\{\bm{\beta}^{\pi}_{\ell j}\},\bm{\gamma}_{\ell}^{\pi})$ collects all the parameters. For covariates such as enrollment date that serve as proxies for factors driven by seasonality, nonlinear functional dependence is expected. We use B-spline basis expansions to approximate $f_{\ell j}^{\pi}(\cdot)$ and use P-splines for estimating smooth functions (Lang and Brezger, 2004). Finally, we specify the distribution of case measurements $\bm{M}$ given disease class $I$ and covariates $\bm{X}$ and $\bm{W}$. We extend the case likelihood $L_{1}$ in the npLCM of Equation (1) to let the subclass weights depend on covariates $\bm{W}$: $\mathbb{P}(\bm{M}\mid\bm{W},I=\ell,Y=1)=\sum_{k=1}^{K}\eta_{k}(\bm{W})\cdot\Pi\left(\bm{M};\bm{p}_{k\ell}\right),\ell=1,\ldots,L$. Integrating over $L$ unobserved disease classes, we obtain the likelihood function for the cases that incorporates covariates $\{\bm{X}_{i},\bm{W}_{i}\}$:

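$L_{1}^{reg}=\prod_{i:Y_{i}=1}\sum_{\ell=1}^{L}\left[\pi_{\ell}(\bm{X}_{i};\bm{\Gamma}^{\pi}_{\ell})\sum_{k=1}^{K}\{\eta_{ik}\cdot\Pi(\bm{M}_{i};\bm{p}_{k\ell})\}\right],$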

where $\eta_{ik}=h_{k}(\bm{W}_{i};\bm{\Gamma}^{\eta}_{k})$ and $\bm{\Gamma}_{k}^{\eta}$ are the regression parameters; The form of $h_{k}$ is introduced in the model for controls.
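The multinomial logistic form of $\pi_{\ell}(\bm{X}_{i})$ can be sketched as a row-wise softmax over the predictors $\phi_{\ell}(\bm{X}_{i})$. The example below assumes purely linear predictors with an intercept and one binary covariate; the function name `pef_regression`, the design matrix and the coefficient values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def pef_regression(Phi):
    """Row-wise softmax: pi_l(x_i) = exp(phi_l(x_i)) / sum_l' exp(phi_l'(x_i))."""
    Phi = np.asarray(Phi, dtype=float)
    Z = Phi - Phi.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Hypothetical linear-only predictors phi_l(x) = x^T gamma_l for L = 3 disease classes.
X = np.array([[1.0, 0.0],     # intercept, binary covariate = 0
              [1.0, 1.0]])    # intercept, binary covariate = 1
Gamma = np.array([[0.5, -0.2, 0.0],
                  [-1.0, 0.3, 0.8]])   # (p, L) coefficients, illustrative values
Phi = X @ Gamma
PEF = pef_regression(Phi)     # one row of PEFs per case
```

Because no baseline category is fixed, adding the same constant to every $\phi_{\ell}$ leaves the PEFs unchanged, while differences such as $\phi_{\ell}-\phi_{L}$ remain interpretable as log odds.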

3.2 Covariate-dependent reference distribution

Data from controls provide requisite information about the specificities and covariations at distinct covariate values, necessitating adjustment in an npLCM analysis. For example, enrollment date is a proxy for season and may influence the background colonization rates and interactions of some pathogens that circulate more during winter (Obando-Pacheco et al., 2018; Nair et al., 2011). We propose a novel approach to estimating the reference distribution of measurements that may depend on covariates using control data.

The regression model for a control subject is a mixture model with covariate-dependent mixing weights $\nu_{k}(\bm{W})$ : $\mathbb{P}(\bm{M}\mid\bm{W},Y=0)=\sum_{k=1}^{K}\nu_{k}(\bm{W})\Pi(\bm{M};\bm{\Psi}_{k}),$ where FPRs $\bm{\Psi}_{k}=(\psi^{(1)}_{k},\ldots,\psi^{(J)}_{k})^{\top}$ do not depend on covariates and the vector $\bm{\nu}(\bm{W})=\left(\nu_{1}(\bm{W}),\ldots,\nu_{K}(\bm{W})\right)^{\top}$ lies in a $(K-1)$ -simplex ${\mathcal{S}}_{K-1}$ . We discuss the FPRs $\{\bm{\Psi}_{k}\}$ and the subclass weight functions $\{\nu_{k}(\bm{W})\}$ in order.

Firstly, constant FPR profiles $\{\bm{\Psi}_{k}\}$ enable coherent interpretation across individuals with different covariate values (Erosheva et al., 2007). FPR profile $k$ receives a weight of $\nu_{k}(\bm{W}_{i})$ for a control subject $i$ with covariates $\bm{W}_{i}$. The marginal FPRs in the controls, $\mathbb{P}(M_{j}=1\mid\bm{W},Y=0,\bm{\Psi})=\sum_{k=1}^{K}\nu_{k}(\bm{W})\psi_{k}^{(j)}\in[\min_{k}\psi_{k}^{(j)},\max_{k}\psi_{k}^{(j)}]$, $j=1,\ldots,J$, also depend on $\bm{W}$. Consequently, the observed marginal positive rate curve of a pathogen among the controls informs how different the FPRs $\psi_{k}^{(j)}$ are across the subclasses. For example, if the NPPCR measure of pathogen A shows strong seasonal trends among the controls, the estimated FPRs will be more variable across the subclasses, and the subclass with a high FPR will receive a larger weight during seasons with higher carriage rates in controls. The control model reduces to two special cases: covariate-independent weights $\nu_{k}(\bm{W})\equiv\nu_{k}$, $k=1,\ldots,K$, give the $\bm{P}_{0}$ in a $K$-subclass npLCM without covariates; a further single-subclass constraint ($K=1$) gives the $\bm{P}_{0}$ in the original pLCM.
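A small numerical sketch may help show why covariate-dependent weights with constant FPR profiles still produce covariate-dependent marginal FPRs. All numbers below are hypothetical: `Psi` holds two FPR profiles for two pathogens, and the rows of `nu` stand for subclass weights $\nu_{k}(\bm{W})$ evaluated at three covariate values.

```python
import numpy as np

# Hypothetical FPR profiles for J = 2 pathogens (rows) and K = 2 subclasses (columns).
Psi = np.array([[0.10, 0.60],
                [0.30, 0.35]])

# Hypothetical subclass weights nu_k(W) at three covariate values (each row sums to 1).
nu = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])

# Marginal FPR of pathogen j at each covariate value: sum_k nu_k(W) * psi_k^{(j)}.
marginal_fpr = nu @ Psi.T      # shape (3, J)
```

The first pathogen, whose FPRs differ widely across subclasses, has a marginal FPR that moves substantially with the weights, while the second pathogen, with nearly equal FPRs, barely changes. In both cases the marginal FPR stays between the smallest and largest subclass-specific FPR.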

Secondly, we parameterize the case and control subclass weight regressions $\eta_{k}(\bm{W})$ and $\nu_{k}(\bm{W})$ using the same regression form $h_{k}(\bm{W};\cdot)$ but different parameters.

Control subclass weight regression. We rewrite the subclass weights $\nu_{k}(\cdot),k=1,\ldots,K$ , using a stick-breaking parameterization. Let $g(\cdot):\mathbb{R}\mapsto[0,1]$ be a link function. Let $\alpha^{\nu}_{ik}$ be subject $i$ ’s linear predictor at stick-breaking step $k=1,\ldots,K-1$ . Using the stick-breaking analogy, we begin with a unit-length stick, break off a segment of length $g(\alpha^{\nu}_{i1})$ , then break a fraction $g(\alpha^{\nu}_{i2})$ from the remaining $\{1-g(\alpha^{\nu}_{i1})\}$ , and so on. At step $k$ , we break a fraction $g(\alpha^{\nu}_{ik})$ from what is left after the preceding $k-1$ steps, resulting in the $k$ -th stick segment of length $\nu_{ik}=g(\alpha^{\nu}_{ik})\prod_{s<k}\{1-g(\alpha^{\nu}_{is})\}$ . The procedure stops once $K$ segments of variable lengths result, the last being the remainder $\nu_{iK}=\prod_{s<K}\{1-g(\alpha^{\nu}_{is})\}$ . In this paper, we use the logistic function $g(\alpha)=1/\left\{1+\exp(-\alpha)\right\}$ , which is consistent with the multinomial logit regression for $\pi_{\ell}(\cdot)$ so that the priors of the coefficients $\bm{\Gamma}_{k}^{\nu}$ and $\bm{\Gamma}_{\ell}^{\pi}$ can be similar (Supplementary Materials A.2). Generalization to other link functions such as the probit function is straightforward (e.g., Rodriguez and Dunson, 2011). We use this parameterization to introduce a novel shrinkage prior on a simplex for the subclass weights $\{\nu_{k}(\bm{W})\}$ (see Supplementary Materials A.1) that encourages fewer than $K$ effective subclasses; we refer to it as an “ $m$ -sparse” shrinkage prior on the simplex. This provides a parsimonious approximation to the conditional distribution of control measurements $\mathbb{P}(\bm{M}\mid\bm{W},Y=0,\{\nu_{k}(\cdot)\},\bm{\Psi})$ using a few subclasses.
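The stick-breaking map from $K-1$ linear predictors to $K$ subclass weights can be sketched in a few lines. This is a minimal illustration with the logistic link; the function names `expit` and `stick_breaking_weights` are hypothetical and are not taken from the baker package.

```python
import math

def expit(a):
    """Logistic link g(a) = 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def stick_breaking_weights(alphas):
    """Map K-1 linear predictors to K weights that sum to one."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for a in alphas:
        fraction = expit(a)
        weights.append(fraction * remaining)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # the K-th segment is whatever is left
    return weights

# K = 3 subclasses with both linear predictors at zero.
w_even = stick_breaking_weights([0.0, 0.0])
# A large first linear predictor takes nearly the whole stick at step 1.
w_stopped = stick_breaking_weights([20.0, 0.0])
```

With both predictors at zero each break takes half of what remains, while a large first predictor leaves almost nothing for later subclasses, which is the mechanism the shrinkage prior on the intercepts exploits.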

In our analysis, we use generalized additive models (Hastie and Tibshirani, 1986) for the $k$ -th linear predictor $\alpha^{\nu}_{ik}=\alpha^{\nu}_{k}(\bm{W}_{i}=\bm{w};\bm{\Gamma}^{\nu}_{k})=\mu_{k0}+\sum_{j=1}^{q_{1}}f_{kj}(w_{j};\bm{\beta}^{\nu}_{kj})+\widetilde{\bm{w}}^{\top}\bm{\gamma}^{\nu}_{k},$ for $k=1,\ldots,K-1$ . We parameterize the possibly nonlinear $f_{kj}(\cdot)$ using B-spline basis expansions with coefficients $\bm{\beta}^{\nu}_{kj}$ ; $\widetilde{\bm{w}}^{\top}\bm{\gamma}^{\nu}_{k}$ are the linear effects of a subset of predictors, which can include an intercept, where $\widetilde{\bm{w}}$ is a subvector of the predictors $\bm{w}$ . Let $\bm{\Gamma}^{\nu}_{k}=\{\mu_{k0},\{\bm{\beta}^{\nu}_{kj}\},\bm{\gamma}_{k}^{\nu}\}$ collect all the regression parameters. Following Lang and Brezger (2004), we constrain $\{f_{kj},j=1,\ldots,q_{1}\}$ to have zero means for statistical identifiability. Supplementary Materials A.2 provides the technical details about the parameterization of $f_{kj}$ .

The subclass-specific intercepts $\{\mu_{k0}\}$ globally control the magnitudes of the linear predictors. We hence propose priors on $\{\mu_{k0}\}$ that a priori encourage few subclasses (see Supplementary Materials A.1). In particular, a large positive intercept $\mu_{k0}$ makes $g(\alpha^{\nu}_{ik})\approx 1$ and hence breaks off nearly the entire stick remaining after the $(k-1)$ -th break. Since the stick-breaking parameterization maps one-to-one to a classical latent class regression model formulation for the control data, the linear predictor $\alpha^{\nu}_{ik}$ and the sum $\mu_{k0}+\gamma^{\nu}_{k0}$ are identifiable except on a set of parameter values of Lebesgue measure zero, a property known as “generic identifiability” (Huang and Bandeen-Roche, 2004). Consequently, even though the intercept $\mu_{k0}$ is not statistically identified when $\widetilde{\bm{w}}$ includes an intercept $\gamma^{\nu}_{k0}$ , the MCMC samples of the statistically identifiable functions can provide valid posterior inferences (Carlin and Louis, 2009). We write the control likelihood with covariates $\bm{W}_{i}$ as $L^{\tiny\sf reg}_{0}=\prod_{i:Y_{i}=0}\sum_{k=1}^{K}h_{k}(\bm{W}_{i};\bm{\Gamma}_{k}^{\nu})\Pi(\bm{M}_{i};\bm{\Psi}_{k})$ . Supplementary Materials B provides further remarks on the assumption for introducing covariates into the control model.
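The control likelihood is a product over controls of finite mixtures, and evaluating it on the log scale is straightforward. The sketch below assumes that $\Pi(\bm{M};\bm{\Psi}_{k})$ is the product-Bernoulli probability of the binary vector $\bm{M}$ under positive rates $\bm{\Psi}_{k}$ (local independence within a subclass) and that the subject-specific weights $h_{k}(\bm{W}_{i};\cdot)$ have already been computed; the function names and the toy numbers are hypothetical.

```python
import math

def product_bernoulli(m, rates):
    """Probability of binary vector m under independent Bernoulli(rates)."""
    prob = 1.0
    for m_j, r_j in zip(m, rates):
        prob *= r_j if m_j == 1 else 1.0 - r_j
    return prob

def control_loglik(measurements, weights_per_subject, fpr_profiles):
    """log L_0: sum over controls of the log mixture over K subclasses."""
    total = 0.0
    for m, nu in zip(measurements, weights_per_subject):
        mixture = sum(nu_k * product_bernoulli(m, psi_k)
                      for nu_k, psi_k in zip(nu, fpr_profiles))
        total += math.log(mixture)
    return total

# Toy example: J = 2 measurements, K = 2 subclasses, one control subject.
fpr_profiles = [[0.1, 0.2], [0.5, 0.5]]  # hypothetical Psi_1, Psi_2
measurements = [[1, 0]]                  # M_i
weights = [[0.5, 0.5]]                   # nu_k(W_i), assumed precomputed
loglik = control_loglik(measurements, weights, fpr_profiles)
```

For data with many measurements, summing the mixture terms on the log scale with a log-sum-exp step would be a more numerically robust choice than the direct product used here.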

Case subclass weight regression. The subclass weight regression functions for cases $\{\eta_{k}(\bm{W})\}$ are also specified via a logistic stick-breaking regression as in the controls but with different parameters: $\eta_{ik}=g(\alpha^{\eta}_{ik})\prod_{s<k}\{1-g(\alpha_{is}^{\eta})\}$ , $k=1,\ldots,K-1$ . Since, given the TPRs and the FPRs, the subclass weights fully determine the joint distribution $[\bm{M}\mid\bm{W},I=\ell\neq 0]$ and hence the measurement dependence in each class, we let $\eta_{k}(\bm{w})$ and $\nu_{k}(\bm{w})$ differ between cases and controls for any $\bm{w}$ .

Let the $k$ -th linear predictor $\alpha_{ik}^{\eta}=\alpha_{k}^{\eta}(\bm{W}_{i}=\bm{w};\bm{\Gamma}_{k}^{\eta})=\mu_{k0}+\sum_{j=1}^{q_{1}}f_{kj}(w_{j};\bm{\beta}_{kj}^{\eta})+\widetilde{\bm{w}}^{\top}\bm{\gamma}_{k}^{\eta}$ , where $\bm{\Gamma}^{\eta}_{k}=\{\mu_{k0},\{\bm{\beta}_{kj}^{\eta}\},\bm{\gamma}_{k}^{\eta}\}$ are the regression parameters that differ from the control counterpart ( $\bm{\Gamma}_{k}^{\nu}$ ). In particular, we approximate $f_{kj}(\cdot),j=1,\ldots,q_{1}$ , here using the same set of B-spline basis functions as in the controls but estimate a different set of basis coefficients $\bm{\beta}_{kj}^{\eta}$ . In addition, we directly use the intercepts $\{\mu_{k0}\}$ from the control model to ensure that only subclasses important in the controls are used in the cases. For example, absent covariates $\bm{W}$ , a large and positive $\mu_{k0}$ effectively halts the stick-breaking procedure at step $k$ for the controls ( $\nu_{k+1}\approx 0$ ), and applying the same intercept $\mu_{k0}$ to the cases makes $\eta_{k+1}\approx 0$ .

Combining the case ( $L^{\tiny\sf reg}_{1}$ ) and control likelihood ( $L^{\tiny\sf reg}_{0}$ ) with covariates, we obtain the joint likelihood for the regression model $L^{\tiny\sf reg}=L^{\tiny\sf reg}_{1}\times L^{\tiny\sf reg}_{0}$ .

Remark 1.

Under an assumption (A1) that the case subclass weights are constant over covariates, $\eta_{k}(\cdot)\equiv\eta_{k}$ , $k=1,\ldots,K$ , the regression model reduces to an npLCM without covariates upon integration over the distributions of the covariates $\bm{X}$ and $\bm{W}$ . To see this, note that the case and control likelihood functions $L^{\tiny\sf reg}_{1}$ and $L^{\tiny\sf reg}_{0}$ integrate to $L_{1}^{*}=\prod_{i:Y_{i}=1}\sum_{\ell=1}^{L}\pi_{\ell}^{\ast}\sum_{k=1}^{K}\eta_{k}\Pi(\bm{M}_{i};\bm{p}_{k\ell}),$ and $L_{0}^{*}=\prod_{i:Y_{i}=0}\sum_{k=1}^{K}\nu^{*}_{k}\Pi(\bm{M}_{i};\bm{\Psi}_{k})$ , respectively. Here $\pi_{\ell}^{\ast}=\int\pi_{\ell}(\bm{X})\mathrm{d}G(\bm{X})$ and $\nu^{\ast}_{k}=\int\nu_{k}(\bm{W})\mathrm{d}H(\bm{W})$ where $G$ and $H$ are probability or empirical distributions of $\bm{X}$ and $\bm{W}$ , respectively. The mathematical equivalence enables valid inference about the overall PEFs $\bm{\pi}^{\ast}$ omitting $\bm{X}$ and $\bm{W}$ (see Supplementary Materials E.2 for an example). The no-covariate analysis becomes deficient under deviations from (A1); Section 4 provides examples.
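The integration step in Remark 1 can be written out for a single subject's likelihood contribution. The display below is a sketch that relies only on (A1) and on the fact that $\eta_{k}$ , $\bm{p}_{k\ell}$ and $\bm{\Psi}_{k}$ do not vary with covariates, so that the finite sums can be exchanged with the integrals.

```latex
\begin{aligned}
\int \sum_{\ell=1}^{L}\pi_{\ell}(\bm{X})\sum_{k=1}^{K}\eta_{k}\,\Pi(\bm{M};\bm{p}_{k\ell})\,\mathrm{d}G(\bm{X})
&=\sum_{\ell=1}^{L}\left\{\int\pi_{\ell}(\bm{X})\,\mathrm{d}G(\bm{X})\right\}\sum_{k=1}^{K}\eta_{k}\,\Pi(\bm{M};\bm{p}_{k\ell})
 =\sum_{\ell=1}^{L}\pi_{\ell}^{\ast}\sum_{k=1}^{K}\eta_{k}\,\Pi(\bm{M};\bm{p}_{k\ell}),\\
\int \sum_{k=1}^{K}\nu_{k}(\bm{W})\,\Pi(\bm{M};\bm{\Psi}_{k})\,\mathrm{d}H(\bm{W})
&=\sum_{k=1}^{K}\left\{\int\nu_{k}(\bm{W})\,\mathrm{d}H(\bm{W})\right\}\Pi(\bm{M};\bm{\Psi}_{k})
 =\sum_{k=1}^{K}\nu_{k}^{\ast}\,\Pi(\bm{M};\bm{\Psi}_{k}).
\end{aligned}
```

If $\eta_{k}$ depended on covariates, the case integrand would contain the product $\pi_{\ell}(\bm{X})\eta_{k}(\bm{W})$ , which in general does not factor into a product of marginal averages; this is where the no-covariate analysis can break down.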

3.2.1 Priors and Posterior Inference

The unknown parameters include the coefficients in the etiology regression ( $\{\bm{\Gamma}_{\ell}^{\pi}\})$ , the subclass mixing weight regression for the cases ( $\{\bm{\Gamma}_{k}^{\eta}\}$ ) and the controls ( $\{\bm{\Gamma}^{\nu}_{k}\}$ ), and the true and false positive rates $(\bm{\Theta}=\{\theta_{k}^{(j)}\},\bm{\Psi}=\{\psi_{k}^{(j)}\})$ . With typical sample sizes of about $500$ controls and $500$ cases in each study site, the number of parameters in the control likelihood $L_{0}$ ( $>JKCp$ ) easily exceeds the number of distinct binary measurement patterns observed. To overcome potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) few non-trivial subclasses, via a novel additive half-Cauchy prior for the intercepts $\{\mu_{k0}\}$ , and (b) for a continuous variable, smooth regression curves $\pi_{\ell}(\cdot)$ , $\nu_{k}(\cdot)$ and $\eta_{k}(\cdot)$ , via Bayesian penalized splines (P-splines) (Lang and Brezger, 2004) combined with shrinkage priors on the spline coefficients (Ni et al., 2015) that encourage shrinkage towards constant values, $\eta_{k}(\cdot)=\eta_{k},\nu_{k}(\cdot)=\nu_{k},k=1,\ldots,K$ , in which case the model reduces to the original npLCM. Supplementary Materials A details the prior specifications.

We use a Markov chain Monte Carlo (MCMC) algorithm to draw samples of the unknowns to approximate their joint posterior distribution (Gelfand and Smith, 1990). Flexible posterior inferences about any functions of the model parameters and individual latent variables are available by plugging in the posterior samples of the unknowns. For example, the posterior samples of the case positive rate curve for pathogen $\ell$ help evaluate model fit. The red bands in Row 1 of Figure 1 are posterior $95\%$ credible bands obtained by substituting relevant parameters with their sampled values across MCMC iterations in $\mathbb{P}(M_{\ell}=1\mid\bm{x},\bm{w},Y=1)=\pi_{\ell}(\bm{x};\bm{\Gamma}_{\ell}^{\pi})\sum_{k=1}^{K}h_{k}(\bm{w};\bm{\Gamma}_{k}^{\eta})\theta_{k}^{(\ell)}+\{1-\pi_{\ell}(\bm{x};\bm{\Gamma}_{\ell}^{\pi})\}\sum_{k=1}^{K}h_{k}(\bm{w};\bm{\Gamma}_{k}^{\eta})\psi_{k}^{(\ell)}$ . The npLCMs with or without covariates are fitted using a free and publicly available R package baker (https://github.com/zhenkewu/baker). The package calls an external automatic Bayesian model fitting software, JAGS 4.2.0 (Plummer et al., 2003), from within R and provides functions to visualize the posterior distributions of the unknowns (e.g., the PEFs and cases’ latent disease class indicators) and perform posterior predictive model checking (Gelman et al., 1996). Supplementary Materials C details the convergence diagnostics.
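The plug-in computation of the case positive rate can be sketched as follows: for each MCMC draw, evaluate the formula above at a fixed covariate value and collect the results. The function name, the argument layout and the two toy "draws" are hypothetical; in practice the inputs would be posterior samples of $\pi_{\ell}$ , the case subclass weights, and the TPRs and FPRs at the covariate value of interest.

```python
def case_positive_rate(pi_l, eta, theta_l, psi_l):
    """P(M_l = 1 | x, w, Y = 1) for one parameter draw at fixed covariates.

    pi_l    : etiologic fraction pi_l(x) of the cause matching measurement l
    eta     : case subclass weights eta_k(w), k = 1..K
    theta_l : subclass-specific TPRs theta_k^{(l)}
    psi_l   : subclass-specific FPRs psi_k^{(l)}
    """
    tpr = sum(e * t for e, t in zip(eta, theta_l))
    fpr = sum(e * p for e, p in zip(eta, psi_l))
    return pi_l * tpr + (1.0 - pi_l) * fpr

# Two hypothetical posterior draws evaluated at the same covariate value.
draws = [
    {"pi_l": 0.30, "eta": [0.5, 0.5], "theta_l": [0.9, 0.7], "psi_l": [0.1, 0.3]},
    {"pi_l": 0.40, "eta": [0.6, 0.4], "theta_l": [0.8, 0.8], "psi_l": [0.2, 0.2]},
]
rates = [case_positive_rate(**d) for d in draws]
```

Repeating this over a grid of enrollment dates and taking pointwise quantiles of the resulting draws would give a credible band of the kind described in the text.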

4 Simulations

We simulate case-control bronze-standard (BrS) measurements along with observed continuous and/or discrete covariates under multiple combinations of true model parameter values and sample sizes that mimic the motivating PERCH study. In Simulation I, we illustrate flexible statistical inferences about the PEF functions $\{\pi_{\ell}(\cdot)\}$ . In Simulation II, we focus on the overall PEFs that quantify the overall cause-specific disease burdens in a population, which are of policy interest. Let $\pi_{\ell}^{\ast}$ be an empirical average of $\pi_{\ell}(\bm{X})$ , $\ell=1,\ldots,L$ . We compare the frequentist properties of the posterior mean of $\bm{\pi}^{\ast}$ obtained from analyses with or without covariates (Little et al., 2011). Regression analyses reduce estimation bias, retain efficiency and provide more valid frequentist coverage of the $95\%$ CrIs. The relative advantage varies by the true data generating mechanism and sample sizes.

In all analyses here, we use a working number of $K^{\ast}$ subclasses, with independent Beta(7.13,1.32) TPR prior distributions whose lower and upper $2.5\%$ quantiles match 0.55 and 0.99, respectively. We specify Beta(1,1) for the identifiable FPRs. The priors for the regression coefficients follow the specifications in Supplementary Materials A.
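The quantile matching for the TPR prior can be checked numerically. The sketch below evaluates the Beta(7.13, 1.32) distribution function at 0.55 and 0.99 with a simple midpoint rule using only the standard library; the helper `beta_cdf` is a hypothetical name, and the midpoint rule is chosen for transparency rather than efficiency.

```python
import math

def beta_cdf(x, a, b, n=20000):
    """Beta(a, b) CDF at x in (0, 1) by the midpoint rule (assumes a, b > 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    h = x / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h  # midpoint of the i-th subinterval of [0, x]
        total += math.exp((a - 1.0) * math.log(t)
                          + (b - 1.0) * math.log(1.0 - t) - log_norm)
    return total * h

lower_mass = beta_cdf(0.55, 7.13, 1.32)  # should be near 0.025
upper_mass = beta_cdf(0.99, 7.13, 1.32)  # should be near 0.975
```

The same helper can be used to check other elicited Beta priors against their intended plausible ranges before running an analysis.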

Simulation I. We demonstrate that the inferential algorithm recovers the true PEF functions $\{\pi^{0}_{\ell}(\bm{X})\}$ . We simulate $N_{d}=500$ cases and $N_{u}=500$ controls for each of two levels of $S$ (a discrete covariate) and uniformly sample the subjects’ enrollment dates over a period of $300$ days. Supplementary Materials D specifies the true data generating mechanism and the regression specifications. Based on the simulated data, pathogen A has a bimodal positive rate curve mimicking the trends observed for RSV in one PERCH site; other pathogens have overall increasing positive rate curves over enrollment dates. We set the simulation parameters so that the marginal control rate may be higher than the case rate for small $t$ ’s (impossible under the more restrictive pLCM). Row 2 of Figure 1 shows that, for the $9$ causes (by column), the posterior means (thin black lines) and $95\%$ CrIs (gray bands) for the etiology regression curves ${\pi}_{\ell}(\cdot)$ are close to the simulation truths $\pi^{0}_{\ell}(\cdot)$ . Supplementary Materials E provides additional simulation results to assess the recovery of the true $\pi^{0}_{\ell}(X)$ for a discrete covariate $X$ .

Simulation II. We show that the regression model accounts for population stratification by covariates and hence reduces the bias of the posterior mean $\{\widehat{\pi}_{\ell}^{\ast}\}$ in estimating the overall PEFs ( $\bm{\pi}^{\ast}$ ) and produces more valid $95\%$ CrIs. We illustrate the advantage of the regression approach under simple scenarios with a single two-level covariate $X\in\{1,2\}$ , and we let $W=X$ . We perform npLCM regression analysis with $K^{*}=3$ for each of $R=200$ replication data sets simulated under each of $48$ scenarios detailed in Supplementary Materials D that correspond to distinct numbers of causes, sample sizes, relative sizes of PEF functions (rare versus popular etiologies), signal strengths (more discrepant TPRs and FPRs indicate stronger signals; Wu et al., 2016), and effects of $W$ on $\{\nu_{k}(W)\}$ and $\{\eta_{k}(W)\}$ .

In estimating $\pi^{*}_{\ell}$ , we evaluate the bias $\widehat{\pi^{\ast}_{\ell}}-\pi_{\ell}^{0\ast}$ , where $\pi_{\ell}^{0\ast}=N_{1}^{-1}\sum_{i:Y_{i}=1}\pi^{0}_{\ell}(\bm{X}_{i})$ is the true overall PEF, and $\widehat{\pi^{\ast}_{\ell}}=N_{1}^{-1}\sum_{i:Y_{i}=1}\widehat{\pi}_{\ell}(\bm{X}_{i})$ is an empirical average of the posterior mean PEFs at $\bm{X}_{i}$ . We also evaluate the empirical coverage rates of the $95\%$ CrIs.
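The two evaluation criteria can be computed from replicated analyses as sketched below. The function names and the toy numbers are hypothetical; `estimates` stands for the posterior means of one overall PEF across replications and `intervals` for the corresponding $95\%$ CrIs.

```python
def mean_bias(estimates, truth):
    """Average of (estimate - truth) across replications."""
    return sum(e - truth for e in estimates) / len(estimates)

def relative_bias_percent(estimates, truth):
    """Mean bias expressed as a percentage of the truth."""
    return 100.0 * mean_bias(estimates, truth) / truth

def coverage_rate(intervals, truth):
    """Fraction of (lower, upper) intervals that contain the truth."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)

# Hypothetical results from R = 4 replications with true overall PEF 0.5.
truth = 0.5
estimates = [0.48, 0.52, 0.50, 0.46]
intervals = [(0.40, 0.60), (0.45, 0.55), (0.51, 0.60), (0.30, 0.50)]

bias = mean_bias(estimates, truth)
rel_bias = relative_bias_percent(estimates, truth)
coverage = coverage_rate(intervals, truth)
```

With only a few hundred replications the empirical coverage itself carries Monte Carlo error, so small departures from the nominal level should be read with that in mind.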

The regression model incorporates covariates and performs better in estimating $\bm{\pi}^{\ast}$ than a model omitting covariates. For example, Figure 2(a) shows for $J=6$ that, relative to no-covariate npLCM analyses, regression analyses produce posterior means that on average have negligible relative biases (percent difference between the posterior mean and the truth relative to the truth) for each pathogen across simulation scenarios. As expected, we observe slight relative biases from the regression model in the bottom two rows of Figure 2(a), because the informative TPR prior Beta(7.13,1.32) has a mean value lower than the true TPR $0.95$ . A more informative prior further reduces the relative bias; see additional simulations in Supplementary Materials E on the role of informative TPR priors. Figure 2(b) shows that regression analyses also produce $95\%$ CrIs for $\pi^{\ast}_{\ell}$ that have more valid empirical coverage rates in all scenarios. Misspecified models without covariates concentrate the posterior distribution away from the true overall PEFs, resulting in large biases that dominate the posterior uncertainty of $\pi_{\ell}^{\ast}$ , which is evident from the more severe undercoverage with higher TPRs and lower FPRs (rows 3 and 4 versus rows 1 and 2, Figure 2).

5 Regression Analysis of PERCH Data

We restrict attention in this regression analysis to $494$ cases and $944$ controls from one of the PERCH study sites in the Southern Hemisphere that collected information on enrollment date ( $t$ , August 2011 to September 2013; standardized), age (dichotomized to younger or older than one year), disease severity for cases (severe or very severe), HIV status (positive or negative) and presence or absence of seven species of pathogens (five viruses and two bacteria, representing a subset of pathogens evaluated) in nasopharyngeal (NP) specimens tested with polymerase chain reaction (PCR), or NPPCR (bronze-standard, BrS). We also include in the analysis the blood culture (BCX, silver-standard, SS) results for two bacteria from cases only. Detailed analyses of the entire data are reported in PERCH Study Group (2019).

Table 1 shows the observed case and control frequencies by age, disease severity and HIV status. The two strata with the most subjects are severe pneumonia children who were HIV negative and under or above one year of age. Some low or zero cell counts preclude fitting npLCMs by stratum. Regression models with additive assumptions among the covariates can borrow information across strata and stabilize the PEF estimates. Supplemental Figure S5 shows summary statistics for the NPPCR (BrS) and BCX (SS) data including the positive rates in the cases and the controls and the conditional odds ratio (COR) contrasting the case and control rates adjusting for the presence or absence of other pathogens (NPPCR only).

For NPPCR, pathogens RSV and Haemophilus influenzae (HINF) are detected with the highest positive rates among cases: $29.3\%$ and $34.1\%$ , respectively, which are higher than the corresponding control rates ( $3.1\%$ and $21.7\%$ ). The CORs are large, $14~{}(95\%\text{CI:}~{}9.4,21.6)$ for RSV and $1.8~{}(95\%\text{CI:}~{}1.3,2.3)$ for HINF, indicating etiologic importance. Adenovirus (ADENO) also has a statistically significant COR of $1.5~{}(95\%\text{CI:}~{}1.1,2.2)$ . Human metapneumovirus type A or B (HMPV_A_B) and Parainfluenza type 1 virus (PARA_1) have larger positive and statistically significant CORs of $2.6~{}(95\%\text{CI:}~{}1.5,4.4)$ and $6.4~{}(95\%\text{CI:}~{}2.3,20.3)$ . However, the two pathogens rarely appear in cases’ nasal cavities (HMPV_A_B: $6.8\%$ , PARA_1: $2.3\%$ ), which, in light of high sensitivities $(50\sim 90)\%$ , suggests non-primary etiologic roles. For the rest of the pathogens, we observed similar case and control positive rates as shown by the statistically non-significant CORs: RHINO (case: $21.4\%$ ; control: $19.9\%$ ) and Streptococcus pneumoniae (PNEU) (case: $14.4\%$ ; control: $9.9\%$ ). Similar to Wu et al. (2017), we integrate case-only SS measurements for HINF and PNEU by using informative priors of the sensitivities (e.g., from vaccine probe studies; Feikin et al., 2014) to adjust the PEF estimates in a coherent Bayesian framework. It is expected that the rare detection of the two bacteria, $0.4\%$ for HINF and $0.2\%$ for PNEU from SS data, will lower their PEF estimates relative to the ones obtained from an NPPCR-only analysis.

We include in the regression analysis a cause “Not Specified (NoS)” to account for true pathogen causes other than the seven pathogens. We incorporate the prior knowledge about the TPRs of the NPPCR measures from laboratory experts. We set the Beta priors for sensitivities by $a_{\theta}=126.8$ and $b_{\theta}=48.3$ , so that the $2.5\%$ and $97.5\%$ quantiles match the lower and upper ranges of plausible sensitivity values of $0.5$ and $0.9$ , respectively. We specify the Beta(7.59,58.97) prior for the two TPRs of SS measurements similarly but with a lower range of $5-20\%$ . We use a working number of subclasses $K=5$ . In the etiology regression model $f^{\pi}_{\ell j}(t)$ , we use 7 d.f. for B-spline expansion of the additive function for the standardized enrollment date $t$ at uniform knots along with three binary indicators for age older than one, very severe pneumonia, HIV positive; In the subclass weight regression model $h_{k}(\bm{W};\cdot)$ , we use 5 d.f. for the standardized enrollment date $t$ with uniform knots and two indicators for age older than one and HIV positive. The prior distributions for the etiology and subclass weight regression parameters follow the specification in Supplementary Materials A.

The regression analysis produces seasonal estimates of the PEF function for each cause that vary in trend and magnitude among the eight strata defined by age, disease severity and HIV status. Figure 3 shows, for two age-HIV-severity strata, the posterior mean curve and $95\%$ pointwise credible bands of the etiology regression functions $\pi_{\ell}(t,{\sf age},{\sf severity},{\sf HIV})$ as a function of $t$ . For example, among the younger, HIV negative and severe pneumonia children (Figure 3(a)), the PEF curve of RSV is estimated to have a prominent bimodal temporal pattern that peaked in two consecutive winters in the Southern Hemisphere (June 2012 and 2013). Other single-pathogen causes HINF, PNEU, ADENO, HMPV_A_B and PARA_1 have overall low and stable PEF curves across seasons. The estimated PEF curve of NoS shows a trend with a higher level of uncertainty that is complementary to RSV because, given any enrollment date, the PEFs of all the causes sum to one. In contrast, Figure 3(b) shows a lower degree of seasonal variation of the RSV PEF curve among the older, HIV negative and severe pneumonia children.

The regression model accounts for stratification of etiology by the observed covariates and assigns different cause-specific probabilities to two cases who have identical measurements but different covariate values. Supplemental Figure S6 shows that, for two cases with all negative NPPCR results (the most frequent pattern among cases), the older case has a lower posterior probability of her disease being caused by RSV and a higher probability of it being caused by NoS. Indeed, contrasting older and younger children while holding the enrollment date, HIV status and severity constant, the estimated difference in the log odds (i.e., log odds ratio) of a child’s disease being caused by RSV versus NoS is negative: $-1.82~{}(95\%\text{ CrI}:-2.99,-0.77)$ .

Given age, severity and HIV status, we quantify the overall cause-specific disease burdens $\bm{\pi}^{\ast}$ by averaging the PEF function estimates by the empirical distribution of the enrollment dates. Contrasting the results in the two age-severity-HIV strata in Figure 3(a) and 3(b), since the case positive rate of RSV among the older children reduces from $39.3\%$ to $17.9\%$ but the control positive rates remain similar (from $3.0\%$ to $4.1\%$ ), the overall PEF of RSV ( $\pi^{\ast}_{\tiny\sf RSV}$ ) decreases from $47.7~{}(95\%~{}\text{CrI}:37.6,61.5)\%$ to $17.3~{}(95\%~{}\text{CrI}:8.0,29.1)\%$ , and the model attributes a higher total fraction of cases to NoS ( $\pi^{\ast}_{\tiny\sf NoS}$ ), from $37.6~{}(95\%~{}\text{CrI}:20.3,51.9)\%$ to $56.1~{}(95\%~{}\text{CrI}:29.5,79.3)\%$ . The overall PEFs for other causes remain similar.

6 Discussion

In disease etiology studies where gold-standard data are infeasible to obtain, epidemiologists need to integrate multiple sources of data of distinct quality to draw inference about the population and individual etiologic fractions. While the existing methods based on npLCM account for imperfect diagnostic sensitivities and specificities, complex measurement dependence and missingness, they do not describe the relationship between covariates and the PEFs. This paper addresses this analytic need by extending npLCM to a general regression modeling framework using case-control multivariate binary data to estimate disease etiology.

The proposed approach has three distinguishing features: 1) It allows analysts to specify a model for the functional dependence of the PEFs upon important covariates. With assumptions such as additivity, we can improve estimation stability for sparsely populated strata defined by many discrete covariates. 2) The model incorporates control data for the inference of the PEF curves. The posterior inferential algorithm estimates a parsimonious covariate-dependent reference distribution of the diagnostic measurements from controls. Finally, 3) the model uses informative priors of the sensitivities (TPRs) only once in a population for which these priors were elicited. Relative to a fully-stratified npLCM analysis that reuses these priors, the proposed regression analysis avoids overly-optimistic etiology uncertainty estimates.

We have shown by simulations that the regression approach accounts for population stratification by important covariates and as expected reduces estimation biases and produces $95\%$ credible intervals that have more valid empirical coverage rates than an npLCM analysis omitting covariates. In addition, the proposed regression analysis can readily integrate multiple sources of diagnostic measurements of distinct levels of diagnostic sensitivities and specificities, a subset of which are only available from cases (SS data), to further reduce the posterior uncertainty of the etiology estimates. Our regression analysis integrates the BrS and SS data from one PERCH site and reveals prominent dependence of the PEFs upon seasonality and a pneumonia child’s age, HIV status and disease severity.

Future work may improve the proposed methods. First, flexible and parsimonious alternatives to the additive models may capture important interaction effects (e.g., Linero, 2018). Second, in the presence of many covariates, class-specific predictor selection methods for $\pi_{\ell}(\bm{X}_{i})$ may provide further regularization and improve interpretability (Gustafson et al., 2008). Third, when the subsets of pathogens that have caused the diseases in the population are unknown, the proposed method can be combined with subset selection procedures (Wu et al., 2019; Gu and Xu, 2019a). Finally, scalable posterior inference for multinomial regression parameters (e.g., Zhang and Zhou, 2017) will likely improve the computational speed in the presence of a large number of disease classes and covariates.

Supplementary Materials

The supplementary materials contain the technical details on prior specifications, a remark, additional simulation results and supplemental figures referenced in the Main Paper.

Acknowledgment

We thank the PERCH study team led by Katherine O’Brien for providing the data and scientific advice, Scott Zeger, Maria Deloria-Knoll, Christine Prosperi and Qiyuan Shi for insightful comments and valuable feedback about baker, and Jing Chu for preliminary simulations. The research was partly supported by the Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1408-20318, ZW), NIH grants P30CA046592 (National Cancer Institute Cancer Center Support Grant Development Funds, Rogel Cancer Center; ZW and IC), U01CA229437 (ZW) and an Investigator Award from Precision Health Initiative and an MCubed Award from University of Michigan (ZW).

Appendix A Prior distributions

The unknown parameters include the regression coefficients in the etiology regression ( $\{\bm{\Gamma}_{\ell}^{\pi}\})$ , the parameters in the subclass weight regression for the cases ( $\{\bm{\Gamma}_{k}^{\eta}\}$ ) and the controls ( $\{\bm{\Gamma}^{\nu}_{k}\}$ ), and the true and false positive rates $(\bm{\Theta}=\{\theta_{k}^{(j)}\},\bm{\Psi}=\{\psi_{k}^{(j)}\})$ . To mitigate potential overfitting and increase model interpretability, we a priori place substantial probabilities on models with the following two features: (a) few non-trivial subclasses, via a novel additive half-Cauchy prior for the intercepts $\{\mu_{k0}\}$ , and (b) for a continuous variable, smooth regression curves $\pi_{\ell}(\cdot)$ , $\nu_{k}(\cdot)$ and $\eta_{k}(\cdot)$ , via Bayesian penalized splines (P-splines; Lang and Brezger, 2004) combined with shrinkage priors on the spline basis coefficients (Ni et al., 2015) that encourage shrinkage towards constant values.

A.1 Subclass Weight Regression: Encourage Few Subclasses

We propose a novel prior to encourage a small number of subclasses of non-trivial weights in finite samples, which we call a “simplex regression shrinkage prior”. We parameterize the intercepts $\{\mu_{k0}\}$ so that a priori the higher-order subclasses are less likely to receive non-trivial weights. We let $\mu_{k0}=\sum_{j=1}^{k}u_{kj}\mu_{j0}^{\ast}$ where $u_{kj},1\leq j\leq k\leq K-1$ is a pre-specified triangular array of positive values. With heavy-tailed priors on $\mu_{j0}^{\ast}$ with positive supports, we a priori make higher-order subclasses increasingly less likely to receive substantial weights. In this paper, we use $u_{kj}=1,j=1,\ldots,k$ ; other choices such as $u_{kj}=\operatorname*{\mathbb{I}}\{k=j\}$ or $u_{kj}=1/k$ may be useful in other settings. We specify the prior distributions of $\mu_{k0}^{\ast}$ to be heavy-tailed. In this paper we use a Cauchy distribution with scale $s_{0}=10$ . Since our control model takes a classical latent class regression model form (Bandeen-Roche et al., 1997) (the generic term “class” here corresponds to control “subclass” in an npLCM), the proposed prior for the subclass weight $\nu_{k}(\bm{W})=h_{k}(\bm{W};\bm{\Gamma}^{\nu}_{k}),k=1,\ldots,K-1$ is also useful for a classical LCM regression analysis where the number of classes is unknown. Unlike a logistic stick-breaking specification $h_{k}(\bm{W};\cdot)$ without the intercepts $\{\mu_{k0}\}$ , the proposed priors on the intercepts $\{\mu_{k0}\}$ encourage few subclasses and recover the true subclass weights well. Using the same data simulated in Simulation I, Section 4 of the Main Paper, Figure 5 shows that the proposed prior propagates into the posterior distribution and estimates 2 non-trivial subclasses from a working number of 7 subclasses.

At stick-breaking step $k$ , the prior allows taking away nearly the entire stick segment currently left. Our basic idea is to have one of $\{g(\alpha_{ik})\}_{k=1}^{K-1}$ close to one a posteriori by making the posterior mean of one of $\{\alpha_{ik}\}_{k=1}^{K-1}$ large. We accomplish this by designing a novel prior on the intercept $\mu_{k0}=\sum_{j=1}^{k}u_{kj}\mu_{j0}^{\ast}$ where

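$\mu_{k0}^{\ast}\mid\tau_{0k}\sim N^{+}(0,\tau_{0k}^{-1}),\quad\tau_{0k}\sim{\sf Gamma}(a_{0},b_{0}),\quad a_{0}=\nu/2,\quad b_{0}=\nu s_{0}^{2}/2,\quad k=1,\ldots,K-1.$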

The first level has a mean-zero Gaussian distribution truncated to the positive half. At the second level, the precision (inverse variance) is Gamma distributed with shape $a_{0}=\nu/2$ and rate $b_{0}=\nu s_{0}^{2}/2$ ; it has the interpretation of $\nu$ prior independent sample(s) with a mean sample variance of $s_{0}^{2}$ . Large values of $\tau^{-1}_{0k}$ help to stop stick-breaking at subclass $k$ , forcing weights for ensuing subclasses $\nu_{k^{\prime}}\approx 0$ , $k^{\prime}>k$ , while small values let the stick-breaking scheme continue to step $k+1$ . This type of prior sparsity, which we call “selective stopping” or shrinkage over a simplex ${\mathcal{S}}_{K-1}$ uniformly over covariates, effectively encourages using a small number of subclasses to approximate the observed $2^{J}$ probability contingency table for the control measurements in finite samples.

We accomplish selective stopping by the heavy right tail of $\mu^{\ast}_{k0}$ ’s marginal prior. It has a truncated scaled- $t$ distribution with degrees of freedom $\nu$ and scale $s_{0}$ , and consequently peaks at zero and admits large positive values. Given other parameters in $\alpha^{\nu}_{ik}=\alpha^{\nu}_{k}(\bm{W}_{i};\bm{\Gamma}_{k}^{\nu})$ , a near-zero intercept takes the stick-breaking procedure to the next step, while a large positive intercept effectively halts it. The tendency to stop at step $k$ is a priori modulated by the scale parameter $s_{0}$ because, given the degrees of freedom $\nu$ , the prior probability $P(g(\alpha_{1k})>C\mid\nu,s_{0}),~{}\forall C\in(0.5,1)$ approaches $1$ as the scale parameter $s_{0}$ increases.
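A prior draw of the intercepts under the two-level hierarchy described above can be simulated with the standard library. The sketch below assumes $u_{kj}=1$ (cumulative sums of the positive increments) and uses the hyperparameter names $\nu$ and $s_{0}$ from the text; the function name `sample_intercepts` and the seed are hypothetical. Note that `random.gammavariate` takes a scale rather than a rate.

```python
import math
import random

def sample_intercepts(K, nu=1.0, s0=10.0, rng=random):
    """Draw mu_{k0}, k = 1..K-1, as running sums of half-t increments."""
    shape, rate = nu / 2.0, nu * s0 ** 2 / 2.0
    intercepts, running = [], 0.0
    for _ in range(K - 1):
        # Second level: precision ~ Gamma(shape, rate); convert rate to scale.
        precision = max(rng.gammavariate(shape, 1.0 / rate), 1e-300)
        # First level: mean-zero Gaussian truncated to the positive half.
        increment = abs(rng.gauss(0.0, 1.0 / math.sqrt(precision)))
        running += increment  # u_{kj} = 1 gives an additive construction
        intercepts.append(running)
    return intercepts

rng = random.Random(2019)  # fixed seed so the draw is reproducible
mu = sample_intercepts(K=5, rng=rng)
```

Because the increments are positive, later intercepts are never smaller than earlier ones, so the chance of breaking off nearly all of the remaining stick grows with the step index.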

In our simulations and applications, we choose hyperparameters $\nu=1$ and $s_{0}=10$ for the intercept, and $k_{\beta}=4$ for the first B-spline coefficients $\bm{\beta}_{kj}^{(1),\nu}$ in the prior (Equation 3, Section A.2). We have chosen our hyperparameters based on the interpretations on the probability (inverse-link) scale; see similar prior elicitations for regression coefficients in other applications (e.g., Bedrick et al., 1996; Witte et al., 1998) and for automatic, stabilized and weakly-informative fitting of generalized linear models (Gelman et al., 2008). We choose the hyperparameters for the intercepts that put most prior mass of $g(\mu_{10})$ within $(0.5,1-10^{-9})$ , because $1-10^{-9}$ is sufficiently close to $1$ , which means the stick-breaking is stopped at Step $k=1$ . In contrast, we choose the first B-spline coefficient’s hyperparameter $k_{\beta}=4$ that puts most prior mass of $g(\beta_{kj}^{(1),\nu})$ within $(0.02,0.98)$ , a range for the weight of a non-trivial subclass to break from the rest of the stick at Step $k$ . Figure 8 shows a sharp separation between the priors for $g(\mu^{*}_{k0})$ and $g(\beta_{kj}^{(1),\nu})$ . The shapes of the priors again highlight the different roles played by the intercept and the B-spline coefficients: the former decides whether to continue the stick-breaking procedure to induce complex conditional dependence given covariates, and if so, the latter computes the fraction to break from the remaining length of the stick. The intercepts in the controls $\{\mu_{k0}\}$ are shared with the case subclass weight regression $\eta_{k}(\bm{W})=h_{k}(\bm{W};\bm{\Gamma}^{\eta}_{k})$ . We set the same prior distributions for other elements of $\bm{\Gamma}^{\eta}_{k}$ , $k=1,\ldots,K-1$ .

A.2 Encourage Smooth $f^{\pi}_{kj}$ and $f_{kj}$

We use penalized B-splines (Lang and Brezger, 2004) to model the additive functions of a continuous variable in the etiology regression ( $f^{\pi}_{kj}$ ) and in the subclass weight regressions for the cases and the controls ( $f_{kj}$ ). We expand $f^{\bullet}_{kj}(\cdot)=\sum_{c=1}^{C}\beta_{kj}^{(c)}B_{j}^{(c)}(\cdot)$ , with $\{B_{j}^{(c)}(\cdot):c=1,\ldots,C\}$ being the shared $C$ cubic B-spline bases. We let $f^{\pi}_{\ell j}$ , $f_{kj}$ in the case subclass weight regression and $f_{kj}$ in the control subclass weight regression have distinct coefficients: $\{\beta_{kj}^{(c),\pi},k=1,\ldots,L\}$ , $\{\beta_{kj}^{(c),\eta},k=1,\ldots,K-1\}$ and $\{\beta_{kj}^{(c),\nu},k=1,\ldots,K-1\}$ , respectively. With $M$ equally-spaced interior knots and the two boundary knots, $\bm{\kappa}=(\kappa_{0},\ldots,\kappa_{M+1})^{\top}$ : $\min_{i}(x_{ij})=\kappa_{0}<\kappa_{1}<\cdots<\kappa_{M}<\kappa_{M+1}=\max_{i}(x_{ij})$ , there are $C=M+4$ basis functions. The formulation readily extends to let $f^{\pi}_{kj}$ and $f_{kj}$ have different numbers of basis functions.

Since the specification below applies to $f^{\pi}_{kj}(x;\bm{\beta}_{kj}^{\pi})$ , $f_{kj}(w;\bm{\beta}_{kj}^{\nu})$ and $f_{kj}(w;\bm{\beta}_{kj}^{\eta})$ for any centered and standardized continuous variable, for simplicity we omit the superscripts $\pi,\nu,\eta$ and subscript $j$ .

The penalized splines (P-splines) in our formulation bypass the choice of the number and placement of the knots $\bm{\kappa}$ by using a number of knots large enough to capture the curves and imposing a smoothing penalty on the basis coefficients to prevent overfitting. Gaussian random walk priors on the basis coefficients are good choices for fitting Bayesian P-splines (Lang and Brezger, 2004):

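$$p(\bm{\beta}_{k}\mid\tau_{k})\propto\exp\left(-\frac{\tau_{k}}{2}\bm{\beta}_{k}^{\top}\bm{K}\bm{\beta}_{k}\right),\quad\bm{\beta}_{k}=(\beta_{k}^{(1)},\ldots,\beta_{k}^{(C)})^{\top},$$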

where the symmetric penalty matrix $\bm{K}=\Delta_{1}^{\top}\Delta_{1}$ is constructed from the first-order difference matrix $\Delta_{1}$ of dimension $(C-1)\times C$ that maps adjacent B-spline coefficients to $\beta_{k}^{(c)}-\beta_{k}^{(c-1)}$ , $c=2,\ldots,C$ (diff(diag(C), differences = 1) in R), and $\tau_{k}$ is the smoothing parameter: large values lead to a smoother fit of $f_{k}(x)$ (constant when $\tau_{k}=\infty$ ), and values near zero lead to interpolation. The first-order random walk prior above uses a precision matrix $\bm{K}$ of rank $C-1$ to model the adjacent differences. This leaves the prior of the first coefficient $\beta_{k}^{(1)}$ unspecified, so we further assign it an independent prior $\beta_{k}^{(1)}\sim N(0,k_{\beta}^{-1})$ . The choice of the hyperparameter $k_{\beta}$ is discussed in Section A.1.
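As a small illustration of the penalty construction, the following sketch builds the first-order difference matrix and $\bm{K}=\Delta_{1}^{\top}\Delta_{1}$ in plain Python (mirroring the R call quoted above) and checks that the quadratic form $\bm{\beta}^{\top}\bm{K}\bm{\beta}$ equals the sum of squared adjacent differences. The function names and the example coefficient vector are hypothetical and chosen only for illustration.

```python
def first_difference_matrix(C):
    """(C-1) x C matrix whose row r maps beta to beta[r + 1] - beta[r]."""
    D = [[0.0] * C for _ in range(C - 1)]
    for r in range(C - 1):
        D[r][r] = -1.0
        D[r][r + 1] = 1.0
    return D


def penalty_matrix(C):
    """K = D^T D, the C x C first-order random walk penalty matrix."""
    D = first_difference_matrix(C)
    return [[sum(D[r][i] * D[r][j] for r in range(C - 1)) for j in range(C)]
            for i in range(C)]


def quadratic_form(K, beta):
    """beta^T K beta for a square matrix K given as nested lists."""
    n = len(beta)
    return sum(beta[i] * K[i][j] * beta[j] for i in range(n) for j in range(n))


C = 6
K = penalty_matrix(C)
beta = [0.3, 0.1, -0.2, 0.4, 0.4, 1.0]  # arbitrary illustrative coefficients
penalty = quadratic_form(K, beta)
sum_sq_diffs = sum((beta[c] - beta[c - 1]) ** 2 for c in range(1, C))
print(penalty, sum_sq_diffs)
```

A constant coefficient vector receives zero penalty, which reflects the rank deficiency of $\bm{K}$ and is why the first coefficient needs its own prior.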

We use a mixture prior for the smoothing parameters $\tau_{kj}$ with two well-separated components, one favoring small and the other favoring large values:

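$$\tau_{kj}\mid\xi_{kj}\sim\xi_{kj}\,{\sf Gamma}(a_{\tau},b_{\tau})+(1-\xi_{kj})\,{\sf InvPareto}(a^{\prime}_{\tau},b^{\prime}_{\tau}),$$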

where the Gamma-distributed component ( $a_{\tau}=3$ , $b_{\tau}=2$ ) concentrates near smaller values while the inverse-Pareto component prefers larger values ( $a^{\prime}_{\tau}=1.5$ , $b^{\prime}_{\tau}=400$ ). This bimodal mixture distribution creates a sharp separation between flexible and smooth fits (Morrissey et al., 2011; Ni et al., 2015). Because we use the first-order random walk prior, the smoothest fit is of degree $0$ , i.e., a constant function. The random smoothness indicator $\xi_{kj}$ represents a flexible ( $1$ ) or constant ( $0$ ) shape of $f_{k}(\cdot)$ . We let $\xi_{kj}\sim{\sf Bernoulli}(\rho)$ with success probability $\rho$ and then put a hyperprior $\rho\sim{\sf Beta}(a_{\rho},b_{\rho})$ to let the data inform the degree of smoothness.

In this paper we use $a_{\rho}=0.5$ , $b_{\rho}=1$ for each set of the B-spline basis coefficients for the cases ( $\{\bm{\beta}^{(c),\eta}_{kj},c=1,\ldots,C\}$ ) and the controls ( $\{\bm{\beta}^{(c),\nu}_{kj},c=1,\ldots,C\}$ ) to give a slight a priori preference for constant curves, $k=1,\ldots,K-1$ , $j=1,\ldots,q_{1}$ ; we use $a^{\pi}_{\rho}=1$ , $b^{\pi}_{\rho}=0.5$ for the set of basis coefficients ( $\{\bm{\beta}^{(c),\pi}_{\ell j},c=1,\ldots,C\}$ ) to give a slight a priori preference for flexible etiology regression functions, $\ell=1,\ldots,L$ , $j=1,\ldots,p_{1}$ . In the presence of high-dimensional covariates, the Beta prior with other hyperparameters can also allow a prior spread that lets the fraction of constant functions $\rho=\rho_{p}$ approach [math] as $p\rightarrow\infty$ .

A.3 Informative Prior Distributions for TPRs and FPRs

The npLCM regression model is partially identified (Jones et al., 2010). We assume independent informative priors for the TPRs in the BrS data likelihood: $\theta_{k}^{(j)}\sim{\sf Beta}(a^{\sf BrS}_{j},b^{\sf BrS}_{j})$ , $j=1,\ldots,J$ , where ( $a^{\sf BrS}_{j}$ , $b^{\sf BrS}_{j}$ ) are chosen so that the $2.5\%$ and $97.5\%$ quantiles match a prior range elicited from laboratory scientists (Deloria Knoll et al., 2017). In the presence of SS data for a subset of pathogens (e.g., culturing bacteria from blood), we similarly set the hyperparameters for the Beta distribution of the TPRs of the SS measures, where ranges can be computed from existing vaccine probe trials (e.g., Feikin et al., 2014). Since the control data provide direct estimates of the FPRs, we specify independent priors $\psi_{k}^{(j)}\sim{\sf Beta}(1,1)$ , $j=1,\ldots,J$ , $k=1,\ldots,K$ .

Appendix B Remark on the Control Model with Covariates

The proposed model for the control data with covariates $\bm{W}$ is a generative model in which we first draw a subclass indicator $Z\mid\bm{W}\sim{\sf Categorical}_{K}\{\bm{\nu}(\bm{W})\}$ , and then generate measurements $M_{j}\mid Z=k$ according to a Bernoulli distribution with positive rate $\psi_{k}^{(j)}$ , independently for $j=1,\ldots,J$ . By assuming mutually independent measurements $M_{1},\ldots,M_{J}$ given subclass $Z$ and $Y=0$ , we let the covariates influence the dependence structure of the measurements only through the unobserved $Z$ . As a result, upon integrating over $Z$ , the proposed model does not assume the marginal independence $\mathbb{P}(\bm{M}\mid\bm{W},Y=0)=\prod_{j=1}^{J}\mathbb{P}(M_{j}\mid\bm{W},Y=0)$ , in contrast to a kernel-based extension of the pLCM that makes this assumption (Saha et al., 2018, Supplementary appendix). Our approach to incorporating covariates to model control data follows Bandeen-Roche et al. (1997); for other approaches, see examples in the study of particulate matter (Gryparis et al., 2007), HIV population size estimation (Bartolucci and Forcina, 2006), and alcohol and drug addiction (Chung et al., 2006).
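The absence of marginal independence can be seen in a two-measurement toy example. The sketch below computes the joint and marginal probabilities implied by the generative model at a fixed covariate value. The subclass weights are hypothetical, and the FPR values $(0.5, 0.05)$ are borrowed from the simulation settings purely for illustration; the function names are not part of any package.

```python
def joint_prob(m, nu, psi):
    """P(M = m | W, Y = 0): mixture over subclasses of independent Bernoullis.

    nu[k] is the weight of subclass k at the given covariate value, and
    psi[k][j] is the positive rate of measurement j in subclass k.
    """
    total = 0.0
    for k, weight in enumerate(nu):
        lik = 1.0
        for j, m_j in enumerate(m):
            lik *= psi[k][j] if m_j == 1 else 1.0 - psi[k][j]
        total += weight * lik
    return total


def marginal_prob(j, nu, psi):
    """P(M_j = 1 | W, Y = 0) after integrating over the subclass indicator."""
    return sum(weight * psi[k][j] for k, weight in enumerate(nu))


nu = [0.5, 0.5]                     # hypothetical subclass weights nu(W)
psi = [[0.5, 0.5], [0.05, 0.05]]    # rows: subclasses; columns: measurements

p_both = joint_prob((1, 1), nu, psi)
p_product = marginal_prob(0, nu, psi) * marginal_prob(1, nu, psi)
print(p_both, p_product)  # the two differ, so M_1 and M_2 are marginally dependent
```

With a single subclass (for example `nu = [1.0, 0.0]`), the joint probability factorizes and the model reduces to marginal independence.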

Appendix C Convergence Checks

In simulations and data analyses, we ran three MCMC chains, each with a burn-in period of $10,000$ iterations followed by $10,000$ iterations stored for posterior inference. We checked for potential non-convergence using the Gelman-Rubin statistic (Brooks and Gelman, 1998), which compares between-chain and within-chain variances for each model parameter, where a large difference ( $R_{c}>1.1$ ) indicates non-convergence. We also used Geweke’s diagnostic (Geweke and Zhou, 1996), which compares the observed means of each unknown variable computed from the first $10\%$ and the last $50\%$ of the stored samples, where a large $Z$ -score ( $|Z|>2$ ) indicates non-convergence. In our simulations and data analyses, we observed fast convergence (many parameters satisfied the convergence criteria within $2,000$ iterations) that led to well recovered regression curves, TPRs and FPRs.
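For readers unfamiliar with the first diagnostic, the following is a minimal sketch of the basic potential scale reduction factor for one scalar parameter. It is an assumption-laden simplification: it omits the degrees-of-freedom correction that distinguishes the corrected statistic $R_{c}$ of Brooks and Gelman (1998), and it uses synthetic chains rather than output from the model in this paper. The function name is hypothetical.

```python
import random
import statistics


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one parameter.

    Compares the pooled variance estimate with the average within-chain
    variance; values near 1 suggest the chains agree with each other.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


random.seed(1)
# Three synthetic chains targeting the same distribution.
mixed = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# The same chains shifted apart, mimicking chains stuck in different regions.
stuck = [[x + 3.0 * i for x in chain] for i, chain in enumerate(mixed)]

print(potential_scale_reduction(mixed))
print(potential_scale_reduction(stuck))
```

In this toy setting the well-mixed chains fall below the $1.1$ threshold while the shifted chains exceed it.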

Appendix D Additional Information about Simulations of Main Paper

Simulation I. We let $\pi_{\ell}(\cdot)$ , $\nu_{k}(\cdot)$ and $\eta_{k}(\cdot)$ depend on the two covariates $\bm{X}=\bm{W}=(S,T)$ , a discrete covariate $S$ and enrollment date $T$ , so that regression adjustments are necessary (see Remark 1, Main Paper). We simulate BrS measurements on $J=9$ pathogens and assume the number of potential single-pathogen causes $L=J=9$ . To specify etiology regression functions that satisfy the constraint $\sum_{\ell=1}^{L}\pi_{\ell}(\bm{x})=1$ , we use a stick-breaking parameterization with $L=9$ segments. In particular, we let ${\sf{logit}}~{}\{g_{1}(s,t)\}=\beta_{1}\operatorname*{\mathbb{I}}(s=1)+\sin(8\pi(t-0.5)/7)$ , ${\sf{logit}}~{}\{g_{2}(s,t)\}=\beta_{2}\operatorname*{\mathbb{I}}(s=1)+4\exp(3t)/(1+\exp(3t))-0.5$ , and ${\sf{logit}}(g_{\ell})=\beta_{8}\operatorname*{\mathbb{I}}(s=1)$ for $\ell>2$ ; the PEF functions are $\pi_{\ell}(s,t)=g_{\ell}(s,t)\prod_{j<\ell}\{1-g_{j}(s,t)\},\ell=1,\ldots,L(=9)$ , where $\beta_{\ell}=0.1,\ell=1,\ldots,8$ . The true control distribution depends on covariates through $K=2$ subclass weight functions: $\nu_{1}(s,t)={\sf{logit}}^{-1}\left\{\gamma^{\nu}_{1}\operatorname*{\mathbb{I}}(s=1)+4\exp(3t)/(1+\exp(3t))-0.5\right\}$ and $\nu_{2}(s,t)=1-\nu_{1}(s,t)$ . We specify $\eta_{k}(s,t)=\nu_{k}(s,-t),k=1,2$ , highlighting the need for different subclass weights among cases and controls in an npLCM analysis. We set the true TPRs $\theta_{k}^{(j)}=0.95$ and the FPRs $\psi_{1}^{(j)}=0.5$ and $\psi_{2}^{(j)}=0.05$ .
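The stick-breaking construction of the Simulation I PEF functions can be sketched as follows. This is an illustrative reading of the specification above, not the simulation code used for the paper: it assumes that the last segment takes the whole remaining stick ( $g_{9}=1$ ), which the sum-to-one constraint with $L=9$ segments implies, and the function names are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def stick_fractions(s, t, beta=0.1, L=9):
    """Break fractions g_1, ..., g_L at covariates (s, t); beta_l = 0.1 for all l."""
    ind = 1.0 if s == 1 else 0.0
    g = [
        expit(beta * ind + math.sin(8 * math.pi * (t - 0.5) / 7)),
        expit(beta * ind + 4 * math.exp(3 * t) / (1 + math.exp(3 * t)) - 0.5),
    ]
    g += [expit(beta * ind)] * (L - 3)  # steps 3, ..., L - 1
    g.append(1.0)                       # assumption: last segment takes the remainder
    return g


def pefs(s, t):
    """pi_l(s, t) = g_l(s, t) * prod_{j < l} {1 - g_j(s, t)}."""
    remaining, pi = 1.0, []
    for g_l in stick_fractions(s, t):
        pi.append(g_l * remaining)
        remaining *= 1.0 - g_l
    return pi


for s, t in [(1, -1.0), (1, 0.5), (2, 0.0)]:
    pi = pefs(s, t)
    print(s, t, [round(p, 3) for p in pi], round(sum(pi), 6))
```

By construction the nine fractions are non-negative and sum to one at every covariate value, so no further normalization is needed.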

In the regression analyses, we set $\phi_{\ell}(\bm{X})$ to be an additive model of a $\operatorname*{\mathbb{I}}\{S=2\}$ indicator and a B-spline expansion with $7$ degrees of freedom (d.f.) for standardized enrollment date $t$ . We use $K^{*}=7$ and specify the regression formula for subclass weights $\nu_{k}(\cdot)$ and $\eta_{k}(\cdot)$ by additive models of the $\operatorname*{\mathbb{I}}\{S=2\}$ indicator and a B-spline expansion with 5 d.f. for standardized enrollment date.

Simulation II. We consider $L=J=3,6,9$ causes, under the single-pathogen-cause assumption, with BrS measurements made on $N_{d}$ cases and $N_{u}$ controls for each level of $X$ , where $N_{d}=N_{u}=250$ or $500$ . The functions $\phi_{\ell}(X)=\beta_{0\ell}+\beta_{1\ell}\operatorname*{\mathbb{I}}\{X=2\}$ take two sets of values to reflect how variable the PEFs are across the two $X$ levels: i) $\bm{\beta}^{\sf i}_{0}=(0,0,0,0,0,0)$ and $\bm{\beta}^{\sf i}_{1}=(-1.5,0,-1.5,-1.5,0,-1.5)$ where causes have uniform PEFs when $X=1$ and causes B and E dominate when $X=2$ , or ii) $\bm{\beta}^{\sf ii}_{0}=(1,0,1,1,0,1)$ and $\bm{\beta}^{\sf ii}_{1}=(-1.5,1,-1.5,-1.5,1,-1.5)$ to mimic the scenario where pathogens B and E have lower PEFs when $X=1$ and occupy larger fractions when $X=2$ . We further let the measurement error parameters take distinct values: the TPRs $\theta_{k}^{(j)}=0.95$ or $0.8$ and the FPRs $(\psi_{1}^{(j)},\psi_{2}^{(j)})\in\{(0.5,0.05),(0.5,0.15)\}$ , for $j=1,\ldots,J$ . Finally, we set the truth $\nu_{k}(W)=\eta_{k}(W)={\sf{logit}}^{-1}\left(\gamma_{k0}+\gamma_{k1}\operatorname*{\mathbb{I}}\{W=2\}\right)$ where $(\gamma_{10},\gamma_{11})=(-0.5,1.5)$ and $(\gamma_{20},\gamma_{21})=(1,-1.5)$ .

Simulation II: a randomly chosen replication. Here we illustrate the inferences about the stratum-specific and overall PEFs that are available to an analyst by considering a two-level covariate $X=W$ with $J=6$ measurements. Under the single-pathogen cause assumption, we can estimate $12=(2\times 6)$ PEFs, six per level of $X$ , as well as six overall PEFs. For example, based on a single data set simulated under the scenario $\{L=6$ , $N_{d}=500$ , $K=2$ , $\theta_{k}^{(j)}=0.8$ , $(\psi_{1}^{(j)},\psi_{2}^{(j)})=(0.5,0.05)$ , ( $\bm{\beta}^{\sf ii}_{0},\bm{\beta}^{\sf ii}_{1})\}$ , Supplemental Figure 6 shows the posterior distribution of the stratum-specific etiology fractions $\pi_{\ell}(X=s)$ for ( $s=1,2$ ) by row and $L(=J)$ causes $(\ell=1,\ldots,6)$ by column, with the true values indicated by the blue vertical dashed lines; the bottom row shows the posterior distribution of $\pi_{\ell}^{\ast}=\sum_{s}w_{s}\pi_{\ell}(X=s)$ for $L$ causes with empirical weights $w_{s}=N_{d}^{-1}\sum_{i:Y_{i}=1}\operatorname*{\mathbb{I}}\{X_{i}=s\}$ , $s=1,2$ . The true stratum-specific and overall PEFs are covered by their respective $95\%$ CrIs.
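To make the relationship between the linear predictors, the stratum-specific PEFs and the overall PEFs concrete, here is a small sketch for the two Simulation II coefficient settings. It assumes that the PEFs are obtained from $\phi_{\ell}(X)$ through a multinomial-logit (softmax) transformation, which is consistent with the statement that setting i) gives uniform PEFs at $X=1$ but is not spelled out in this appendix. The case counts of $500$ per level follow the scenario above; all function names are hypothetical.

```python
import math


def softmax(phi):
    """Map linear predictors to probabilities that sum to one."""
    m = max(phi)
    w = [math.exp(v - m) for v in phi]
    total = sum(w)
    return [v / total for v in w]


def stratum_pefs(beta0, beta1, x):
    """PEFs at level x, with phi_l(x) = beta0_l + beta1_l * I{x = 2}."""
    ind = 1.0 if x == 2 else 0.0
    return softmax([b0 + b1 * ind for b0, b1 in zip(beta0, beta1)])


def overall_pefs(pefs_by_stratum, case_counts):
    """pi*_l = sum_s w_s * pi_l(X = s), with w_s the empirical case fractions."""
    n = sum(case_counts.values())
    L = len(next(iter(pefs_by_stratum.values())))
    return [sum(case_counts[s] / n * pefs_by_stratum[s][l] for s in case_counts)
            for l in range(L)]


beta0_i, beta1_i = [0, 0, 0, 0, 0, 0], [-1.5, 0, -1.5, -1.5, 0, -1.5]
beta0_ii, beta1_ii = [1, 0, 1, 1, 0, 1], [-1.5, 1, -1.5, -1.5, 1, -1.5]

by_stratum = {s: stratum_pefs(beta0_ii, beta1_ii, s) for s in (1, 2)}
overall = overall_pefs(by_stratum, {1: 500, 2: 500})
for s in (1, 2):
    print(s, [round(p, 3) for p in by_stratum[s]])
print("overall", [round(p, 3) for p in overall])
```

Under this assumed link, causes B and E (the second and fifth entries) have the smallest PEFs at $X=1$ and the largest at $X=2$ in setting ii), matching the description of that scenario.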

Appendix E Additional Simulation Results

E.1 Estimating $\pi_{\ell}(X)$

We use simulation studies to show the frequentist performance of the npLCM regression model in recovering stratum-specific PEFs. The results below are based on a single discrete covariate that influences the PEFs but not the subclass weights in the cases or controls.

In this simulation study, we simulate $500$ cases and $500$ controls for each of $7$ sites. Every subject is measured on $6$ pathogens, A to F; the causes of disease are the single-pathogen causes A-F. First, we let the PEFs vary by site, as shown in Table 2. Second, we simulate the data using $K=1$ subclass.

We simulate data under two scenarios: (I) strong signal, with $\theta_{1}^{(j)}=0.99$ and $\psi_{1}^{(j)}=0.01$ , where the data are expected to provide strong information about the PEFs, and (II) weak signal, with $\theta_{1}^{(j)}=0.55$ and $\psi_{1}^{(j)}=0.45$ , where it is easy to confuse true and false positive results and the data do not provide strong information about the PEFs. In both scenarios (I) and (II), we used a Beta(6,2) distribution as the prior for the TPRs of the BrS measurements. We set the true TPRs and FPRs to be the same across sites and pathogens. In fitting the regression models, we use the etiology regression formulation by specifying $L-1$ sets of regression parameters with site dummy variables as the predictors in $\phi_{\ell}(\cdot)$ . Since our goal is to infer $S=7$ sets of PEFs, we can also specify $S=7$ sets of symmetric Dirichlet priors with hyperparameter $\alpha$ (Dir( $\alpha$ )); we use $\alpha=1$ here. The package baker (https://github.com/zhenkewu/baker) provides an option to use Dirichlet priors when the PEFs depend on discrete covariates only.

E.1.1 Scenario I: Strong Signal

Over $R=100$ replications, the top half of Table 4 summarizes the coverage rates of the $95\%$ credible intervals (CrIs) for the PEFs across all the sites. We observed excellent recovery of the true values across all causes and sites, with the $95\%$ CrIs covering the true values between $90\%$ and $100\%$ of the time. Panel I of Table 4 also shows, for site 1, the posterior mean PEFs, the posterior standard deviations (sd’s) of the PEFs, and the posterior mean squared errors (PMSEs, estimated by $B^{-1}\sum_{b=1}^{B}\sum_{i:Y_{i}=1}\{\pi_{\ell}(X_{i}=s;\bm{\gamma}^{\pi,(b)})-\pi_{\ell}^{0}(X_{i}=s)\}^{2}$ with $B$ retained posterior samples $\{\bm{\gamma}^{\pi,(b)}\}$ ), averaged over $R$ replications. The posterior means provide excellent estimates of the PEFs with small average PMSEs.

E.1.2 Scenario II: Weak Signal

Using data simulated under less discrepant TPRs and FPRs than those in Scenario I, the $95\%$ CrIs cover the truths well for most site-cause pairs, but undercover the truths for causes with the highest PEF in each site (see Table 3). This is expected because when the signal from the data is weak, the model relies more heavily on the uniform prior distribution for the PEFs (symmetric Dirichlet prior with hyper-parameter $1$ ).

More Informative TPR Priors (II∗). We further investigate the model performance when we change the TPR prior distributions from Beta(6,2) to a Beta distribution that has 95% of its mass between 0.525 and 0.575 and is centered around the true TPRs (Beta(835.95, 683.79); beta_parms_from_quantiles(c(0.525,0.575)) using baker). Panel $II^{\ast}$ of Table 4 shows dramatic improvements in the coverage rates. These results suggest that making the prior distributions of the TPRs more tightly concentrated around plausible values can improve inferences about the stratum-specific PEFs in the presence of high levels of noise. Relative to Scenario I, the average PMSEs are larger across sites and pathogens, reflecting the weaker signal in this setting.
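The stated range of the more informative prior can be sanity-checked from the Beta moments. The sketch below is only an approximate check, not the algorithm used by beta_parms_from_quantiles in baker: it uses a normal approximation (mean plus or minus $1.96$ standard deviations), which is reasonable here because both shape parameters are large. The function name is hypothetical.

```python
def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, var ** 0.5


# Informative prior from Scenario II*: approximate central 95% range.
mean, sd = beta_mean_sd(835.95, 683.79)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
print(round(mean, 4), round(lower, 4), round(upper, 4))

# Default prior used in Scenarios I and II, for comparison.
default_mean, default_sd = beta_mean_sd(6.0, 2.0)
print(round(default_mean, 4), round(default_sd, 4))
```

The approximate interval agrees with the $(0.525, 0.575)$ range quoted above, whereas the Beta(6,2) prior is centered at $0.75$ , well above the true TPR of $0.55$ in Scenario II.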

In summary, in the simulation study where the PEFs are influenced by a discrete covariate, the regression model recovers the true values well under strong signals (high sensitivities and low FPRs). Under lower sensitivities and higher FPRs, the noisier simulated data are less informative about the PEFs, whose estimates are then more influenced by the prior distributions of the TPRs and PEFs. In practice, we recommend eliciting high-quality informative TPR priors from domain scientists, as in the PERCH study, and performing sensitivity analyses to understand the robustness of the results with respect to the prior distributions.

E.2 Valid inference of $\pi^{\ast}_{\ell}$ omitting covariates

Under assumption (A1) in Remark 1 of the Main Paper, i.e., the case subclass weights do not depend on covariates, $\bm{\eta}_{k}(\bm{W})=\eta_{k}$ , $k=1,\ldots,K$ , we conduct a simulation study to show that an npLCM analysis omitting covariates is able to provide valid inference about the overall PEFs ( $\bm{\pi}_{\ell}^{\ast}$ ). The simulation settings are exactly the same as in Simulation II, Section 4 of the Main Paper, except that we set $\gamma_{20}=\gamma_{21}=0$ to satisfy assumption (A1). Figure 7 shows that the percent relative biases are similarly negligible in all the $16$ scenarios with $6$ disease classes; Figure 7 also shows excellent empirical coverage rates of the $95\%$ CrIs for $\{\pi_{\ell}^{\ast}\}$ .

Appendix F Supplemental Figures

Bibliography

Bandeen-Roche, K., Miglioretti, D. L., Zeger, S. L., and Rathouz, P. J. (1997). Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92(440):1375–1386.
Bartolucci, F. and Forcina, A. (2006). A class of latent marginal models for capture–recapture data with continuous covariates. Journal of the American Statistical Association, 101(474):786–794.
Bedrick, E. J., Christensen, R., and Johnson, W. (1996). A new perspective on priors for generalized linear models. Journal of the American Statistical Association, 91(436):1450–1460.
Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455.
Carlin, B. and Louis, T. (2009). Bayesian Methods for Data Analysis, volume 78. Chapman & Hall/CRC.
Chung, H., Flaherty, B. P., and Schafer, J. L. (2006). Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):723–743.
Crawley, J., Prosperi, C., Baggett, H. C., Brooks, W. A., Deloria Knoll, M., Hammitt, L. L., Howie, S. R., Kotloff, K. L., Levine, O. S., Madhi, S. A., et al. (2017). Standardization of clinical assessment and sample collection across all PERCH study sites. Clinical Infectious Diseases, 64(suppl_3):S228–S237.
Deloria Knoll, M., Fu, W., Shi, Q., Prosperi, C., Wu, Z., Hammitt, L. L., Feikin, D. R., Baggett, H. C., Howie, S. R., Scott, J. A. G., et al. (2017). Bayesian estimation of pneumonia etiology: epidemiologic considerations and applications to the Pneumonia Etiology Research for Child Health study. Clinical Infectious Diseases, 64(suppl_3):S213–S227.
