sourceR: Classification and Source Attribution of Infectious Agents   among Heterogeneous Populations

Poppy Miller; Jonathan Marshall; Nigel French; Chris Jewell

arXiv:1702.07422·stat.AP·July 5, 2017·PLoS Comput. Biol.

sourceR: Classification and Source Attribution of Infectious Agents among Heterogeneous Populations

Poppy Miller, Jonathan Marshall, Nigel French, Chris Jewell

PDF

TL;DR

sourceR is an R package that uses Bayesian modeling and strain typing data to attribute human zoonotic infections to sources, aiding food safety interventions and understanding pathogen behavior.

Contribution

It introduces a novel Bayesian framework with non-parametric clustering for strain types, improving source attribution accuracy in zoonotic disease studies.

Findings

01

Successfully applied to Campylobacter jejuni data from New Zealand

02

Enables straightforward attribution of human cases to sources

03

Detects high-virulence strain types through clustering

Abstract

Zoonotic diseases are a major cause of morbidity, and productivity losses in both humans and animal populations. Identifying the source of food-borne zoonoses (e.g. an animal reservoir or food product) is crucial for the identification and prioritisation of food safety interventions. For many zoonotic diseases it is difficult to attribute human cases to sources of infection because there is little epidemiological information on the cases. However, microbial strain typing allows zoonotic pathogens to be categorised, and the relative frequencies of the strain types among the sources and in human cases allows inference on the likely source of each infection. We introduce sourceR, an R package for quantitative source attribution, aimed at food-borne diseases. It implements a fully joint Bayesian model using strain-typed surveillance data from both human cases and source samples, capable of…

Tables1

Table 1. Table 1: Description and definition of the model parameters.

Parameter	Description	Estimation
$λ_{i j t l}$	Number of human cases from type $i$ , source $j$	$λ_{i j t l} = α_{j t l} \cdot q_{i} \cdot r_{i j t} \cdot k_{j t}$
	time $t$ and location $l$
$λ_{i t l}$	Number of human cases from type $i$	$λ_{i t l} = \sum_{j = 1}^{m} λ_{i j t l}$
	time $t$ and location $l$
$λ_{j t l}$	Number of human cases from source $j$	$λ_{j t l} = \sum_{i = 1}^{n} λ_{i j t l}$
	time $t$ and location $l$
$y_{i t l}$	Number of human cases from type $i$	$y_{i} \sim Poisson (λ_{i t l})$
	time $t$ and location $l$
$x_{i j t}$	Number of positive samples (that were	Data
	successfully MLST typed) from source $j$ , type $i$
	time $t$
$h_{i j t}$	Number of positive samples (PCR) that	Data
	could not be MLST typed.
$s_{j t}^{+}$	Total number of samples from source $j$ time $t$	Data
$k_{j t}$	Prevalence of contamination for each source	$\sum_{i = 1}^{I} (x_{i j t} + h_{i j t}) / s_{j t}^{+}$
$r_{i j t}$	Relative occurrence of type $i$ on source $j$	$𝐫_{j t} \sim Dirichlet (a_{r})$
	time $t$	or $x_{i j t} / \sum_{i = 1}^{n} x_{i j t}$
$p_{i j t}$	Absolute prevalence of type $i$ in source $j$	$r_{i j t} \cdot k_{j t}$
	time $t$
$α_{j t l}$	Unknown source effect for source $j$	$α_{t l} \sim Dirichlet (a_{α})$
	time $t$ and location $l$
$q_{i}$	Unknown type effect for type $i$ in group $k$ ,	$𝐪 \sim DP (Gamma (a_{θ}, b_{θ}), a_{q})$
	where group $k$ has an unknown value $θ_{k}$

Equations27

y_{i}

y_{i}

λ_{i}

r_{\cdot j} \sim Dirichlet (1) \forall j

r_{\cdot j} \sim Dirichlet (1) \forall j

k_{j} \sim Beta (1, 1) \forall j

p_{ij} \sim Beta (w_{ij}, v_{ij})

p_{ij} \sim Beta (w_{ij}, v_{ij})

y_{i tl} \sim Poisson (λ_{i tl} = q_{i} j = 1 \sum m α_{j tl} p_{ij t})

y_{i tl} \sim Poisson (λ_{i tl} = q_{i} j = 1 \sum m α_{j tl} p_{ij t})

x_{j t} \sim \mbox M u l t in o mia l (s_{j t}^{+}, r_{j t})

x_{j t} \sim \mbox M u l t in o mia l (s_{j t}^{+}, r_{j t})

p_{ij t} = r_{ij t} k_{j t}

p_{ij t} = r_{ij t} k_{j t}

q_{i} \sim \mbox D P (a_{q}, Q_{0}) .

q_{i} \sim \mbox D P (a_{q}, Q_{0}) .

λ_{ij} = \frac{r _{ij}}{\sum _{j} r _{ij}} y_{i}

λ_{ij} = \frac{r _{ij}}{\sum _{j} r _{ij}} y_{i}

λ_{j} = i \sum λ_{ij}

λ_{j} = i \sum λ_{ij}

θ_{k} \sim Gamma (a_{θ} + i : S_{i} = k \sum y_{i}, b_{θ} + i : S_{i} = k \sum j = 1 \sum m α_{j} \cdot p_{ij})

θ_{k} \sim Gamma (a_{θ} + i : S_{i} = k \sum y_{i}, b_{θ} + i : S_{i} = k \sum j = 1 \sum m α_{j} \cdot p_{ij})

p_{h^{⋆}} = a_{q} \int_{Θ} L (y_{i} ∣ θ, λ_{i}^{⋆}) d P_{0} (θ) = \frac{b _{θ}^{a_{θ}} ( a _{θ} + y _{i} )}{Γ ( a _{θ} ) ( b _{θ} + λ _{i}^{⋆} ) ^{a_{θ} + y_{i}}}

p_{h^{⋆}} = a_{q} \int_{Θ} L (y_{i} ∣ θ, λ_{i}^{⋆}) d P_{0} (θ) = \frac{b _{θ}^{a_{θ}} ( a _{θ} + y _{i} )}{Γ ( a _{θ} ) ( b _{θ} + λ _{i}^{⋆} ) ^{a_{θ} + y_{i}}}

p_{h}

p_{h}

p_{h^{⋆}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

sourceR: Classification and Source Attribution of Infectious Agents among Heterogeneous Populations

Poppy Miller

CHICAS, Faculty of Health and Medicine, Lancaster University, Lancaster, UK

Jonathan Marshall

Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand

mEpiLab, Massey University, Palmerston North, New Zealand

Nigel French

mEpiLab, Massey University, Palmerston North, New Zealand

New Zealand Food Safety Science and Research Centre

New Zealand Institute for Advanced Studies

Chris Jewell

New Zealand Food Safety Science and Research Centre

Abstract

Zoonotic diseases are a major cause of morbidity, and productivity losses in both humans and animal populations. Identifying the source of food-borne zoonoses (e.g. an animal reservoir or food product) is crucial for the identification and prioritisation of food safety interventions. For many zoonotic diseases it is difficult to attribute human cases to sources of infection because there is little epidemiological information on the cases. However, microbial strain typing allows zoonotic pathogens to be categorised, and the relative frequencies of the strain types among the sources and in human cases allows inference on the likely source of each infection. We introduce sourceR, an R package for quantitative source attribution, aimed at food-borne diseases. It implements a fully joint Bayesian model using strain-typed surveillance data from both human cases and source samples, capable of identifying important sources of infection. The model measures the force of infection from each source, allowing for varying survivability, pathogenicity and virulence of pathogen strains, and varying abilities of the sources to act as vehicles of infection. A Bayesian non-parametric (Dirichlet process) approach is used to cluster pathogen strain types by epidemiological behaviour, avoiding model overfitting and allowing detection of strain types associated with potentially high ’virulence’.

sourceR is demonstrated using Campylobacter jejuni isolate data collected in New Zealand between 2005 and 2008. Chicken from a particular poultry supplier was identified as the major source of campylobacteriosis which is qualitatively similar to results of previous studies using the same dataset. Additionally, the software identifies a cluster of 9 MLSTs with abnormally high ’virulence’ in humans.

sourceR enables straightforward attribution of cases of zoonotic infection to putative sources of infection by epidemiologists and public health decision makers. As sourceR develops, we intend it to become an important and flexible resource for food-borne disease attribution studies.

1 Introduction

1.1 Background

Zoonotic diseases are a major source of human morbidity world wide. In 2010, there were an estimated 600 million cases globally [1], of which 96 million were Campylobacter spp. (resulting in 21 thousand deaths [2]). Attributing cases of food-borne disease to putative sources of infection is crucial to identify and prioritise food safety interventions. Traditional approaches to source attribution include full risk assessments, analysis and extrapolation of surveillance or outbreak data, and analytical epidemiological studies [3]. However, their results can be highly uncertain due to long and variable disease incubation times, and many and various exposures of an individual to potential sources of infection. Given this difficulty, quantitative methods using pathogen strain type frequency have shown promise for statistically identifying important sources of food-borne illness [4].

For a given disease, quantitative source attribution uses typing of pathogen isolates from human cases and suspected sources of infection (food and environmental). Samples are screened for the presence of the pathogen, with isolates then categorised using a typing methodology. Multilocus Sequence Typing (MLST) is a commonly used genotyping method providing a relatively coarse characterisation of isolates of bacterial species [5]. An MLST sequence type is defined as a unique combination of alleles at several gene loci, typically located in conserved regions of the genome [6].

Routine surveillance for food-borne pathogens is now commonplace in many countries and is performed by national authorities, for example FoodNet in the US [7], the Danish Zoonosis Centre (food.dtu.dk), and the Ministry for Primary Industries in New Zealand (foodsafety.govt.nz). Despite this availability of data we are unaware of any previous implementations in standard statistical software for source attribution modelling, with past analyses being performed using a variety of ad hoc methodologies. Moreover, current statistical source attribution models have strong assumptions, computational approximations or inherent identifiability problems (discussed further in the ‘Review of models and notation’ section).

This paper presents an R package sourceR, which implements a flexible Bayesian non-parametric model, designed for use by epidemiologists and public health decision makers to attribute cases of zoonotic infection to putative sources of infection. We first describe a motivating example and review previous source attribution models before describing our model innovations, demonstrating the software, and discussing results and future directions.

1.2 Motivating example

In 2006, New Zealand had one of the highest incidences of campylobacteriosis in the developed world, with an annual incidence in excess of 400 cases per 100,000 people [8]. The data set was first published in [9], with a detailed description of the data (and data collection methods) available in [10] and [11]. A campaign to change poultry processing procedures, supported in part by results from previous quantitative source attribution approaches, was successful in leading to a sharp decline in campylobacteriosis incidence after 2007 [4].

The data consists of MLST-genotyped Campylobacter isolates (from both human cases of campylobacteriosis and potential food and environmental sources) collected between 2005 and 2008 in the Manawatu region of New Zealand. These data are included in our sourceR package (named campy). We use this data set as a case study, and compare our results with previously published statistical approaches.

1.3 Review of models and notation

In this section we define our notation, and briefly review the approaches that have been used previously to analyse campy. For a given time period, we denote by $y_{i}$ the number of human cases of a disease caused by pathogen type $i=1,\dots,n$ . For the same time period, we let $s_{j}$ denote the total number of source samples collected from source $j=1,\dots,m$ , for which $x_{ij}$ are positive for pathogen type $i$ .

1.3.1 Hald model

The approach of Hald et al. [12] was to compare the number of human cases caused by different pathogen types with their prevalence in different food sources (whilst accounting for type and source specific effects). This requires a heterogenous distribution of pathogen types among the food sources. The number of human cases for each type $y_{i}$ is modelled as a Poisson random variable with mean given by a linear combination of source specific effects, type specific effects and source sample Campylobacter contamination prevalences.

[TABLE]

where for source $j$ $c_{j}$ is the annual exposure, $p_{ij}=r_{ij}\times k_{j}$ is the absolute prevalence of each pathogen type with $k_{j}=\sum_{i=1}^{n}x_{ij}/s_{j}$ the prevalence of positive samples and $r_{ij}=\frac{x_{ij}}{\sum_{i=1}^{n}x_{ij}}$ the relative prevalence of each pathogen type.

The unknown parameters in the model are the vectors $\bm{q}$ and $\bm{\alpha}$ . Here, $\bm{q}$ represents the characteristics that determine a type’s capacity to cause an infection (such as survivability during food processing, pathogenicity and virulence), and $\bm{\alpha}$ accounts for the ability of a particular source to act as a vehicle of infection. These parameters are interpreted further in S4 Appendix.. Inference is performed in a Bayesian framework allowing the model to explicitly include and quantify the uncertainty surrounding each of the parameters.

Equation 2 over-specifies the model, with $m+n$ parameters (the source and type effects) but only $n$ independent observations (the observed human case totals $y_{i}$ ). In the original approach [12], identifiability was obtained by a priori clustering of the elements of $\bm{\alpha}$ and $\bm{q}$ . In constrast, the Modified Hald model [4] prefers to reduce the effective number of parameters by treating $\bm{q}$ as a log Normal $(0,\tau)$ distributed random effect. However, a strong prior is needed on $\tau$ to shrink $\bm{q}$ towards 0 sufficiently to avoid overfitting the model, the choice of which is arbitrary.

The Modified Hald model introduces uncertainty into the relative prevalence matrix by modelling the source sampling process. This model was fitted in WinBUGS using an approximate two stage process [4]. First, a posterior distribution was estimated for the absolute prevalence of source types $\bm{p}$ , using the model specified in Eqs 3 and 4 :

[TABLE]

The marginal posterior for each element of $\bm{p}$ was then approximated by a Beta distribution

[TABLE]

using the method of moments to calculate $w_{ij}$ and $v_{ij}$ . These were used as independent priors for each $p_{ij}$ which removes the constraint that they sum to $k_{j}$ over each type $i$ . Thus, the absolute prevalence for source $j$ ( $\sum_{i=1}^{I}p_{ij}$ ) is no longer constrained to be a probability (as it may be larger than 1).

1.3.2 Asymmetric Island model

The Asymmetric Island Model [13, 14] takes a different approach to the models described above. Here, the evolutionary processes (mutation, migration and recombination) of the sequence types are modelled to infer probabilistically the source of each human infection using genetic data from each subtype. The extra information in the genetic typing allows the model to attribute human cases from a type not observed in any sources to a likely source of infection by comparing the genetic similarity to other types that are observed in the sources. This is not possible with the Hald or Modified Hald models, however, they are much simpler with fewer assumptions and a wider range of suitable data (for example, phenotypic typing can be used). We include results from this model as a comparison in the ‘Results’ section.

2 Design and Implementation

Our approach addresses the problems inherent in both the Hald and Modified Hald models. We introduce a fully joint model for both source and human case sampling allowing us to integrate over uncertainty in the source sampling process, estimating both the prevalence of contaminated source samples and the relative prevalence of each identified type (without resorting to an approximate marginal probability distribution on $\bm{p}$ ). Furthermore, we introduce non-parametric clustering of pathogen types using a Dirichlet Process (DP) model on the type effect vector $\bm{q}$ , providing an automatic data-driven way of reducing the effective number of parameters to aid model identifiability. We are able, therefore, to circumvent the Hald model requirement for heuristically grouping pathogen types, as well as avoiding an arbitrary prior distribution specification for the random effect precision parameter ( $\tau$ ) required by the Modified Hald model.

Often, human case data is associated with location such as urban/rural, or even GPS coordinates. On the other hand, food samples are likely to be less spatially constrained due to distances between production and sale locations. Also, both human and source data may exist for multiple time-periods. We therefore denote the number of human cases of time $i$ occurring in time-period $t$ at location $l$ by $y_{itl}$ , the number of samples of source $j$ in time-period $t$ by $s_{jt}$ , with the type counts $x_{ijt}$ . We allow for different exposures of humans to sources in different locations, by allowing the source effects to vary between times and locations, $\alpha_{jtl}$ .

2.1 Model

As with the Hald model, we assume the number of human cases $y_{itl}$ identified by isolation of subtype $i$ in time-period $t$ at location $l$ is Poisson distributed

[TABLE]

For each source $j$ , we model the number of positive source samples

[TABLE]

where $\bm{x}_{jt}=\left(x_{ijt},i=1,...,n\right)^{T}$ denotes the vector of type-counts in source $j$ in time-period $t$ , $s^{+}_{jt}=\sum_{i=1}^{n}x_{ijt}$ denotes the number of positive samples obtained, and $\bm{r}_{jt}$ denotes a vector of relative prevalences $Pr\left(\textsf{type}_{i}|\textsf{source}_{j},\textsf{time}_{t}\right)$ . The advantage of this model is that it automatically places the constraint $\sum_{i=1}^{n}r_{ijt}=1$ , avoiding the approximation made in [4] where independent Beta-distributed priors were assigned marginally to components of $\bm{r}_{jt}$ . The source case model is then coupled to the human case model through the simple relationship

[TABLE]

where $k_{jt}$ is the prevalence of any isolate in source $j$ in time-period $t$ .

In principle, a Beta distribution could be used to model $k_{jt}$ , arising as the conjugate posterior distribution of a Binomial sampling model for $s^{+}_{jt}$ positive samples from $s_{jt}$ tested, and a Beta prior on $k_{jt}$ . We instead choose to fix the source prevalences at their empirical estimates ( $k_{jt}=s^{+}_{jt}/s_{jt}$ ) because the number of source samples is typically high.

The type effects $\bm{q}$ , which are assumed invariant across time or location, are drawn from a DP with base distribution $Q_{0}$ and a concentration parameter $a_{q}$

[TABLE]

The DP groups the elements of $\bm{q}$ into a finite set of clusters $1:\kappa$ (unknown a priori) with values $\theta_{1},...,\theta_{\kappa}$ meaning bacterial types are clustered into groups with similar epidemiological behaviour.

Heterogeneity in the source matrix $\bm{x}$ is absolutely required to identify clusters from sources, which may not be guaranteed a priori due to the observational nature of the data collection.

2.2 Inference

This section describes how the model is fitted in a Bayesian context by first describing the McMC algorithm used to fit this model, then developing the prior model.

2.2.1 McMC algorithm

The joint model over all unobserved and observed quantities is fitted using Markov chain Monte Carlo (McMC, full details in S6 Appendix.). The source effects and relative prevalence parameters are updated using independent adaptive Metropolis-Hastings updates [15]. The type effects $\bm{q}$ are modelled using a DP (Eq 8) with a Gamma base distribution $Q_{0}\sim\mbox{Gamma}(a_{\theta},b_{\theta})$ . As the Gamma distribution is conjugate with respect to the Poisson likelihood (Eq 5), it is possible to use a marginal Gibbs sampler within a Polya Urn, or “Chinese restaurant process” construction [16] (see S6 Appendix.). This was chosen over the more general “Stick breaking process” because it allows sampling from the conditional posterior of $\bm{\theta}$ . This is particularly important when the elements of $\bm{\theta}$ values are highly dispersed: a base distribution with little mass near the locations of some of the true values for the groups results in poor mixing for the group allocations using the stick breaking algorithm (as it is difficult for a type to change group when no other groups have a suitable $\theta$ value). In contrast, the marginal scheme allows an element of $\bm{q}$ to move into a new cluster, then samples a $\theta$ value directly from the conditional posterior for that group, improving group mixing dramatically.

2.2.2 Priors

The source and type parameters ( $\bm{\alpha}_{tl}$ for all $t$ and $l$ , and $\bm{q}$ respectively) account for a multitude of source and type specific factors which are difficult to quantify a priori. Therefore, with no single real-world interpretation, the distributional form of the priors were chosen for their flexibility. A Dirichlet prior is placed on each $\mathbf{r}_{jt}$ which suitably constrains its L1 norm, i.e. $\sum_{i=1}^{n}r_{ijt}=1$ . A Dirichlet prior is also placed on each $\bm{\alpha}_{tl}$ , with the constrained L1 norm aiding identifiability between the mean of the source and type effect parameters. For more detail on specifying parameters for the Dirichlet Process and priors see the S2 Appendix..

2.3 Code implementation

Standard McMC packages (e.g. WinBUGS, Stan, PyMC3) all lack the capability to implement marginal Gibbs sampling for Dirichlet processes, necessitating a custom McMC framework (see section ‘Extensibility’). We chose R as a platform because of its ubiquity in epidemiology, and advanced support for post-processing of McMC samples. Minimal dependencies on other R packages are required, and are installed automatically.

sourceR uses an object-oriented design, which allows separation of the model from the McMC algorithm. Internally, the model is represented as a directed acyclic graph (DAG, see S3 Appendix.) in which nodes are represented by an R6 class hierarchy. Generic adaptive Metropolis Hastings algorithms are attached to each parameter node, with the conditional independence properties of the DAG allowing automatic computation of the required (log) conditional posterior densities.

A difficulty with the DAG setup is the representation of the Dirichlet process model on the type effects $\bm{q}$ , since each update of the marginal Gibbs sampler requires structural alterations. Therefore, we subsume the entire Dirichlet process into a single node, with a bespoke marginal Gibbs sampling algorithm written for our Gamma base-distribution and Poisson likelihood model.

3 Results

The case study below (using the campy (campylobacteriosis) data set described in the ‘Motivation’ section) illustrates how the sourceR package is used in practice to identify important sources of infection. We compare the results of our Bayesian non-parametric approach with results from the Modified Hald and Asymmetric Island models, and additionally the historical ‘Dutch’ model (see S1 Appendix. and [17]). The priors for our model were selected to be non-informative.

The prevalence $k_{j}$ is calculated by dividing the number of positive samples by the total number of samples for each source. In the data below, we note that for several samples the MLST typing failed, with the number of positive samples exceeding the apparent total number of MLST-typed isolates. However, assuming MLST typing fails independently of pathogen type, this does not bias our results.

The work flow for fitting the model begins with removing types with no source cases and calculating the prevalences $k$ .

⬇

data(campy)

zero_rows <- which(apply(campy[, c(2 : 7)], 1, sum) == 0)

campy <- campy[-zero_rows,]

total_samples = c(239, 196, 127, 595, 552, 524)

positive_samples = c(181, 113, 109, 97, 165, 86)

k <- data.frame(Value = positive_samples / total_samples,

            Source = **colnames**(campy[, 2:7]),

            Time = **rep**("1", 6),

            Location = **rep**("A", 6))

The data and model parameters are set using the HaldDP$new() constructor. Starting values are selected automatically unless provided via a list named init to the constructor.

⬇

priors <- list(a_theta = 0.01, b_theta = 0.00001,

           a_alpha = 1, a_r = 0.1)

my_model <- HaldDP$new(data = campy, k = k, priors = priors, a_q = 0.1)

McMC control parameters are be passed via fit_params

⬇

my_model$fit_params(n_iter = 1000, burn_in = 10000, thin = 500)

The model is run using the update function. Additional iterations may be appended using append = TRUE.

⬇

set.seed(59623)

my_model$update()

# my_model$update(n_iter = 10000, append = T)

We provide the extract method for ease of access to the complex posterior.

⬇

## returns the posterior for the r, alpha, q, c,

## lambda_i, lambda_j and lambda_j_prop parameters

my_model$extract()

The extract function returns the posterior for the selected parameters as a list with a multidimensional array for each of alpha, r, q, s, lambda_j and lambda_i.

Trace and autocorrelation plots for the parameters indicate that the Markov chain is mixing well and has converged, and that thinning by 500 is adequate (Figure 1). The residual plots for the $\lambda_{i}$ s (Figure 2) show that the model fits well.

⬇

## Plot the marginal posteriors for the following parameters

## source effect for Chicken supplier C

plot(my_model $**extract**(params="alpha", sources="ChickenC")$ alpha, type="l")

## type effect 25

plot(my_model $**extract**(params="q", types="25")$ q, type="l")

## relative prevalence for source effect Ovine, type 354

plot(my_model $**extract**(params="r", sources="Ovine", types="354")$ r,

 type="l")

## number of cases attributed to Chicken supplier B

plot(my_model $**extract**(params="lambda_j", sources="ChickenB")$ lambda_j,

 type="l")

## number of cases attributed to sub type 42

plot(my_model $**extract**(params="lambda_i", types="42")$ lambda_i, type="l")

The summary() function calculates medians and credible intervals calculated with three possible methods (percentile, SPIn [18], or Chen-Shao [19]).

⬇

my_model$summary(alpha = 0.05, CI_type = "percentiles")

These can be used to plot an observation versus fitted plot as follows

⬇

## The summary for lambda i is a 4D array with

## [type, time, location, (median/ CI_lower/ CI_upper)]

med_li_vals <- my_model$summary(alpha = 0.05, params = "lambda_i",

            time = "1", location = "A")$lambda_i[, , , "median"]

human_cases <- my_model $**print**_**data**()$ y

plot(med_li_vals, human_cases)

\MakeFramed

\endMakeFramed

See S5 Appendix. and S8 Appendix. for more details on using the package.

3.1 Type effect marginal distributions and clustering

To visualise how the DP has clustered the type effects, we use Gower’s distance [20] to compute a dissimilarity matrix between all possible pairs of types. To aid interpretation of the posterior clustering of the type effects under the DP, we provide a method plot_heatmap() that plots the clustering as a heatmap with a dendrogram. Figure 4 shows that the DP identified four main pathogen type clusters.

⬇

my_model$plot_heatmap()

The violin plots of the marginal posterior distributions for each type effect (Figure 5) show the largest group of types has very small type effects. These correspond to types observed in few source samples and no human cases. Consequently, there is very little information for their type effects which results in very wide credible intervals. The other three groups have much larger type effects. The clustering results identify clusters of strains having particular traits that could be explored using further genotyping or phenotyping assays.

3.2 Comparison of the proportion of cases attributed to each source

Figure 3 shows the proportion of cases attributed to each source for the HaldDP model and three commonly used source attribution models. The median values are similar between all models except the Dutch method [17]. The Dutch model confidence intervals are very narrow because there are far fewer parameters in the model; however, the lack of source and type effects in the model biases the results. The credible intervals produced by the Island model may be narrow due to more accuracy (as additional genetic information is used). The wide credible intervals for the the HaldDP and modified Hald models may be due to C. jejuni’s complex epidemiology resulting in relatively large uncertainty for the disease origin [4], and posterior correlations between some parameters. In particular, the new model shows that the proportion of cases attributed to poultry supplier A is negatively correlated with the proportion of cases attributed to both ovine and poultry supplier B sources (Pearson correlation coefficients of -0.60 and -0.65 respectively, see Fig 7 in S7 Appendix.). The HaldDP model gives a more accurate representation of the uncertainty inherent in source attribution. Some of this non-identifiability is not fully explored in the Modified Hald model [4] as fitting the model in two stages does not allow full propagation of the uncertainty. In particular, when calculating the hyper-parameters for the Beta priors for each $p_{ij}$ from the first stage model, the authors imposed a minimum $\alpha_{ij}$ of 1. This prevents ’bath tub’ shaped beta priors for any $p_{ij}$ s which makes the model easier to fit at the expense of discouraging full exploration of the marginal posteriors for $p_{ij}$ s that truly have a bath-tub shape.

4 Availability and Future Directions

The stable release version of sourceR is available from the Comprehensive R Archive Network, released under a GPL-3 licence. The development version is available at http://fhm-chicas-code.lancs.ac.uk/millerp/sourceR. As this package develops, we intend sourceR to become a platform for new source attribution model development, providing a central analytic resource for public health professionals. The establishment of a standard package with a familiar interface will therefore lead to improved repeatability and reusability of source attribution analyses, supporting national public health and hygiene policy decisions.

4.1 Package extensibility

With increased interest in source attribution models for both food-borne pathogens, and sourceR has been written with extensibility in mind, with the DAG representation allowing for rapid construction of modified or new models. The package routines are written in R (as opposed to C or C++) to aid readability, with the node class hierarchy and three stage workflow designed to aid the addition of new model classes. All internal classes and methods are documented to enable prospective developers to familiarise themselves with the source code quickly. We note that the DAG framework is not limited solely to source attribution models and may used for other Bayesian applications, particularly those for which a Dirichlet process is required.

4.2 Model extensions

The main focus of extending sourceR will be on modelling spatiotemporal correlation in the time- and location- dependent parameters. With the trend in collecting precise geolocation data with human cases, and improved traceability of food, a spatiotemporal correlation model on $\bm{\alpha}_{tl}$ could be used to identify particular foci of source contamination, therefore enabling targeted investigation of particular food supply regions. Implementation of time varying type effects may be appropriate, particularly in the face of evidence that Campylobacter can evolve quickly, with genetic variation conferring virulence not apparent from course-scale MLST typing [22]. Interaction terms between some sources and types would allow for the biologically plausible possibility that certain types are more or less likely to survive and cause disease, dependent on the food source they appear in. This would occur if a specific type was particularly well adapted to a certain food source. However, including interaction terms would significantly increase the number of parameters and reduce identifiability of the model.

4.3 Increasing McMC efficiency

Testing has revealed that the current Metropolis-Hastings based fitting algorithm suffers a loss of efficiency if the source matrix is sparse or highly unbalanced, imbuing negative correlations between certain type/source effect combinations. Gradient-based fitting algorithms such as Hamiltonian Monte Carlo (HMC) [23] are designed to converge to high-dimensional, non-orthogonal target distributions much more quickly, and are a target of future development. In particular, the No U-Turn Sample (NUTS) presents an attractive method for tuning HMC adaptively, a quality which we consider necessary to minimise user intervention and maximise research productivity [24].

5 Conclusions

We have presented a novel source attribution model which builds upon, and unites, the Hald and Modified Hald approaches. It is widely applicable, fully joint, and does not require approximations or a large number of assumptions. Mixing and a posteriori correlations are significantly decreased in comparison to the Modified Hald model. Furthermore, it allows the data to inform type effect clustering using a Bayesian non-parametric model which identifies groups of bacterial sub types with similar putative virulence, pathogenicity and survivability. This is a significant improvement over the previous attempts to improve model identifiability (fixing some source and type effects, or modelling the type effects as random using a 2 stage model). Like the Modified Hald model, the new model incorporates uncertainty in the prevalence matrix into the model, however, it does this by fitting a fully joint model rather than a 2 step model. This has the advantage of allowing the human cases to influence the uncertainty in the source cases and preserves the restriction on the sum of the prevalences for each source. The sourceR package implements this model to enable straightforward attribution of cases of zoonotic infection to putative sources of infection by epidemiologists and public health decision makers.

6 Supporting Information

S1 Appendix.

Dutch model overview

The Dutch method [17] is one of the simplest models for source attribution. It compares the number of reported human cases caused by a particular bacterial subtype with the relative occurrence of that subtype in each source. The number of reported cases per subtype and reservoir is estimated by:

[TABLE]

where $r_{ij}$ is the relative occurrence of bacterial subtype $i$ in source $j$ , $y_{i}$ is the estimated number of human cases of type $i$ per year, $\lambda_{ij}$ is the expected number of cases per year of type $i$ from source $j$ . A summation across types gives the total number of cases attributed to source $j$ , denoted by $\lambda_{j}$ :

[TABLE]

As the Dutch model has no inherent statistical noise model, confidence intervals for the estimated total attributed cases $\hat{\lambda}_{j}$ by bootstrap sampling over the data set. This model implicitly assumes that there are no source or type specific effects (such as differing virulence of types, or differing consumption of food sources) which is not plausible for most zoonoses.

S1 Table.

Summary of model parameters.

The following table gives a list of the model parameters for easy reference.

S2 Appendix.

Dirichlet Priors and Process details

The Dirichlet Process is a random probability measure defined by a base distribution $Q_{0}$ and a concentration parameter $a_{q}$ [25]. The base distribution constitutes a prior distribution in the values of each element of the type effects $\mathbf{q}$ whilst the concentration parameter encodes prior information on the number of groups $K$ to which the pathogen types are assigned. For small values of $a_{q}$ , samples from the DP are likely to have a small number of atomic measures with large weights. For large values, most samples are likely to be distinct, and hence, concentrated on $Q_{0}$ . A value of 1 implies that, a priori, two randomly selected types have probability 0.5 of belonging to the same cluster [16].

Specifying the Dirichlet Process base distribution and concentration parameters:

The concentration parameter of the DP is specified by the analyst as a modelling decision. The concentration parameter specifies how strong the prior grouping is. In the limit $a\rightarrow 0$ , all types will be assigned to one group, increasing $a$ makes a larger number of groups increasingly likely. The Gamma base distribution $Q_{0}$ induces a prior for the cluster locations. This prior should not be too diffuse because if these locations are too spread out, the penalty in the marginal likelihood for allocating individuals to different clusters will be large, hence the tendency will be to overly favour allocation to a single cluster. However, the prior parameters may have a stronger effect than anticipated due to the small size of the relative prevalence and source effect parameters. This can been seen by considering the marginal posterior for $\theta_{k}$

[TABLE]

The term $\sum_{i:S_{i}=k}\sum_{j=1}^{m}\alpha_{j}\cdot p_{ij}$ is very small (due to the Dirichlet priors on $\mathbf{\alpha}$ and $\mathbf{r}_{j}$ ), which can result in even a fairly small rate parameter ( $b_{\theta}$ ) dominating.

Specifying Dirichlet priors:

The simplest Dirichlet priors for the source effects and relative prevalences are symmetric (meaning all of the elements making up the parameter vector $\bm{a}$ have the same value $a$ , called the concentration parameter). Symmetric Dirichlet distributions are used as priors when there is no prior knowledge favouring one component over another. When $a$ is equal to one, the symmetric Dirichlet distribution is uniform over all points in its support. Values of the concentration parameter above one prefer variates that are dense, evenly distributed distributions, whilst values of the concentration parameter below 1 prefer sparse distributions. Note, a prior of 1 for the relative prevalences is too strong (if a relatively non-informative prior is preferred) when there are many observed zero’s in the source data; a prior value of 0.1 is more suitable. A more informative prior can be specified by using a non-symmetric Dirichlet distribution. The magnitude of the vector of $\bm{a}$ parameters corresponds to the strength of the prior. The relative values of the $\bm{a}$ vector corresponds to prior information on the comparative sizes of the parameters.

S3 Appendix.

Directed acyclic graph of the model

S4 Appendix.

Further details about the interpretation of the source and type effects.

The interpretation of source and type effects depends on the quality and type of data collected, the model specification, and the characteristics of the organism of interest. Source effects account for factors such as the amount of the food source consumed, the physical properties of the source and the environment provided for the bacteria through storage and preparation. Including an environmental source in the model can be thought of as grouping the (individually) unmeasured wildlife sources into one. It may also be a transmission pathway for pathogens present in livestock sources (for example, through the contamination of waterways) which complicates the interpretation meaning the source effects no longer directly summarise the ability of the source to act as a vehicle for food-borne infections [12]. Future work could involve attributing the water/ environmental samples to the other sources of infection (such as contamination from bovine, ovine, poultry, or other animal sources). Therefore, it would be possible to estimate the proportion of cases attributed to a sample directly, and via the environement.

S5 Appendix.

**Helpful details regarding use of **sourceR

The sourceR package currently allows the relative prevalence matrix to be fixed at the maximum likelihood estimates, which includes zero values where a particular type was not detected in any samples from a source. Fixing the relative prevalence matrix increases the posterior precision (and significantly reduces run time), but the results may be biased if the source data is not of high quality. Reducing the number of elements in the relative prevalence matrix $r$ that get updated at each iteration can significantly reduce computation time, however, the chains will converge more slowly.

Care must be taken in performing marginal interpretations of the number of type parameters. It is much easier to split a group into two (with similar group means) than it is to merge two groups with clearly different means. Hence, a histogram of the number of groups per iteration is positively skewed compared to the true number of groups. When fitting the model with simulated data, visually assessing the dendrogram and heatmap to determine the number of groups usually provides a closer value to the true number of groups than looking at a histogram, particularly when the group means are well separated.

S6 Appendix.

Full McMC Algorithm. This section gives the full details of the algorithm used to fit our fully joint non-parametric source attribution model. The outline McMC is shown in Algorithm 1. The Dirichlet distributed source effects $\bm{\alpha}_{tl}$ across times $t$ and locations $l$ (Step 1), and the relative prevalences $\bm{r}_{jt}$ across sources $j$ and times $t$ (Step 2) are updated using a constrained adaptive multisite logarithmic Metropolis-Hastings update step for 95% of proposals, and a constrained adaptive multisite Metropolis-Hastings update step for the remainder to prevent the chain getting stuck at very low values [15]. The adaptive algorithm updates the tuning value every 50 updates of the parameter. This is further explained in Algorithm 2.

For the Dirichlet process prior on $\bm{q}$ , a marginal Gibbs sampler is constructed, as described in Algorithm 3. Let $\mathcal{H}$ denote a set of cluster identifiers, with the $n$ -dimensional group assignment vector $\bm{c}$ associating elements of $\bm{q}$ with clusters, such that $c_{i}=h$ assigns $q_{i}$ to cluster $h$ . Furthermore, each cluster $h$ assumes a value $\theta_{h}$ such that $q_{i}=\theta_{c_{i}}$ .

In Step 1 of Algorithm 3, conjugacy between the Gamma-distributed base distribution $P_{0}$ and the Poisson data likelihood permits the calculation of Multinomial conditional posteriors for elements of $\bm{c}$ arising from the Chinese Restaurant Process construction. Here, the conditional posterior probability of type $i$ being assigned to group $h$ is as shown in Algorithm 3, with conjugacy permitting marginalisation with respect to the base distribution in order to calculate the probability of being assigned to a new group $h^{\star}$

[TABLE]

with $y^{\star}_{i}=\sum_{t,l}y_{itl}$ and $\lambda^{\star}_{i}=\sum_{t,l}\bm{\alpha}_{tl}^{T}(\bm{r}_{it}\odot k_{t})$

If a type is assigned to a new group, the set $\mathcal{H}$ is augmented and a corresponding cluster value is drawn from the posterior of $\theta_{h^{\star}}$ . Conversely, $\mathcal{H}$ is shrunk if a particular group becomes empty.

In Step 2, the group values are drawn from the posterior, conditional on $\bm{c}$ . The algorithm therefore alternates between updating group assignments $\bm{c}$ and group values $\bm{\theta}$ . Hence, it explores the number of groups present, the type effects assigned to each group, and the values of each group.

S7 Appendix.

Posterior correlations and non-identifiability of source attribution.

S8 Appendix.

Worked example showing features of package using simulated data.

In this section, we provide a worked example using simulated data with multiple times and locations for source attribution data generated from the model in Section 2.1 (available in the sourceR data sets). There are two times (1, 2) and two locations (A, B) over which the human cases vary. The data must be in long format, with columns giving the number of human cases for each type, a column for each of the sources giving the number of positive samples for each type, and columns giving the time, location and type id’s for each observation. Note, the source data is the same for all locations within a time.

The algorithm is run for a total of 500,000 iterations (with a burn in of 10000 iterations and thinning 500). The acceptance rates for all parameters (except those updated using a Gibbs sampler) can be accessed using the my_model$acceptance().

⬇

## source and human case data

data(sim_SA_data)

## prevalences for each source/ time

data(sim_SA_prev)

## true values for the model parameters

data(sim_SA_true)

priors <- list(a_theta = 0.01, b_theta = 0.00001,

               a_alpha = 1, a_r = 0.1)

my_model <- HaldDP$new(data = sim_SA_data, k = sim_SA_prev,

                   priors = priors, a_q = 0.1)

Fitting parameters for the McMC are be passed using

⬇

my_model$fit_params(n_iter = 1000, burn_in = 2000, thin = 500)

The model is then run using

⬇

set.seed(59623)

my_model$update()

Trace and autocorrelation plots for the parameters (Figure 8) indicate that the Markov chain is mixing well and has converged, and that thinning by 500 is adequate. The following R code demonstrates how to access and plot the marginal posteriors for some parameters.

⬇

## Plot the marginal posterior for source effect 2, time 1, location A

plot(my_model$extract(params = "alpha", times = "1", locations = "B",

 sources = "Source4")$alpha, type="l")

## Plot the marginal posterior for the type effect 21

plot(my_model $**extract**(params = "q", types = "21")$ q, type="l")

## Plot the marginal posterior for the relative prevalence of

## source effect 5, type 17, at time 2

plot(my_model$extract(params = "r", times = "2", sources = "Source5",

 types = "17")$r, type="l")

## Plot the marginal posterior for lambda_j source 1, time 1, location A

plot(my_model$extract(params = "lambda_j", times = "1", locations = "A",

 sources = "Source1")$lambda_j, type = "l")

## Plot the marginal posterior for lambda_i 10, time 2, location B

plot(my_model$extract(params = "lambda_i", times = "2", locations = "B",

 types = "10")$lambda_i, type="l")

Medians and credible intervals can be obtained for each parameter using res $summary(). The marginal density plots of the number of cases attributed to each source at each time and location ($ \lambda_{jtl} $) show that the true values (shown by a red horizontal line on the graph) are being estimated well (Figure [9](#S6.F9)). The violin plots of the number of cases attributed to each type (residual plot) for$ \lambda_{i}$ (Figure 10) shows that the model is fitting well. The heatmap shows the grouping of the type effects (Figure 11) computed using a dissimilarity matrix from the clustering output of the McMC. The coloured bar under the dendrogram gives the correct grouping from the simulated data. This shows that the majority of types have been classified correctly.

7 Acknowledgments

The research for this paper was financially supported by the Ministry for Primary Industries, the Institute of Fundamental Sciences (Massey University), the mEpiLab (Massey University), and CHICAS (Lancaster University). We acknowledge the following individuals and groups: mEpiLab (Massey University), MidCentral Public Health Services and Petra Mullner (for the Manawatu data set) and Geoff Jones (for his helpful input on automatic clustering methods).

Bibliography25

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Havelaar AH, Kirk MD, Torgerson PR, Gibb HJ, Hald T, Lake RJ, et al. World Health Organization Global Estimates and Regional Comparisons of the Burden of Foodborne Disease in 2010. P Lo S Med. 2015;12(12):1–23. doi:10.1371/journal.pmed.1001923.
2[2] World Health Organization. WHO estimates of the global burden of foodborne diseases: foodborne disease burden epidemiology reference group 2007-2015; 2015. available on the WHO web site (www.who.int) or can be purchased from WHO Press, World Health Organization, 20 Avenue Appia, 1211 Geneva 27, Switzerland. Available from: http://apps.who.int/iris/bitstream/10665/199350/1/9789241565165_eng.pdf?ua=1 .
3[3] Crump JA, Griffin PM, Angulo FJ. Bacterial Contamination of Animal Feed and Its Relationship to Human Foodborne Illness. Clinical Infectious Diseases. 2002;35(7):859–865. doi:10.1086/342885.
4[4] Mullner P, Jones G, Noble A, Spencer S, Hathaway S, French N. Source Attribution of Food Borne Zoonoses in New Zealand: A Modified Hald Model. Risk Analysis. 2009;29(7).
5[5] Urwin R, Maiden M. Multi-locus Sequence Typing: A Tool for Global Epidemiology. Trends in Microbiology. 2003;.
6[6] Dingle K, Colles F, Wareing D, Ure R, Fox A, Bolton F, et al. Multilocus sequence typing system for Campylobacter jejuni. Journal of Clinical Microbiology. 2001;.
7[7] Allos BM, Moore MR, Griffin PM, Tauxe RV. Surveillance for Sporadic Foodborne Disease in the 21st Century: The Food Net Perspective. Clinical Infectious Diseases. 2004;38(Supplement 3):S 115–S 120. doi:10.1086/381577.
8[8] Baker M, Wilson R, Ikram R, Chambers S, Shoemack S, Cook G. Regulation of Chicken Contamination Urgently Needed to Control New Zealand’s Serious Campylobacteriosis Epidemic. The New Zealand Medical Journal. 2006;.