Compound Dirichlet Processes
Arrigo Coen, Beatriz Godínez-Chaparro

Arrigo Coen
Departamento de Matemáticas, Facultad de Ciencias
Universidad Nacional Autónoma de México
México, CDMX, Apartado Postal 20-726, 01000, México
[email protected]
Beatriz Godínez-Chaparro
Departamento de Sistemas Biológicos, División de Ciencias Biológicas y de la Salud,
Universidad Autónoma Metropolitana-Xochimilco,
Mexico City, Mexico,
[email protected]
Corresponding author: Arrigo Coen, Email: [email protected]
Abstract
The compound Poisson process and the Dirichlet process are the pillar structures of Renewal theory and Bayesian nonparametric theory, respectively. Both processes have many useful extensions that meet practitioners' needs to model the particularities of data structures. Accordingly, in this contribution we join their primal ideas to construct the compound Dirichlet process and the compound Dirichlet process mixture. As a consequence, these new processes have a rich structure for modeling the times between events, together with a flexible structure on the arrival variables. These models have a direct Bayesian interpretation of their posterior estimators and are easy to implement. We obtain expressions for the posterior distribution, the unconditional distribution and expected values. In particular, to find these formulas we analyze sums of random variables with Dirichlet process priors. We assess our approach by applying the model to a real data example of a contagious zoonotic disease.
Keywords Bayesian nonparametrics · Renewal theory · Compound Poisson process · Dirichlet process · Random sums
1 Introduction
In this contribution we present two continuous-time processes that are probabilistically constructed through a random sum, using the framework of Bayesian nonparametric models. As a consequence of their construction, these processes can be used to model renewal phenomena. Examples of Bayesian nonparametric models applied to renewal phenomena are presented in Bulla and Muliere (2007); Frees (1986); Xiao et al. (2015). One of the principal reasons to combine these methodologies is that in many cases renewal phenomena have complex random structures. Therefore, for this type of analysis it can be better to let the data speak for itself. By avoiding a fixed parametric form, important hidden structures can be unveiled, whereas a parametric model may conceal them. Although the combination of these branches is not new, the use of the Dirichlet process presented here is a novel technique.
For many applied statisticians, random sum models are everyday tools. An advantage of these models is that they allow us to examine the data as the contribution of simpler parts, which improves calculations and predictions. To choose a random sum model there are three key probability concepts to keep in mind: 1) the law governing the number of terms to add; 2) the dependence among the terms; and 3) the interactions between 1) and 2). For instance, in Rolski et al. (1999) these concepts are applied to model the behavior of insurance claims by taking into account: 1) how many insurance claims are received in a fixed period of time; 2) the dependence among claim sizes; and 3) the connections between the number of claims and their sizes (see also Gebizlioglu and Eryilmaz (2018); Coen and Mena (2015)). Other fields where random sum models are currently applied are multivariate analysis, to model daily stock values Nadarajah and Chan (2019); Bayesian nonparametric theory, to estimate the total number of species in a population Zhang (2005); and finance, to estimate the skewed behavior of a time series Nadarajah and Li (2017).
The classical theory of renewal processes focuses on the analysis of counting processes whose interarrival times are independent and identically distributed (i.i.d.). The most remarkable example of a renewal process is the Poisson process, whose interarrival times are i.i.d. exponential variables Kingman (1992). By allowing some interaction among the variables, this model has been generalized to resemble more intricate phenomena. Examples of these generalizations are the Cox process, the non-homogeneous Poisson process and the Markov and semi-Markov renewal models. A thorough analysis of these models is presented in Kovalenko and Pegg (1997).
To define our model, we use one of the most influential Bayesian nonparametric structures, the Dirichlet process (DP) prior Ferguson (1973). The effectiveness of the DP is shown by its successful application in many statistical analyses. As pointed out in Ferguson (1973), two desirable properties of a prior distribution for nonparametric problems are a large support and a manageable posterior distribution. The DP prior satisfies both properties remarkably well, with a clear interpretation of its parameters. Moreover, the many representations of the Dirichlet process have given rise to diverse important Bayesian nonparametric contributions: neutral to the right processes Doksum (1974), normalized log-Gaussian processes Lenk (1988), stick breaking priors Lijoi et al. (2005); Ishwaran and James (2001), species sampling models Pitman (1996), Poisson-Kingman models Pitman (2003) and normalized random measures with independent increments Prünster et al. (2003), to mention a few. Each of these models generalizes an aspect of the Dirichlet process in some direction, thus obtaining more modeling flexibility with respect to some specific feature of the data.
In this study we apply the Dirichlet process as a mechanism to control the probability structure of a random sum stochastic process. Under this framework, we inherit the flexibility of the DP to resemble the behavior of the data and have a wide spectrum of probability structures to use as prior beliefs. We also gain an interpretation of the clustering structure of the renewals and an efficient posterior simulation algorithm. In fact, these models allow us to analyze the cluster behavior of the time and space components, induced by the discrete random measures assigned to each component.
2 Compound Dirichlet Processes
In this section we define the stochastic structure of the compound Dirichlet process and the compound Dirichlet process mixture, and show some of their appealing modeling properties. These processes can be applied to phenomena where a stochastic time component defines the arrivals of random variables. Under this framework, we establish one dependence structure among the arrivals and another among the events of the arrivals, while keeping the two independent of each other.
Let us first consider a sequence of positive random variables , and define its renewal process as
[TABLE]
where the random variables are interpreted as the interarrival times between events of the phenomenon under study. Then, is the number of events that take place before time . The general theory of exchangeable renewal models is studied in Coen et al. (2018); here, however, we analyze the particular implications of the DP prior framework. To this end, similarly to the compound Poisson process, we focus our analysis on the random process given by
[TABLE]
where are independent of . In this construction we place two exchangeable structures: one over the events and one over their interarrival times . The advantage of assuming this symmetric structure is that it lets us model various dependence behaviors and, at the same time, allows the analysis of cluster formation among the variables Aldous (1985). To define these dependence structures we use Dirichlet process priors. The DP prior model is defined as
[TABLE]
where denotes a Dirichlet process with precision parameter and base distribution . The DP random measure is defined in Ferguson (1973) by the distributional property
[TABLE]
for every measurable partition of the sample space of , where denotes the Dirichlet distribution of -dimension with parameter . An implication of these assumptions is that the joint distribution of can be factorized using the generalized Pólya urn scheme Blackwell and MacQueen (1973), i.e. for any ,
[TABLE]
where denotes the Dirac measure at . This last expression can be interpreted as follows: given , has probability of being a new -distributed random variable, independent of the past values, and probability of repeating a previously seen value. This also implies that the random variables are exchangeable, meaning that the joint distribution of is equal to the distribution of , for any permutation of Aldous (1985). It follows that the variables are conditionally independent and identically distributed , with constant correlation given by
[TABLE]
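For reference, the urn scheme and the resulting correlation can be written out in standard notation. The symbols below are assumptions made for this sketch and may differ from the original displays: $Y_1, Y_2, \dots$ are the exchangeable variables, $\alpha$ is the precision parameter, and $G_0$ is the base distribution with finite variance $\sigma^2$.

```latex
\begin{align*}
Y_{n+1} \mid Y_1,\dots,Y_n &\sim \frac{\alpha}{\alpha+n}\,G_0
    + \frac{1}{\alpha+n}\sum_{i=1}^{n}\delta_{Y_i},\\
\operatorname{Cov}(Y_i, Y_j) &= \frac{1}{\alpha+1}\,\sigma^2,
\qquad
\operatorname{Corr}(Y_i, Y_j) = \frac{1}{\alpha+1}, \quad i \neq j.
\end{align*}
```

The covariance follows because the second variable copies the first with probability $1/(\alpha+1)$ and is otherwise an independent draw from $G_0$, so a small $\alpha$ yields strongly tied values and a large $\alpha$ approaches the i.i.d. case.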
A minor difficulty in working directly with DP priors is that their realizations are almost surely discrete random measures. Many works overcome this difficulty by using a DP as a prior over the distribution of an extra layer of parameters Ferguson (1983); Lo (1984); Escobar and West (1995a). In fact, in many cases these parameters make the description simpler and have a direct interpretation. These models are known as Dirichlet process mixture models, and they are defined by the structure
[TABLE]
where denotes a member of a fixed family of distributions parametrized by . Even though this approach adds a hidden extra layer of parameters, there are many Gibbs sampling methods to handle this issue Ishwaran and James (2001); Maceachern and Müller (1998); Neal (2000). Furthermore, the discreteness of the DP random measures allows one to study the clustering properties of the data Escobar (1994); Görür and Rasmussen (2010); Escobar and West (1995b). With this notation we can establish a nonparametric structure on (1).
Definition 1.
A continuous time stochastic process given by (1) is a compound Dirichlet process (CDP) if it follows the stochastic structure
[TABLE]
where is independent of . To simplify the notation, we use .
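Definition 1 can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes an exponential base distribution for the interarrival times, a standard Gaussian base distribution for the arrival variables, and arbitrary precision values, and the names `urn_draw`, `simulate_cdp` and `cdp_value` are hypothetical. Both sequences are generated with the Pólya urn, independently of each other.

```python
import random


def urn_draw(history, alpha, base_sampler, rng):
    """Next value of a Polya urn sequence given the previous draws."""
    n = len(history)
    # New value from the base distribution with probability alpha / (alpha + n)
    if rng.random() < alpha / (alpha + n):
        return base_sampler(rng)
    # Otherwise repeat one of the n previous draws, chosen uniformly
    return rng.choice(history)


def simulate_cdp(t_max, alpha_time, alpha_jump, time_base, jump_base, rng):
    """Simulate event times and arrival variables of a CDP path on [0, t_max]."""
    gaps, jumps, event_times = [], [], []
    clock = 0.0
    while True:
        gap = urn_draw(gaps, alpha_time, time_base, rng)
        gaps.append(gap)
        clock += gap
        if clock > t_max:
            break  # the next event falls outside the observation window
        event_times.append(clock)
        jumps.append(urn_draw(jumps, alpha_jump, jump_base, rng))
    return event_times, jumps


def cdp_value(event_times, jumps, t):
    """Value of the random sum at time t: arrivals accumulated up to t."""
    return sum(y for s, y in zip(event_times, jumps) if s <= t)


rng = random.Random(42)
event_times, jumps = simulate_cdp(
    t_max=10.0,
    alpha_time=2.0,                           # assumed precision for the times
    alpha_jump=2.0,                           # assumed precision for the arrivals
    time_base=lambda r: r.expovariate(1.0),   # assumed base: Exp(1)
    jump_base=lambda r: r.gauss(0.0, 1.0),    # assumed base: N(0, 1)
    rng=rng,
)
print(len(event_times), cdp_value(event_times, jumps, 10.0))
```

Because the urn repeats earlier draws, a simulated path typically shows tied interarrival times and tied arrival values, which is the repetition effect discussed in the text.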
It is important to notice that, as in the DP framework, the CDP model has a positive probability of repeating previously seen values. In the classical DP model, as the expected number of distinct terms in grows as Korwar and Hollander (1973); this rate is smaller than . Consequently, the CDP has a positive probability of repeated increments. In other words, there is a positive probability that the increment is equal to the increment , for any positive real numbers and . Nevertheless, the rate of repeated values is even smaller than in the DP framework: the addition operation reduces the number of repeated values, since selecting different summands gives an extra possibility of different totals. In order to reduce the problem of repeated values and to study the clustering structure of the random variables, we introduce the next definition.
Definition 2.
A continuous time stochastic process given by (1) is a compound Dirichlet process mixture (CDPM) if it follows the stochastic structure
[TABLE]
where is independent of . We use to denote this process, where and represent parametric families of distributions.
Under Definitions 1 and 2 we have a rich structure to consider the time evolution of the accumulation of random variables. For instance, we could use these models to analyze the claims process of an insurance company, with the following methodology to analyze the Capital Requirements of the company at year Linder and Ronkainen (2004). First, we use the data of years and to fit the base distributions and . Then, we use the data of years , and as a sample to obtain the posterior distribution of the CDPM model. Finally, under Bayesian methodologies we obtain point estimators and credible intervals, and carry out hypothesis tests on the Capital Requirements. To address these and other inference questions, the next section presents statistical implications of the CDP and CDPM models.
2.1 Some properties and results of CDP and CDPM
Let us continue with some properties of the CDP and CDPM models. These results are stated under the CDP framework; however, their implications for the CDPM models are direct. The results are arranged in order to calculate, or at least approximate, the posterior distribution of .
Theorem 1.
If , then for any
[TABLE]
where ,
[TABLE]
is the -convolution of the distribution , and is the convolution of these convolutions.
The proof of Theorem 1 is a direct consequence of the next lemma.
Lemma 1.
If and , we define by
[TABLE]
Then
[TABLE]
for every , where is the -convolution of the distribution . Moreover, if we define
[TABLE]
then for .
Proof.
For the sake of completeness, we reproduce the proof of this result given in Coen et al. (2018). Let be the random vector indicating the repeated values in , under the following scheme: there are values that appear exactly once, values that appear exactly twice, and so on. Then, conditioning on the distribution of can be written as
[TABLE]
The conditional distribution of given is equal to the convolution of independent variables with distribution , convolved with the convolution of variables distributed , and so on. We condition on because this eliminates the repeated values of , which allows us to consider the convolution of independent variables. Consequently, we define because given the repeated values of s we need to consider the probabilities , for and . Thus, we obtain
[TABLE]
The probabilities of are given by the Ewens sampling formula Ewens (1972), as
[TABLE]
by applying the generalized Pólya urn scheme (3) over the possible different values of . Finally, the equality is a direct consequence of the definition of . ∎
According to the last results, the distribution of the CDP can be expressed as an infinite sum. Although we do not present the distributions of and directly, they can be expressed in terms of the distribution of , using the second statement of Lemma 1. Since the Dirichlet process tends to concentrate most of its mass on a few atoms, the series in (4) converges quickly. This allows us to approximate the distribution of in two ways: we can truncate the sum (4) to a fixed finite number of terms, or we can fix a quantity and keep only the terms with . In both cases the approximation error is controlled. Also, our computational experiments show that both approximations are stable.
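The Ewens sampling formula used in the proof can be checked numerically for small samples. The following sketch is illustrative only; the function names and the values of `alpha` and `n` are hypothetical. It enumerates the integer partitions of `n`, assigns each its Ewens probability, and compares the expected number of distinct values with the exact urn expression, the sum of alpha / (alpha + i) for i = 0, ..., n - 1, which grows logarithmically in n.

```python
from math import factorial, prod


def integer_partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing tuples of block sizes."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in integer_partitions(n - first, first):
            yield (first,) + rest


def ewens_probability(blocks, alpha):
    """Ewens sampling formula for a partition described by its block sizes."""
    n = sum(blocks)
    # multiplicities[j] = number of blocks of size j
    multiplicities = {}
    for size in blocks:
        multiplicities[size] = multiplicities.get(size, 0) + 1
    rising = prod(alpha + i for i in range(n))  # alpha (alpha+1) ... (alpha+n-1)
    weight = factorial(n) / rising
    for size, m in multiplicities.items():
        weight *= (alpha / size) ** m / factorial(m)
    return weight


def expected_distinct(n, alpha):
    """Exact expected number of distinct values among n urn draws."""
    return sum(alpha / (alpha + i) for i in range(n))


alpha, n = 1.5, 6  # hypothetical values for the check
probs = {p: ewens_probability(p, alpha) for p in integer_partitions(n)}
total = sum(probs.values())
mean_blocks = sum(len(p) * w for p, w in probs.items())
print(total, mean_blocks, expected_distinct(n, alpha))
```

Enumerating partitions is only feasible for small `n`; for larger samples the truncation ideas discussed after the proof become necessary.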
Proposition 1.
Given , let and , for , then
[TABLE]
Proof.
The expression for follows by conditioning on and using the linearity of expectation. To obtain the expression for one must consider the possible repeated values of the exchangeable sequence . From now on, let us assume that is a continuous distribution. Then, conditioning on , we obtain
[TABLE]
The last equality follows from conditioning on , and the fact that . This last expression gives the result for . Likewise, the result for is obtained by conditioning on the possible repetitions of , and applying , and . Finally, in the discrete case we only need to ensure that we condition only on cases where the variables are equal as a consequence of the Pólya urn’s repetitions, and the formulas follow. ∎
Proposition 2.
Under the notation of Lemma 1, the moment generating function of is given by
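The moment calculation in this proof can be made explicit under assumed notation, which may differ from the proposition's own display: $Y_1,\dots,Y_n$ follow a DP prior with precision $\alpha$ and a base distribution with mean $\mu$ and variance $\sigma^2$, $S_n = Y_1 + \dots + Y_n$, and $N$ is a counting variable independent of the $Y_i$. Using the standard fact $\operatorname{Cov}(Y_i, Y_j) = \sigma^2/(\alpha+1)$ for $i \neq j$ and the law of total variance:

```latex
\begin{align*}
\mathbb{E}[S_n] &= n\mu,\\
\operatorname{Var}(S_n) &= n\sigma^2 + n(n-1)\,\frac{\sigma^2}{\alpha+1}
    = \frac{n(\alpha+n)}{\alpha+1}\,\sigma^2,\\
\mathbb{E}[S_N] &= \mu\,\mathbb{E}[N],\\
\operatorname{Var}(S_N) &= \frac{\sigma^2}{\alpha+1}\,\mathbb{E}\bigl[N(\alpha+N)\bigr]
    + \mu^2\operatorname{Var}(N).
\end{align*}
```

As $\alpha \to \infty$ the variance of $S_n$ tends to the i.i.d. value $n\sigma^2$, while as $\alpha \to 0$ it tends to $n^2\sigma^2$, the value for $n$ copies of a single draw.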
[TABLE]
where denotes the moment generating function of .
Proof.
Conditioning over the possible partitions we obtain
[TABLE]
where the last equality follows since the expected value of conditioned on is the product of independent random variables equal in distribution to , each repeated times for . This gives (6) when applying (5). ∎
To see an application of (6) let us consider the Gaussian distribution case. For this base distribution, we obtain
[TABLE]
thus, the sum of variables with a DP prior and a Gaussian base measure is a mixture of Gaussian random variables. The next result shows that the CDP is a conjugate model.
Proposition 3.
If is a random sample of , then
[TABLE]
Proof.
The proof is immediate by applying the conjugate property of the Dirichlet process prior and the independence of with . ∎
2.2 Two examples of flexible base measures for CDP and CDPM
Let us continue by presenting two examples of distribution families for the base distribution where the convolution of (4) simplifies. These families are the Gaussian and the phase-type distributions. It is important to notice that both have a wide support, which allows them to approximate other distributions. First, in the case of Gaussian distributions defined by , we obtain that , and so
[TABLE]
This implies that the density of is given by
[TABLE]
Rates of convergence of Gaussian mixtures to the true underlying distribution are presented in Ghosal and van der Vaart (2007); Tokdar (2006). As a consequence of this convergence we can use the Gaussian model in cases with poor prior information.
For the second example we present the analytic expression for in the case of phase-type distributions. An excellent account of the theory of phase-type and matrix-exponential distributions is presented in Bladt and Nielsen (2017). An important property of this family is that it is dense in the set of distributions of positive random variables; i.e., any positive random variable can be approximated arbitrarily well by a phase-type distribution. We will denote by , a random variable with phase-type density given by
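As a small worked instance of the Gaussian case, consider the sum of two variables. The notation is assumed for illustration: $Y_1, Y_2$ have a DP prior with precision $\alpha$ and base distribution $\mathrm{N}(\mu, \sigma^2)$, and $S_2 = Y_1 + Y_2$. With probability $\alpha/(\alpha+1)$ the two values are distinct and independent, and with probability $1/(\alpha+1)$ they coincide, so that $S_2 = 2Y_1$.

```latex
\begin{align*}
S_2 &\sim \frac{\alpha}{\alpha+1}\,\mathrm{N}\bigl(2\mu,\,2\sigma^2\bigr)
    + \frac{1}{\alpha+1}\,\mathrm{N}\bigl(2\mu,\,4\sigma^2\bigr),\\
M_{S_2}(s) &= \frac{\alpha}{\alpha+1}\,e^{2\mu s + \sigma^2 s^2}
    + \frac{1}{\alpha+1}\,e^{2\mu s + 2\sigma^2 s^2}.
\end{align*}
```

Both components share the same mean; ties only inflate the variance, which gives the overall variance $2(\alpha+2)\sigma^2/(\alpha+1)$.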
[TABLE]
where is a probability row vector, T a subgenerator matrix of dimension , and , with 1 the column vector of ones of length . Then, under this notation, if , we obtain that . By applying the convolution property of phase-type variables:
[TABLE]
for and , with . Thus,
[TABLE]
where is a matrix of dimension , given by
[TABLE]
This implies
[TABLE]
where is a matrix of dimension , given by
[TABLE]
We eliminate from the rows where . This implies that the density of is given by
[TABLE]
where . Thus, the sum of variables with a DP prior and a phase-type base distribution is a mixture of phase-type random variables.
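The simplest phase-type example, given here under assumed notation, is an exponential base distribution with rate $\lambda$, which has a single phase. For two variables $Y_1, Y_2$ with a DP prior of precision $\alpha$ and $S_2 = Y_1 + Y_2$, the distinct case yields an Erlang distribution with two phases and the tied case yields $2Y_1 \sim \mathrm{Exp}(\lambda/2)$:

```latex
\begin{align*}
f_{S_2}(x) &= \frac{\alpha}{\alpha+1}\,\lambda^2 x\, e^{-\lambda x}
    + \frac{1}{\alpha+1}\,\frac{\lambda}{2}\, e^{-\lambda x/2}, \qquad x > 0,\\
\text{Erlang}(2,\lambda) &= \mathrm{PH}\!\left((1,\,0),\;
    \begin{pmatrix} -\lambda & \lambda \\ 0 & -\lambda \end{pmatrix}\right),
\qquad
\mathrm{Exp}(\lambda/2) = \mathrm{PH}\bigl((1),\; (-\lambda/2)\bigr).
\end{align*}
```

Both mixture components are again phase-type, the first by concatenating phases and the second by rescaling the subgenerator.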
3 An application to dog rabies
Rabies is one of the most severe zoonotic diseases. It is caused by a rhabdovirus in the genus Lyssavirus and infects many mammalian species. It can be transmitted through infected saliva, and it is almost always fatal following the onset of clinical symptoms Webber (2009). In up to 99% of cases, domestic dogs are responsible for rabies virus transmission to humans. In Africa, an estimated 21,476 human deaths occur each year due to dog-mediated rabies, 36.4% of global human deaths World Health Organization (2018). To intervene effectively against zoonotic infections, it is important to recognize whether infected individuals stem from a single outbreak sustained by local transmission, or from repeated introductions Cori et al. (2018); Ypma et al. (2013).
Probability models and methods that are commonly applied to epidemiological spread include coupling structures, random graphs, the EM algorithm and MCMC methods. For a recent account of the theory we refer the reader to Andersson and Britton (2000) and Brauer et al. (2008). These models use dependence structures to represent the infection rate as a function of the infected individuals in the vicinity. Another important quality of these models is that they admit censored data, since often the epidemic process is only partly observed. These two properties are also found in the CDPM model. The spatial vicinity is handled directly by the posterior distribution: areas where cases of the disease are found have a higher probability of new cases. Censored data can be managed if the censoring is in the time or in the spatial component.
Cori et al. (2018) analyze data on dog rabies reported in Bangui, the capital of the Central African Republic, between 6 January 2003 and 6 March 2012. They report 151 rabies cases, with information on report date, spatial coordinates, and genetic sequence of the isolated virus. The data are available in the R package outbreaks. They applied a clustering graph model for each component and extracted the most connected points by pruning. We study these data using the CDPM model under the following assumptions. To model the time component we use an exponential mixture kernel with base distribution, and to model the spatial component we use a Gaussian mixture kernel with Gaussian-Inv-Gamma base distribution. To fit these Dirichlet process mixtures we use Gibbs sampling.
Figure 1 shows that the model is able to capture the density pattern of the time component. This figure compares the data histogram with the posterior density estimator and its 95% credible interval. As pointed out by Cori et al. (2018), the dates of the reports are close. This is reflected in the appearance of only two mixture components in the posterior distributions. Figure 2 presents the spatial cluster behavior of the posterior distribution. Even though some spatial components are missing, we can obtain the posterior estimators of the spatial data; these missing values do not affect our analysis since we are assuming independence between spatial and temporal variables. In comparison with the results of Cori et al. (2018), we obtain almost the same cluster structure.
In this application, the value of the CDPM model lies in the estimation of future rabies outbreaks. Under our framework we obtain the complete probability structure of probable future contagions. We can answer many statistical questions through simulation using the posterior distributions of the CDPM model. For instance, we can find the probability that in the next year the number of cases doubles with respect to the past year. Likewise, we can describe the spatial stochastic mobility of the disease by locating the regions where the disease is most concentrated. Our model allows an early assessment of infectious disease outbreaks, which is fundamental to implementing timely and effective control measures.
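A Gibbs sampler for the time component can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a Gamma(a, b) base distribution on the exponential rate (the paper's base distribution is not specified in this text), uses a collapsed sampler in the style of Neal (2000) in which the rates are integrated out, and runs on synthetic data rather than the rabies data. The names `log_predictive` and `gibbs_sweep` and all parameter values are hypothetical.

```python
import math
import random


def log_predictive(x, count, total, a, b):
    """Log predictive density of x for an exponential kernel whose rate has a
    Gamma(a, b) prior, updated with `count` cluster points summing to `total`."""
    shape, rate = a + count, b + total
    return math.log(shape) + shape * math.log(rate) - (shape + 1) * math.log(rate + x)


def gibbs_sweep(data, labels, alpha, a, b, rng):
    """One collapsed Gibbs sweep over the cluster labels (modified in place)."""
    for i, x in enumerate(data):
        # Sufficient statistics (count, sum) of every cluster without point i
        stats = {}
        for j, (lab, y) in enumerate(zip(labels, data)):
            if j != i:
                count, total = stats.get(lab, (0, 0.0))
                stats[lab] = (count + 1, total + y)
        candidates = list(stats)
        log_w = [
            math.log(stats[k][0]) + log_predictive(x, stats[k][0], stats[k][1], a, b)
            for k in candidates
        ]
        # Option of opening a new cluster, weighted by the precision alpha
        candidates.append(max(labels) + 1)
        log_w.append(math.log(alpha) + log_predictive(x, 0, 0.0, a, b))
        top = max(log_w)
        weights = [math.exp(w - top) for w in log_w]
        labels[i] = rng.choices(candidates, weights=weights)[0]
    return labels


rng = random.Random(1)
# Synthetic interarrival times from two exponential regimes (assumed data)
data = [rng.expovariate(5.0) for _ in range(30)]
data += [rng.expovariate(0.2) for _ in range(30)]
labels = [0] * len(data)
for _ in range(50):
    gibbs_sweep(data, labels, alpha=1.0, a=1.0, b=1.0, rng=rng)
print("clusters in final sweep:", len(set(labels)))
```

Integrating out the rates keeps the state small, at the cost of recomputing cluster statistics for each point; a production sampler would update these statistics incrementally and monitor convergence over many sweeps.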
4 Discussion
We have proposed a simple approach for the statistical analysis of renewal phenomena, which combines ideas from Renewal theory and Bayesian nonparametric theory. The proposal relies on two Dirichlet process priors: one on the times between events and one on the arrival events. The resulting methodology is not computationally demanding and allows us to predict the evolution of renewal phenomena relatively well. Furthermore, it can be applied in cases where the cluster structure is an important factor in the analysis.
The proposed methods perform well in real spatial contexts, showing appealing features that can bring these methodologies closer to practitioners in important scientific fields such as contagion analysis and general spatiotemporal analysis. Other choices of random measures could potentially lead to similar outcomes. These more general classes of priors will be pursued elsewhere.
5 Acknowledgments
The first author is grateful to Prof. Begoña Fernández and Prof. Ramsés Mena for their valuable suggestions on an earlier version of the manuscript. This research was partially supported by a DGAPA Postdoctoral Scholarship.
References
- Aldous [1985] David J. Aldous. Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII – 1983, pages 1–198. Springer, Berlin, Heidelberg, 1985. doi: 10.1007/BFb0099421.
- Andersson and Britton [2000] Håkan Andersson and Tom Britton. Stochastic Epidemic Models and Their Statistical Analysis, volume 151 of Lecture Notes in Statistics. Springer New York, New York, NY, 2000. ISBN 978-0-387-95050-1. doi: 10.1007/978-1-4612-1158-7.
- Blackwell and MacQueen [1973] David Blackwell and James B. MacQueen. Ferguson Distributions Via Polya Urn Schemes. The Annals of Statistics, 1973. ISSN 0090-5364. doi: 10.1214/aos/1176342372.
- Bladt and Nielsen [2017] Mogens Bladt and Bo Friis Nielsen. Matrix-Exponential Distributions in Applied Probability, volume 81 of Probability Theory and Stochastic Modelling. Springer US, Boston, MA, 2017. ISBN 978-1-4939-7047-6. doi: 10.1007/978-1-4939-7049-0.
- Brauer et al. [2008] Fred Brauer, Pauline van den Driessche, and Jianhong Wu, editors. Mathematical Epidemiology, volume 1945 of Lecture Notes in Mathematics. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. ISBN 978-3-540-78910-9. doi: 10.1007/978-3-540-78911-6.
- Bulla and Muliere [2007] Paolo Bulla and Pietro Muliere. Bayesian Nonparametric Estimation for Reinforced Markov Renewal Processes. Statistical Inference for Stochastic Processes, 10(3):283–303, May 2007. ISSN 1387-0874. doi: 10.1007/s11203-006-9000-x.
- Coen and Mena [2015] Arrigo Coen and Ramsés H. Mena. Ruin probabilities for Bayesian exchangeable claims processes. Journal of Statistical Planning and Inference, 166:102–115, Nov 2015. ISSN 0378-3758. doi: 10.1016/J.JSPI.2015.01.005.
- Coen et al. [2018] Arrigo Coen, Luis Gutierrez, and Ramsés H. Mena. Modeling failures times with dependent renewal type models via exchangeability. Submitted, 2018.
