Learning a latent pattern of heterogeneity in the innovation rates of a time series of counts
Helton Graziadei, Hedibert F. Lopes, Paulo C. Marques F

TL;DR
This paper introduces a Bayesian hierarchical semiparametric model that learns latent heterogeneity patterns in count time series, improving probabilistic forecasting, demonstrated through crime data analysis.
Contribution
It presents a novel model that captures heterogeneity in innovation rates of count time series using a Dirichlet process, enhancing forecasting accuracy.
Findings
Model effectively learns latent heterogeneity patterns.
Favorable results in crime data forecasting.
Demonstrates improved probabilistic predictions.
Abstract
We develop a Bayesian hierarchical semiparametric model for phenomena related to time series of counts. The main feature of the model is its capability to learn a latent pattern of heterogeneity in the distribution of the process innovation rates, which are softly clustered through time with the help of a Dirichlet process placed at the top of the model hierarchy. The probabilistic forecasting capabilities of the model are put to test in the analysis of crime data in Pittsburgh, with favorable results.
| Patrol Area | DP-INAR(1) | INAR(1) |
|---|---|---|
| 21 | 1.1395 | 1.1860 |
| 27 | 1.1860 | 1.3488 |
| 25 | 1.2326 | 1.3023 |
| 26 | 1.5116 | 2.0233 |
| 24 | 1.5349 | 1.6512 |
| 44 | 1.7907 | 1.8372 |
| 33 | 1.8140 | 1.9302 |
| 56 | 1.9302 | 2.0930 |
| 16 | 2.0000 | 2.0930 |
| 22 | 2.1163 | 2.2791 |
| 17 | 2.2558 | 2.2791 |
| 47 | 2.2558 | 2.3023 |
| 58 | 2.5116 | 2.9767 |
| 14 | 2.5349 | 2.5814 |
| 54 | 2.5349 | 2.8837 |
| 46 | 2.6279 | 2.7442 |
| 15 | 2.7209 | 2.7907 |
| 23 | 3.2093 | 3.3023 |
| 31 | 3.4419 | 3.4884 |
| 12 | 3.5116 | 3.9070 |
| 28 | 0.8372 | 0.8140 |
| 43 | 2.1861 | 2.1628 |
| 41 | 2.3953 | 2.3721 |
| 13 | 2.6977 | 2.6744 |
| 53 | 2.8837 | 2.8372 |
| 51 | 2.9302 | 2.8605 |
| 32 | 3.4884 | 3.4419 |
| 34 | 3.6744 | 3.5814 |
| 52 | 3.9302 | 3.8140 |
| 55 | 4.8837 | 4.5116 |
| Patrol Area | INAR(1) | DP-INAR(1) |
|---|---|---|
| 11 | 1.1429 | 1.1667 |
| 21 | 1.1667 | 1.3571 |
| 27 | 1.3095 | 1.3810 |
| 24 | 1.7381 | 1.9524 |
| 44 | 1.8333 | 1.8810 |
| 33 | 1.9524 | 2.2381 |
| 41 | 2.2619 | 2.3095 |
| 26 | 2.2857 | 2.5714 |
| 56 | 2.3095 | 2.5476 |
| 22 | 2.3333 | 2.5714 |
| 13 | 2.6190 | 2.8095 |
| 15 | 2.6667 | 2.8095 |
| 51 | 2.7857 | 2.8095 |
| 54 | 2.9286 | 3.4524 |
| 58 | 2.9762 | 3.5238 |
| 32 | 3.7143 | 3.7619 |
| 12 | 3.9286 | 4.4286 |
| 25 | 1.2619 | 1.2381 |
| 17 | 2.2143 | 2.1429 |
| 47 | 2.2857 | 2.2619 |
| 14 | 2.4048 | 2.3571 |
| 46 | 2.7143 | 2.5952 |
| 29 | 2.9762 | 2.9524 |
| 42 | 3.3571 | 3.3333 |
| 31 | 3.6905 | 3.6667 |
| 34 | 4.0000 | 3.8810 |
| 55 | 4.4762 | 4.0000 |
| 52 | 4.1667 | 4.0714 |
| Patrol Area | DP-INAR(1) | INAR(1) |
|---|---|---|
| 11 | 1.1463 | 1.1951 |
| 21 | 1.1707 | 1.4146 |
| 25 | 1.2683 | 1.2927 |
| 27 | 1.3659 | 1.4146 |
| 24 | 1.8293 | 1.9512 |
| 33 | 2.1220 | 2.3171 |
| 17 | 2.1707 | 2.1951 |
| 22 | 2.3171 | 2.5610 |
| 56 | 2.3415 | 2.6341 |
| 26 | 2.4390 | 2.7073 |
| 13 | 2.6585 | 2.9024 |
| 15 | 2.7561 | 2.9024 |
| 53 | 2.9024 | 2.9756 |
| 58 | 3.0000 | 3.6341 |
| 54 | 3.0244 | 3.3902 |
| 23 | 3.4146 | 3.5366 |
| 31 | 3.5610 | 3.5854 |
| 32 | 3.7805 | 3.9024 |
| 12 | 4.0732 | 4.4634 |
| 44 | 1.9268 | 1.9024 |
| 43 | 2.3171 | 2.2683 |
| 14 | 2.5610 | 2.4878 |
| 45 | 2.6098 | 2.5854 |
| 46 | 2.6098 | 2.5366 |
| 52 | 4.0244 | 3.9268 |
| 34 | 4.1707 | 4.1463 |
| 55 | 4.2927 | 4.1951 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBayesian Methods and Mixture Models · Statistical Methods and Inference · Stochastic processes and statistical mechanics
Learning a latent pattern of heterogeneity in the
innovation rates of a time series of counts
Helton Graziadei Hedibert F. Lopes Paulo C. Marques F.
USP - São Paulo Insper - São Paulo Insper - São Paulo
(July 2019)
Abstract
We develop a Bayesian hierarchical semiparametric model for phenomena related to time series of counts. The main feature of the model is its capability to learn a latent pattern of heterogeneity in the distribution of the process innovation rates, which are softly clustered through time with the help of a Dirichlet process placed at the top of the model hierarchy. The probabilistic forecasting capabilities of the model are put to test in the analysis of crime data in Pittsburgh, with favorable results.
1 Introduction
Time series of counts are associated with a multiplicity of phenomena in fields as diverse as epidemiology, econometrics, finance, environmental studies, and public policy [1]. In this paper, we develop a model for this kind of data, taking into account the possible existence of heterogeneities in the process distribution as it evolves through time.
In a general setting, we consider a Markovian process, for which the current count is modeled as some functional of the count at the previous epoch, plus a stochastic innovation, whose expectation may be specific to the current epoch. Modeling these unobservable innovation rates hierarchically in a suitable way, we can learn a latent pattern of heterogeneity in their distribution, and this information can be incorporated in the forecasting of future counts.
Our investigation implements this general idea in a particular setting built from two main components. Initially, we generalize the first-order integer autoregressive model (INAR(1) model hereafter), introduced in the seminal papers of McKenzie [2] and Al-Osh and Alzaid [3], allowing for different values of the innovation rates at different times. Subsequently, this generalized model is extended hierarchically, with the help of a Dirichlet process [4]. This gives us a semiparametric model, which, due to the properties of the Dirichlet process, is capable of clustering the values of the innovation rates through time, based on the information contained in the observed counts, thereby allowing us to identify different innovation regimes in the time evolution of the process. Forecasting within this probabilistic model is made straightforwardly through the appropriate posterior predictive distributions.
The paper is organized as follows. In Section 2, we generalize the original INAR(1) model, allowing for distinct innovation rates at different epochs of the process. In Section 3, this generalized INAR(1) model is data augmented, leading to a conditional specification of the model which enables the derivation of simple forms for the model parameters and latent variables full conditional distributions. The necessary Dirichlet process definitions and properties are briefly reviewed in Section 4. In Section 5, we set up the DP-INAR(1) model introducing a Dirichlet process at the top of the hierarchy developed in Section 2. After the forms of the prior distributions have been specified, we derive simple closed forms for the full conditional distributions of the model parameters and latent variables. We carefully consider the choice of prior parameters in Section 6. In Section 7, we show how to use the DP-INAR(1) model to do Bayesian forecasting. In Section 8, we put all the analytical results to work in the forecasting of crime data in Pittsburgh, US. In this application, the DP-INAR(1) model outperforms the original INAR(1) model in the majority of the patrol areas.
2 Generalized INAR(1) model
We begin by generalizing the original INAR(1) model of McKenzie [2] and Al-Osh and Alzaid [3] as follows.
Let be an integer-valued time series, and let the innovations , given positive parameters , be a sequence of conditionally independent random variables. Given a parameter , let be a family of conditionally independent and identically distributed random variables. Furthermore, given all the parameters, assume that the innovations and the family are conditionally independent. The generalized INAR(1) model is defined by the functional relation
[TABLE]
for , in which denotes the binomial thinning operator, defined by , if , and , if . In the homogeneous case, when all the ’s are assumed to be equal, we recover the original INAR(1) model.
This model can be interpreted as specifying a birth-and-death process, in which, at epoch , the number of cases is equal to the new cases plus the cases that survived from the previous epoch; the role of the binomial thinning operator being to remove a random number of the cases present at the previous epoch .
Let denote the values of an observed time series. For simplicity, we assume that with probability one. Since the process is Markovian, the joint distribution of , given parameters and , can be factored as
[TABLE]
Since, with probability one, and , by the law of total probability and the definition of the generalized INAR(1) model we have that
[TABLE]
Hence, the generalized INAR(1) model likelihood function is given by
[TABLE]
In the next section, we show how the introduction of certain latent (unobservable) random variables allows us to specify the generalized INAR(1) model in terms of a set of conditional distributions. This alternative representation leads to a factorization of the model joint distribution which is the key element propelling our Monte Carlo simulations.
3 Data augmentation
In the generalized INAR(1) model, suppose that, in addition to the values of the counts , we could observe the values of the maturations . The ’s would tell us the number of cases that matured (survived) from the previous epoch, breaking down into two parcels: maturations plus innovations.
This is an example of data augmentation [5, 6], in which the introduction of the unobservable maturations, with suitable conditional distributions, factors the model into more manageable pieces. Within this data augmentation scheme, we postulate that
[TABLE]
and
[TABLE]
in which denotes the indicator function of the set , defined by , if , and , if .
Using the law of total probability and the product rule, we have that
[TABLE]
in which, following the data augmentation scheme, we took advantage of the appropriate conditional independences.
Since
[TABLE]
comparing the expression above for with the results in the previous section, we come to the conclusion that this is a valid data augmentation scheme, since it induces the same generalized INAR(1) model likelihood function.
In the next section, we recollect the main definitions and results related to the Dirichlet Process which are necessary to build-up our semiparametric hierarchical model. The data augmentation scheme developed above will come in handy in the derivation of the full conditional distributions of the complete model.
4 The Dirichlet process
Suppose that we represent our uncertainties about quantities assuming values in a sampling space , with sigma-field , by means of an underlying probability space .
A mapping is a random probability measure if is a probability measure over , for every , and is a random variable, for each .
Ferguson [4] defined a random probability measure descriptively as follows. Let be a finite nonzero measure over and postulate that for each -measurable partition of the random vector has the ordinary Dirichlet distribution with parameters . In this case, we say that is a Dirichlet process with base measure , and use the notation . Ferguson proved that is a properly defined random process in the sense of Kolmogorov’s consistency theorem.
Defining the concentration parameter , and the base probability measure by , it follows from the usual properties of the Dirichlet distribution that and , for every . Therefore, is centered on , and controls the concentration of around . In terms of the concentration parameter and the base probability measure, we write .
Inference with the Dirichlet process is tractable. In particular, Ferguson proved that the Dirichlet process is closed under sampling: if are conditionally independent and identically distributed, given , such that , for every in , then
[TABLE]
Notice that, using the law of total expectation, we have
[TABLE]
almost surely, for every in , in which the second equality follows from the conditional independence of the ’s. Hence, the posterior predictive distribution is
[TABLE]
This expression of the posterior predictive distribution unleashes important features of the Dirichlet process, thereby showing how it can be used as a modeling tool. In particular, it defines a data generating process known as the Pólya-Blackwell-MacQueen urn [7]. If we imagine the sequential generation of the ’s, for , we see that a value is generated anew from with probability proportional to , or we repeat one the previously generated values with probability proportional to its multiplicity. This shows that, almost surely, realizations of a Dirichlet process are discrete probability measures, maybe with denumerably infinite support, depending on the nature of . Also, this data generating process associated with the Pólya-Blackwell-MacQueen urn implies that the ’s are clustered, which is the main feature of the Dirichlet process that we rely on to build our semiparametric model. Antoniak [8] derived the conditional distribution of the number of distinct ’s, that is, the number of clusters , given the concentration parameter , as
[TABLE]
in which denotes the unsigned Stirling number of the first kind.
In the next section, we place a Dirichlet process at the top of the hierarchy of the generalized INAR(1) model, completing the specification of our semiparametric model, thereby being able to represent our uncertainty about the values of the unobservable innovation rates ’s, given the information contained in the observed counts. In doing so, we benefit from the clustering properties of the Dirichlet process described above, identifying different regimes for the innovation rates as the process evolves through time.
5 DP-INAR(1) model
The DP-INAR(1) model completes the generalized INAR(1) model defined in Section 2, placing a Dirichlet process at the top of the hierarchy. Formally, we model the innovation rates , given , as conditionally independent and identically distributed, with , for every Borel set . The prior distributions for and are and , respectively. The base probability measure is a distribution. In Section 6, we discuss in detail the choice of prior parameters.
Figure 1 displays a graphical representation of the DP-INAR(1) model. In the graph, absence of an arrow connecting two random objects means that they are conditionally independent given their parents (see [9] for a witful discussion of graphical models).
Our next step is to derive the full conditional distributions for all latent variables and model parameters. For convenience, we adopt a simplified notation in the following derivations, using the same letter to denote different probability functions or densities, with distinctions made clear from the context.
Define , and let denote the distribution of . Marginalizing on the graph, we have
[TABLE]
In this expression, the last integral is the joint distribution , pointing out that the random vector has an exchangeable distribution. Due to this distributional symmetry and the product rule, we can always make depend on a certain only through , in which denotes the vector with the component removed. Using the symbol to denote proportionality up to a suitable normalization factor, and the label “all others” to designate the observed counts , and all the other latent variables and model parameters, with the exception of the one under consideration, we have that
[TABLE]
Therefore, the Pólya-Blackwell-MacQueen urn process yields the full conditional distribution of as the mixture
[TABLE]
in which denotes a point mass at . In the former expression we suppressed the normalization constant which makes all mixture weights add up to one.
The derivations of the full conditionals for and are straightforward.
[TABLE]
[TABLE]
West [10] shows how to derive the full conditional distribution of the concentration parameter in simple closed form, after the introduction of an auxiliary random variable . Using this technique, we have the full conditionals
[TABLE]
[TABLE]
in which we suppressed the normalization constant which makes the two mixture weights add up to one.
These full conditional distributions allow us to explore the model posterior distribution by coding a plain Gibbs sampler [11]. Experimentation with this Gibbs sampler shows that, as pointed out by Escobar and West [12] in a similar context, we can improve mixing by resampling simultaneously the values of all ’s inside the same cluster at the end of each iteration. Formally, let be the unique values among and define the number of occupants of cluster by . It follows that
[TABLE]
for . After the ’s are sampled from this distribution, we update the values of all ’s inside each cluster by the corresponding .
In the next section, we discuss how to choose the prior parameters for the DP-INAR(1) model.
6 Choice of prior parameters
Extending the original scheme proposed by Dorazio [13], we choose the parameters and of the prior by minimizing the Kullback-Leibler divergence between the prior distribution of the number of clusters and a uniform discrete distribution on a suitable range. Using the results in Section 4, the marginal probability function of can be computed as
[TABLE]
for , in which
[TABLE]
Using the information available about the phenomena under consideration to make a sensible choice for the integers and , and letting be the probability function of a uniform discrete distribution on , that is
[TABLE]
we find, by numerical integration and optimization, the values of and that minimize the Kullback-Leibler divergence
[TABLE]
We choose the parameters and of the base probability density in a similar fashion, minimizing the Kullback-Leibler divergence between and a uniform distribution on a suitable range , in which is chosen by taking into consideration the available information on the studied phenomena. Letting be a uniform density on , that is
[TABLE]
we find, by numerical optimization, the values of and that minimize the Kullback-Leibler divergence
[TABLE]
Choosing the parameters for the prior is more straightforward, with being a natural choice.
7 Bayesian forecasting
The Gibbs sampler described in Section 5 yields, marginally, a sample from the posterior distribution. Note that, for , we can obtain the number of clusters as the number of distinct entries in the vector . Uncertainty about future counts is represented by the -steps-ahead posterior predictive distribution
[TABLE]
for some target . In particular, a pointwise forecast is obtained as a suitable summary of this posterior predictive distribution.
Using the law of total probability, the product rule, and simplifying the conditional independences in the model, we can write the posterior predictive probability function as
[TABLE]
A nice property of the DP-INAR(1) model is that we can derive a simple analytical expression for the first factor in the integrand above.
Proposition 7.1**.**
The probability function of , given , , and , can be writen as the convolution of a distribution and a ) distribution,
[TABLE]
in which
[TABLE]
Proof.
We prove the result by induction. For , using a simplified notation, the conditional moment generating function is given by
[TABLE]
But since is a sequence of conditionally independent random variables, which is also conditionally independent of , we have that
[TABLE]
which is the product of the generating functions of a random variable and a random variable. Now, suppose the result holds for an arbitrary . Then,
[TABLE]
in which we defined . Consequently, from the induction hypothesis, we have that
[TABLE]
in which . Hence, the result holds for , completing the proof. ∎
Using the Pólya-Blackwell-MacQueen urn process repeatedly, for , we draw a sample from sequentially as follows:
[TABLE]
Combining all these elements, we approximate the integral representation of the -steps-ahead posterior predictive probability function by the Monte Carlo average
[TABLE]
for .
As a pointwise forecast , we compute the generalized median of the -steps-ahead posterior predictive distribution, defined by
[TABLE]
We use a form of cross-validation to evaluate the forecasting performance of the model. For an observed time series , we pick some , and treat the counts as a holdout (test) sample. For , we train the model conditioning only on the values and making an -steps-ahead prediction . To score the forecast performance, we average the median deviations over all predictions. This cross-validation procedure is depicted in Figure 2.
In the next section, we assess the forecasting performance of the DP-INAR(1) model, analyzing monthly time series of burglary occurrences in Pittsburgh, USA.
8 Pittsburgh crime data
In this section, we analyze monthly time series of burglary events in Pittsburgh, USA, from January to December [14]. In this dataset, each time series has a length of months and corresponds to a certain patrol area.
Figure 3 presents the time series for patrol area , which displays substantial time heterogeneity and variation in the monthly counts of burglary events. In what follows, we use this patrol area to exemplify the model training procedure.
To determine the hyperparameters and , the optimization procedure described in Section 6, with and , yields and .
Using the procedure discussed in Section 6, we control the support of by choosing the value of to be the maximum observed count. Figure 4 displays the level curves of . The minimum is attained at and .
For the thinning parameter , we adopt a uniform prior, choosing .
The marginal posterior distributions of parameters , , , and are displayed in Figure 5. The posterior distribution of the thinning parameter is reasonably concentrated, with posterior mean , showing that the autoregressive component is not negligible. The posterior distributions of , and are fairly concentrated as well, with posterior means equal to , and , respectively, showing that different regimes of innovation rates were captured in the learning process. The Markov chains in Figure 6 indicate that proper mixing is achieved by the Gibbs sampler.
Figure 7 shows both the prior and posterior distributions of the number of clusters . While the prior distribution is reasonably flat in the range to , the posterior distribution is concentrated around , the posterior mode.
With regard to the forecasting performance within this dataset, Tables LABEL:tab:mae_1, LABEL:tab:mae_2, and LABEL:tab:mae_3 present the mean absolute deviations for one, two and three-steps-ahead forecasts for both the DP-INAR(1) and the original INAR(1) model. In these tables, the mean absolute deviations are computed predicting the values of the last , , and months of each time series, according to the desired number of steps ahead, using the cross-validation procedure described in Section 7. The tables show that the DP-INAR(1) model outperforms the INAR(1) in the majority of the districts.
Acknowledgements
Helton Graziadei and Hedibert F. Lopes thank FAPESP for financial support through grants numbers 2017/10096-6 and 2017/22914-5.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] C. Weiß, An introduction to discrete-valued time series . John Wiley & Sons, 2018.
- 2[2] E. Mc Kenzie, “Some simple models for discrete variate time series,” Journal of the American Water Resources Association , vol. 21, no. 4, pp. 645–650, 1985.
- 3[3] M. Al-Osh and A. Alzaid, “First-order integer-valued autoregressive (INAR(1)) process: distributional and regression properties,” Statistica Neerlandica , vol. 42, pp. 53–61, 1988.
- 4[4] T. Ferguson, “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics , vol. 1, no. 2, pp. 209–230, 1973.
- 5[5] M. Tanner and W. Wong, “The calculation of posterior distributions by data augmentation,” Journal of the American Statistical Association , vol. 82, no. 398, pp. 528–540, 1987.
- 6[6] D. Van Dyk and X.-L. Meng, “The art of data augmentation,” Journal of Computational and Graphical Statistics , vol. 10, no. 1, pp. 1–50, 2001.
- 7[7] D. Blackwell and J. Mac Queen, “Ferguson distributions via Pólya urn schemes,” The Annals of Statistics , vol. 1, no. 2, pp. 353–355, 1973.
- 8[8] C. Antoniak, “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems,” The Annals of Statistics , vol. 2, no. 6, pp. 1152–1174, 1974.
