Learning a latent pattern of heterogeneity in the innovation rates of a   time series of counts

Helton Graziadei; Hedibert F. Lopes; Paulo C. Marques F

arXiv:1907.03155·stat.ME·July 9, 2019

Learning a latent pattern of heterogeneity in the innovation rates of a time series of counts

Helton Graziadei, Hedibert F. Lopes, Paulo C. Marques F

PDF

Open Access

TL;DR

This paper introduces a Bayesian hierarchical semiparametric model that learns latent heterogeneity patterns in count time series, improving probabilistic forecasting, demonstrated through crime data analysis.

Contribution

It presents a novel model that captures heterogeneity in innovation rates of count time series using a Dirichlet process, enhancing forecasting accuracy.

Findings

01

Model effectively learns latent heterogeneity patterns.

02

Favorable results in crime data forecasting.

03

Demonstrates improved probabilistic predictions.

Abstract

We develop a Bayesian hierarchical semiparametric model for phenomena related to time series of counts. The main feature of the model is its capability to learn a latent pattern of heterogeneity in the distribution of the process innovation rates, which are softly clustered through time with the help of a Dirichlet process placed at the top of the model hierarchy. The probabilistic forecasting capabilities of the model are put to test in the analysis of crime data in Pittsburgh, with favorable results.

Tables3

Table 1. Table 1: Mean absolute deviations for one-step-ahead predictions. The first block is formed by police patrol areas in which the DP-INAR(1) model outperforms the INAR(1) model. The DP-INAR(1) model produces lower mean absolute deviations in 67 % percent 67 67\% of the patrol areas.

Patrol Area	DP-INAR(1)	INAR(1)
21	1.1395	1.1860
27	1.1860	1.3488
25	1.2326	1.3023
26	1.5116	2.0233
24	1.5349	1.6512
44	1.7907	1.8372
33	1.8140	1.9302
56	1.9302	2.0930
16	2.0000	2.0930
22	2.1163	2.2791
17	2.2558	2.2791
47	2.2558	2.3023
58	2.5116	2.9767
14	2.5349	2.5814
54	2.5349	2.8837
46	2.6279	2.7442
15	2.7209	2.7907
23	3.2093	3.3023
31	3.4419	3.4884
12	3.5116	3.9070
28	0.8372	0.8140
43	2.1861	2.1628
41	2.3953	2.3721
13	2.6977	2.6744
53	2.8837	2.8372
51	2.9302	2.8605
32	3.4884	3.4419
34	3.6744	3.5814
52	3.9302	3.8140
55	4.8837	4.5116

Table 2. Table 2: Mean absolute deviations for two-step-ahead predictions. The first block is formed by police patrol areas in which the DP-INAR(1) model outperforms the INAR(1) model. The DP-INAR(1) model produces lower mean absolute deviations in 61 % percent 61 61\% of the areas.

Patrol Area	INAR(1)	DP-INAR(1)
11	1.1429	1.1667
21	1.1667	1.3571
27	1.3095	1.3810
24	1.7381	1.9524
44	1.8333	1.8810
33	1.9524	2.2381
41	2.2619	2.3095
26	2.2857	2.5714
56	2.3095	2.5476
22	2.3333	2.5714
13	2.6190	2.8095
15	2.6667	2.8095
51	2.7857	2.8095
54	2.9286	3.4524
58	2.9762	3.5238
32	3.7143	3.7619
12	3.9286	4.4286
25	1.2619	1.2381
17	2.2143	2.1429
47	2.2857	2.2619
14	2.4048	2.3571
46	2.7143	2.5952
29	2.9762	2.9524
42	3.3571	3.3333
31	3.6905	3.6667
34	4.0000	3.8810
55	4.4762	4.0000
52	4.1667	4.0714

Table 3. Table 3: Mean absolute deviations based on three-step-ahead predictions. The first block is formed by police patrol areas in which the DP-INAR(1) model outperforms the INAR(1) model. The DP-INAR(1) model produces lower mean absolute deviations in 70 % percent 70 70\% of the areas.

Patrol Area	DP-INAR(1)	INAR(1)
11	1.1463	1.1951
21	1.1707	1.4146
25	1.2683	1.2927
27	1.3659	1.4146
24	1.8293	1.9512
33	2.1220	2.3171
17	2.1707	2.1951
22	2.3171	2.5610
56	2.3415	2.6341
26	2.4390	2.7073
13	2.6585	2.9024
15	2.7561	2.9024
53	2.9024	2.9756
58	3.0000	3.6341
54	3.0244	3.3902
23	3.4146	3.5366
31	3.5610	3.5854
32	3.7805	3.9024
12	4.0732	4.4634
44	1.9268	1.9024
43	2.3171	2.2683
14	2.5610	2.4878
45	2.6098	2.5854
46	2.6098	2.5366
52	4.0244	3.9268
34	4.1707	4.1463
55	4.2927	4.1951

Equations95

Y_{t} = α \circ Y_{t - 1} + Z_{t},

Y_{t} = α \circ Y_{t - 1} + Z_{t},

Pr {Y_{1} = y_{1}, \dots, Y_{T} = y_{T} ∣ α, λ} = t = 2 \prod T Pr {Y_{t} = y_{t} ∣ Y_{t - 1} = y_{t - 1}, α, λ_{t}} .

Pr {Y_{1} = y_{1}, \dots, Y_{T} = y_{T} ∣ α, λ} = t = 2 \prod T Pr {Y_{t} = y_{t} ∣ Y_{t - 1} = y_{t - 1}, α, λ_{t}} .

Pr {Y_{t} = y_{t}

Pr {Y_{t} = y_{t}

\displaystyle=\Pr\!\left\{\sum_{i=1}^{Y_{t-1}}B_{i}(t)+Z_{t}=y_{t}\,\;\Bigg{|}\;\,Y_{t-1}=y_{t-1},\alpha,\lambda_{t}\right\}

\displaystyle=\sum_{m_{t}=0}^{\min\{y_{t},\,y_{t-1}\}}\Pr\!\left\{\sum_{i=1}^{y_{t-1}}B_{i}(t)=m_{t},Z_{t}=y_{t}-m_{t}\,\;\Bigg{|}\;\,\alpha,\lambda_{t}\right\}

\displaystyle=\sum_{m_{t}=0}^{\min\{y_{t},\,y_{t-1}\}}\Pr\!\left\{\sum_{i=1}^{y_{t-1}}B_{i}(t)=m_{t}\,\;\Bigg{|}\;\,\alpha\right\}\Pr\{Z_{t}=y_{t}-m_{t}\mid\lambda_{t}\}.

L_{y} (α, λ) = t = 2 \prod T m_{t} = 0 \sum m i n {y_{t - 1}, y_{t}} (m _{t} y _{t - 1}) α^{m_{t}} (1 - α)^{y_{t - 1} - m_{t}} (\frac{e ^{- λ_{t}} λ _{t}^{y_{t} - m_{t}}}{( y _{t} - m _{t} )!}) .

L_{y} (α, λ) = t = 2 \prod T m_{t} = 0 \sum m i n {y_{t - 1}, y_{t}} (m _{t} y _{t - 1}) α^{m_{t}} (1 - α)^{y_{t - 1} - m_{t}} (\frac{e ^{- λ_{t}} λ _{t}^{y_{t} - m_{t}}}{( y _{t} - m _{t} )!}) .

M_{t} ∣ α, Y_{t - 1} = y_{t - 1} \sim Binomial (y_{t - 1}, α),

M_{t} ∣ α, Y_{t - 1} = y_{t - 1} \sim Binomial (y_{t - 1}, α),

Pr {Y_{t} = y_{t} ∣ M_{t} = m_{t}, λ_{t}} = \frac{e ^{- λ_{t}} λ _{t}^{y_{t} - m_{t}}}{( y _{t} - m _{t} )!} I_{{m_{t}, m_{t + 1}, \dots}} (y_{t}),

Pr {Y_{t} = y_{t} ∣ M_{t} = m_{t}, λ_{t}} = \frac{e ^{- λ_{t}} λ _{t}^{y_{t} - m_{t}}}{( y _{t} - m _{t} )!} I_{{m_{t}, m_{t + 1}, \dots}} (y_{t}),

Pr {Y_{t}

Pr {Y_{t}

= m_{t} = 0 \sum y_{t - 1} Pr {Y_{t} = y_{t} ∣ M_{t} = m_{t}, λ_{t}} \times Pr {M_{t} = m_{t} ∣ Y_{t - 1} = y_{t - 1}, α},

I_{{m_{t}, m_{t + 1}, \dots}} (y_{t}) \times I_{{0, 1, \dots, y_{t - 1}}} (m_{t})

I_{{m_{t}, m_{t + 1}, \dots}} (y_{t}) \times I_{{0, 1, \dots, y_{t - 1}}} (m_{t})

= I_{{0, 1, \dots, m i n {y_{t}, y_{t - 1}}}} (m_{t}),

G ∣ X_{1} = x_{1}, \dots, X_{n} = x_{n} \sim DP ((τ + n) (\frac{τ}{τ + n} G_{0} + \frac{1}{τ + n} i = 1 \sum n I_{B} (x_{i}))) .

G ∣ X_{1} = x_{1}, \dots, X_{n} = x_{n} \sim DP ((τ + n) (\frac{τ}{τ + n} G_{0} + \frac{1}{τ + n} i = 1 \sum n I_{B} (x_{i}))) .

Pr {X_{n + 1} \in B ∣ X_{1}, \dots, X_{n}}

Pr {X_{n + 1} \in B ∣ X_{1}, \dots, X_{n}}

= E [Pr {X_{n + 1} \in B ∣ G} ∣ X_{1}, \dots, X_{n}]

= E [G (B) ∣ X_{1}, \dots, X_{n}],

Pr {X_{n + 1} \in B ∣ X_{1} = x_{1}, \dots, X_{n} = x_{n}} = \frac{τ}{τ + n} G_{0} (B) + \frac{1}{τ + n} i = 1 \sum n I_{B} (x_{i}) .

Pr {X_{n + 1} \in B ∣ X_{1} = x_{1}, \dots, X_{n} = x_{n}} = \frac{τ}{τ + n} G_{0} (B) + \frac{1}{τ + n} i = 1 \sum n I_{B} (x_{i}) .

Pr {K = k ∣ τ} = S (n, k) τ^{k} \frac{Γ ( τ )}{Γ ( τ + n )} I_{{1, 2, \dots, n}} (k),

Pr {K = k ∣ τ} = S (n, k) τ^{k} \frac{Γ ( τ )}{Γ ( τ + n )} I_{{1, 2, \dots, n}} (k),

p (y, m, α, λ)

p (y, m, α, λ)

\displaystyle=\Bigg{\{}\prod_{t=2}^{T}p(y_{t}\mid m_{t},\lambda_{t})\ p(m_{t}\mid y_{t-1},\alpha)\Bigg{\}}\times\pi(\alpha)\times\int\prod_{t=2}^{T}p(\lambda_{t}\mid G)\,d\mu_{\mathbb{G}}(G).

p (λ_{t} ∣ all others) \propto p (y, m, α, λ) \propto p (λ_{t} ∣ λ_{∖ t}) p (y_{t} ∣ m_{t}, λ_{t}) \propto e^{- λ_{t}} λ_{t}^{y_{t} - m_{t}} p (λ_{t} ∣ λ_{∖ t}) .

p (λ_{t} ∣ all others) \propto p (y, m, α, λ) \propto p (λ_{t} ∣ λ_{∖ t}) p (y_{t} ∣ m_{t}, λ_{t}) \propto e^{- λ_{t}} λ_{t}^{y_{t} - m_{t}} p (λ_{t} ∣ λ_{∖ t}) .

λ_{t} ∣ all others

λ_{t} ∣ all others

+ r \neq = t \sum λ_{r}^{y_{t} - m_{t}} e^{- λ_{r}} δ_{{λ_{r}}},

α ∣ all others \sim Beta (a_{0}^{(α)} + t = 2 \sum T m_{t}, b_{0}^{(α)} + t = 2 \sum T (y_{t - 1} - m_{t})) .

α ∣ all others \sim Beta (a_{0}^{(α)} + t = 2 \sum T m_{t}, b_{0}^{(α)} + t = 2 \sum T (y_{t - 1} - m_{t})) .

p(m_{t}\mid\text{all others})\boldsymbol{\propto}\frac{1}{m_{t}!(y_{t}-m_{t})!(y_{t-1}-m_{t})!}\Bigg{(}\frac{\alpha}{\lambda_{t}(1-\alpha)}\Bigg{)}^{m_{t}}\mathbb{I}_{\{0,1,\ldots,\min{\{y_{t-1},y_{t}\}}\}}(m_{t}).

p(m_{t}\mid\text{all others})\boldsymbol{\propto}\frac{1}{m_{t}!(y_{t}-m_{t})!(y_{t-1}-m_{t})!}\Bigg{(}\frac{\alpha}{\lambda_{t}(1-\alpha)}\Bigg{)}^{m_{t}}\mathbb{I}_{\{0,1,\ldots,\min{\{y_{t-1},y_{t}\}}\}}(m_{t}).

U ∣ all others \sim Beta (τ + 1, T - 1);

U ∣ all others \sim Beta (τ + 1, T - 1);

τ ∣ all others

τ ∣ all others

+ \frac{( T - 1 ) \cdot Γ ( a _{0}^{(τ)} + k - 1 )}{( b _{0}^{(τ)} - lo g u ) ^{a_{0}^{(τ)} + k - 1}} \times Gamma (a_{0}^{(τ)} + k - 1, b_{0}^{(τ)} - lo g u),

λ_{j}^{*} ∣ all others \sim Gamma (a_{0}^{(G_{0})} + t = 2 \sum T (y_{t} - m_{t}) \cdot I_{{λ_{j}^{*}}} (λ_{t}), b_{0}^{(G_{0})} + n_{j}) .

λ_{j}^{*} ∣ all others \sim Gamma (a_{0}^{(G_{0})} + t = 2 \sum T (y_{t} - m_{t}) \cdot I_{{λ_{j}^{*}}} (λ_{t}), b_{0}^{(G_{0})} + n_{j}) .

π (k) = \int_{0}^{\infty} Pr {K = k ∣ τ} π (τ) d τ = \frac{b _{0}^{(τ)} S ( T - 1 , k )}{Γ ( a _{0}^{(τ)} )} I (a_{0}^{(τ)}, b_{0}^{(τ)}; k),

π (k) = \int_{0}^{\infty} Pr {K = k ∣ τ} π (τ) d τ = \frac{b _{0}^{(τ)} S ( T - 1 , k )}{Γ ( a _{0}^{(τ)} )} I (a_{0}^{(τ)}, b_{0}^{(τ)}; k),

I (a_{0}^{(τ)}, b_{0}^{(τ)}; k) = \int_{0}^{\infty} \frac{τ ^{k + a_{0}^{(τ)} - 1} e ^{- b_{0}^{(τ)} τ} Γ ( τ )}{Γ ( τ + T - 1 )} d τ .

I (a_{0}^{(τ)}, b_{0}^{(τ)}; k) = \int_{0}^{\infty} \frac{τ ^{k + a_{0}^{(τ)} - 1} e ^{- b_{0}^{(τ)} τ} Γ ( τ )}{Γ ( τ + T - 1 )} d τ .

q (k) = \frac{1}{( k _{max} - k _{min} + 1 )} I_{{k_{min}, \dots, k_{max}}} (k),

q (k) = \frac{1}{( k _{max} - k _{min} + 1 )} I_{{k_{min}, \dots, k_{max}}} (k),

KL [π

KL [π

= (constant) + lo g Γ (a_{0}^{(τ)}) - a_{0}^{(τ)} lo g b_{0}^{(τ)} - \frac{1}{( k _{max} - k _{min} + 1 )} k = k_{min} \sum k_{max} lo g I (a_{0}^{(τ)}, b_{0}^{(τ)}; k) .

h (λ) = (\frac{1}{λ _{max}}) I_{[0, λ_{max}]} (λ),

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBayesian Methods and Mixture Models · Statistical Methods and Inference · Stochastic processes and statistical mechanics

Full text

Learning a latent pattern of heterogeneity in the

innovation rates of a time series of counts

Helton Graziadei Hedibert F. Lopes Paulo C. Marques F.

USP - São Paulo Insper - São Paulo Insper - São Paulo

(July 2019)

Abstract

We develop a Bayesian hierarchical semiparametric model for phenomena related to time series of counts. The main feature of the model is its capability to learn a latent pattern of heterogeneity in the distribution of the process innovation rates, which are softly clustered through time with the help of a Dirichlet process placed at the top of the model hierarchy. The probabilistic forecasting capabilities of the model are put to test in the analysis of crime data in Pittsburgh, with favorable results.

1 Introduction

Time series of counts are associated with a multiplicity of phenomena in fields as diverse as epidemiology, econometrics, finance, environmental studies, and public policy [1]. In this paper, we develop a model for this kind of data, taking into account the possible existence of heterogeneities in the process distribution as it evolves through time.

In a general setting, we consider a Markovian process, for which the current count is modeled as some functional of the count at the previous epoch, plus a stochastic innovation, whose expectation may be specific to the current epoch. Modeling these unobservable innovation rates hierarchically in a suitable way, we can learn a latent pattern of heterogeneity in their distribution, and this information can be incorporated in the forecasting of future counts.

Our investigation implements this general idea in a particular setting built from two main components. Initially, we generalize the first-order integer autoregressive model (INAR(1) model hereafter), introduced in the seminal papers of McKenzie [2] and Al-Osh and Alzaid [3], allowing for different values of the innovation rates at different times. Subsequently, this generalized model is extended hierarchically, with the help of a Dirichlet process [4]. This gives us a semiparametric model, which, due to the properties of the Dirichlet process, is capable of clustering the values of the innovation rates through time, based on the information contained in the observed counts, thereby allowing us to identify different innovation regimes in the time evolution of the process. Forecasting within this probabilistic model is made straightforwardly through the appropriate posterior predictive distributions.

The paper is organized as follows. In Section 2, we generalize the original INAR(1) model, allowing for distinct innovation rates at different epochs of the process. In Section 3, this generalized INAR(1) model is data augmented, leading to a conditional specification of the model which enables the derivation of simple forms for the model parameters and latent variables full conditional distributions. The necessary Dirichlet process definitions and properties are briefly reviewed in Section 4. In Section 5, we set up the DP-INAR(1) model introducing a Dirichlet process at the top of the hierarchy developed in Section 2. After the forms of the prior distributions have been specified, we derive simple closed forms for the full conditional distributions of the model parameters and latent variables. We carefully consider the choice of prior parameters in Section 6. In Section 7, we show how to use the DP-INAR(1) model to do Bayesian forecasting. In Section 8, we put all the analytical results to work in the forecasting of crime data in Pittsburgh, US. In this application, the DP-INAR(1) model outperforms the original INAR(1) model in the majority of the patrol areas.

2 Generalized INAR(1) model

We begin by generalizing the original INAR(1) model of McKenzie [2] and Al-Osh and Alzaid [3] as follows.

Let $\{Y_{t}\}_{t\geq 1}$ be an integer-valued time series, and let the innovations $\{Z_{t}\}_{t\geq 2}$ , given positive parameters $\{\lambda_{t}\}_{t\geq 2}$ , be a sequence of conditionally independent $\text{Poisson}(\lambda_{t})$ random variables. Given a parameter $\alpha\in[0,1]$ , let $\{B_{i}(t):i\geq 0,t\geq 2\}$ be a family of conditionally independent and identically distributed $\text{Bernoulli}(\alpha)$ random variables. Furthermore, given all the parameters, assume that the innovations $\{Z_{t}\}_{t\geq 2}$ and the family $\{B_{i}(t):i\geq 0,t\geq 2\}$ are conditionally independent. The generalized INAR(1) model is defined by the functional relation

[TABLE]

for $t\geq 2$ , in which $\circ$ denotes the binomial thinning operator, defined by $\alpha\circ Y_{t-1}=\sum_{i=1}^{Y_{t-1}}B_{i}(t)$ , if $Y_{t-1}>0$ , and $\alpha\circ Y_{t-1}=0$ , if $Y_{t-1}=0$ . In the homogeneous case, when all the $\lambda_{t}$ ’s are assumed to be equal, we recover the original INAR(1) model.

This model can be interpreted as specifying a birth-and-death process, in which, at epoch $t$ , the number of cases $Y_{t}$ is equal to the new cases $Z_{t}$ plus the cases that survived from the previous epoch; the role of the binomial thinning operator being to remove a random number of the $Y_{t-1}$ cases present at the previous epoch $t-1$ .

Let $y=(y_{1},\dots,y_{T})$ denote the values of an observed time series. For simplicity, we assume that $Y_{1}=y_{1}$ with probability one. Since the process $\{Y_{t}\}_{t\geq 1}$ is Markovian, the joint distribution of $Y_{1},\dots,Y_{T}$ , given parameters $\alpha$ and $\lambda=(\lambda_{2},\dots,\lambda_{T})$ , can be factored as

[TABLE]

Since, with probability one, $\alpha\circ Y_{t-1}\leq Y_{t-1}$ and $Z_{t}\geq 0$ , by the law of total probability and the definition of the generalized INAR(1) model we have that

[TABLE]

Hence, the generalized INAR(1) model likelihood function is given by

[TABLE]

In the next section, we show how the introduction of certain latent (unobservable) random variables allows us to specify the generalized INAR(1) model in terms of a set of conditional distributions. This alternative representation leads to a factorization of the model joint distribution which is the key element propelling our Monte Carlo simulations.

3 Data augmentation

In the generalized INAR(1) model, suppose that, in addition to the values of the counts $Y_{1},\dots,Y_{T}$ , we could observe the values of the maturations $M_{t}=\alpha\circ Y_{t-1}$ . The $M_{t}$ ’s would tell us the number of cases that matured (survived) from the previous epoch, breaking down $Y_{t}$ into two parcels: maturations plus innovations.

This is an example of data augmentation [5, 6], in which the introduction of the unobservable maturations, with suitable conditional distributions, factors the model into more manageable pieces. Within this data augmentation scheme, we postulate that

[TABLE]

and

[TABLE]

in which $\mathbb{I}_{A}$ denotes the indicator function of the set $A$ , defined by $\mathbb{I}_{A}(x)=1$ , if $x\in A$ , and $\mathbb{I}_{A}(x)=0$ , if $x\notin A$ .

Using the law of total probability and the product rule, we have that

[TABLE]

in which, following the data augmentation scheme, we took advantage of the appropriate conditional independences.

Since

[TABLE]

comparing the expression above for $\Pr\{Y_{t}=y_{t}\mid Y_{t-1}=y_{t-1},\alpha,\lambda_{t}\}$ with the results in the previous section, we come to the conclusion that this is a valid data augmentation scheme, since it induces the same generalized INAR(1) model likelihood function.

In the next section, we recollect the main definitions and results related to the Dirichlet Process which are necessary to build-up our semiparametric hierarchical model. The data augmentation scheme developed above will come in handy in the derivation of the full conditional distributions of the complete model.

4 The Dirichlet process

Suppose that we represent our uncertainties about quantities assuming values in a sampling space $\mathscr{X}$ , with sigma-field $\mathscr{B}$ , by means of an underlying probability space $(\Omega,\mathscr{F},\Pr)$ .

A mapping $\mathbb{G}:\mathscr{B}\times\Omega\to[0,1]$ is a random probability measure if $\mathbb{G}(\,\cdot\,,\omega)$ is a probability measure over $(\mathscr{X},\mathscr{B})$ , for every $\omega\in\Omega$ , and $\mathbb{G}(B)=\mathbb{G}(B,\cdot\,)$ is a random variable, for each $B\in\mathscr{B}$ .

Ferguson [4] defined a random probability measure $\mathbb{G}$ descriptively as follows. Let $\beta$ be a finite nonzero measure over $(\mathscr{X},\mathscr{B})$ and postulate that for each $\mathscr{B}$ -measurable partition $\{B_{1},\dots,B_{k}\}$ of $\mathscr{X}$ the random vector $(\mathbb{G}(B_{1}),\dots,\mathbb{G}(B_{k}))$ has the ordinary Dirichlet distribution with parameters $(\beta(B_{1}),\dots,\beta(B_{k}))$ . In this case, we say that $\mathbb{G}$ is a Dirichlet process with base measure $\beta$ , and use the notation $\mathbb{G}\sim\text{DP}(\beta)$ . Ferguson proved that $\mathbb{G}$ is a properly defined random process in the sense of Kolmogorov’s consistency theorem.

Defining the concentration parameter $\tau=\beta(\mathscr{X})$ , and the base probability measure $G_{0}$ by $G_{0}(B)=\beta(B)/\beta(\mathscr{X})$ , it follows from the usual properties of the Dirichlet distribution that $\text{E}[\mathbb{G}(B)]=G_{0}(B)$ and $\text{Var}[\mathbb{G}(B)]=G_{0}(B)(1-G_{0}(B))/(\tau+1)$ , for every $B\in\mathscr{B}$ . Therefore, $\mathbb{G}$ is centered on $G_{0}$ , and $\tau$ controls the concentration of $\mathbb{G}$ around $G_{0}$ . In terms of the concentration parameter and the base probability measure, we write $\mathbb{G}\sim\text{DP}(\tau\,G_{0})$ .

Inference with the Dirichlet process is tractable. In particular, Ferguson proved that the Dirichlet process is closed under sampling: if $X_{1},\dots,X_{n}$ are conditionally independent and identically distributed, given $\mathbb{G}\sim\text{DP}(\tau\,G_{0})$ , such that $\Pr\{X_{i}\in B\mid\mathbb{G}=G\}=G(B)$ , for every $B$ in $\mathscr{B}$ , then

[TABLE]

Notice that, using the law of total expectation, we have

[TABLE]

almost surely, for every $B$ in $\mathscr{B}$ , in which the second equality follows from the conditional independence of the $X_{i}$ ’s. Hence, the posterior predictive distribution is

[TABLE]

This expression of the posterior predictive distribution unleashes important features of the Dirichlet process, thereby showing how it can be used as a modeling tool. In particular, it defines a data generating process known as the Pólya-Blackwell-MacQueen urn [7]. If we imagine the sequential generation of the $X_{i}$ ’s, for $i=1,\dots,n$ , we see that a value is generated anew from $G_{0}$ with probability proportional to $\tau$ , or we repeat one the previously generated values with probability proportional to its multiplicity. This shows that, almost surely, realizations of a Dirichlet process $\mathbb{G}$ are discrete probability measures, maybe with denumerably infinite support, depending on the nature of $G_{0}$ . Also, this data generating process associated with the Pólya-Blackwell-MacQueen urn implies that the $X_{i}$ ’s are clustered, which is the main feature of the Dirichlet process that we rely on to build our semiparametric model. Antoniak [8] derived the conditional distribution of the number of distinct $X_{i}$ ’s, that is, the number of clusters $K$ , given the concentration parameter $\tau$ , as

[TABLE]

in which $S(n,k)$ denotes the unsigned Stirling number of the first kind.

In the next section, we place a Dirichlet process at the top of the hierarchy of the generalized INAR(1) model, completing the specification of our semiparametric model, thereby being able to represent our uncertainty about the values of the unobservable innovation rates $\lambda_{t}$ ’s, given the information contained in the observed counts. In doing so, we benefit from the clustering properties of the Dirichlet process described above, identifying different regimes for the innovation rates as the process evolves through time.

5 DP-INAR(1) model

The DP-INAR(1) model completes the generalized INAR(1) model defined in Section 2, placing a Dirichlet process at the top of the hierarchy. Formally, we model the innovation rates $\lambda_{2},\dots,\lambda_{T}$ , given $\mathbb{G}\sim\text{DP}(\tau\,G_{0})$ , as conditionally independent and identically distributed, with $\Pr\{\lambda_{t}\in B\mid\mathbb{G}=G\}=G(B)$ , for every Borel set $B$ . The prior distributions for $\alpha$ and $\tau$ are $\text{Beta}(a_{0}^{(\alpha)},b_{0}^{(\alpha)})$ and $\text{Gamma}(a_{0}^{(\tau)},b_{0}^{(\tau)})$ , respectively. The base probability measure $G_{0}$ is a $\text{Gamma}(a_{0}^{(G_{0})},b_{0}^{(G_{0})})$ distribution. In Section 6, we discuss in detail the choice of prior parameters.

Figure 1 displays a graphical representation of the DP-INAR(1) model. In the graph, absence of an arrow connecting two random objects means that they are conditionally independent given their parents (see [9] for a witful discussion of graphical models).

Our next step is to derive the full conditional distributions for all latent variables and model parameters. For convenience, we adopt a simplified notation in the following derivations, using the same letter $p$ to denote different probability functions or densities, with distinctions made clear from the context.

Define $m=(m_{2},\dots,m_{T})$ , and let $\mu_{\mathbb{G}}$ denote the distribution of $\mathbb{G}$ . Marginalizing $\mathbb{G}$ on the graph, we have

[TABLE]

In this expression, the last integral is the joint distribution $p(\lambda_{2},\dots,\lambda_{T})$ , pointing out that the random vector $(\lambda_{2},\dots,\lambda_{T})$ has an exchangeable distribution. Due to this distributional symmetry and the product rule, we can always make $p(\lambda_{2},\dots,\lambda_{T})$ depend on a certain $\lambda_{t}$ only through $p(\lambda_{t}\mid\lambda_{\setminus t})$ , in which $\lambda_{\setminus t}$ denotes the vector $\lambda$ with the component $\lambda_{t}$ removed. Using the symbol $\boldsymbol{\propto}$ to denote proportionality up to a suitable normalization factor, and the label “all others” to designate the observed counts $y$ , and all the other latent variables and model parameters, with the exception of the one under consideration, we have that

[TABLE]

Therefore, the Pólya-Blackwell-MacQueen urn process yields the full conditional distribution of $\lambda_{t}$ as the mixture

[TABLE]

in which $\delta_{\{\lambda_{r}\}}$ denotes a point mass at $\lambda_{r}$ . In the former expression we suppressed the normalization constant which makes all mixture weights add up to one.

The derivations of the full conditionals for $\alpha$ and $m_{t}$ are straightforward.

[TABLE]

West [10] shows how to derive the full conditional distribution of the concentration parameter $\tau$ in simple closed form, after the introduction of an auxiliary random variable $U$ . Using this technique, we have the full conditionals

[TABLE]

in which we suppressed the normalization constant which makes the two mixture weights add up to one.

These full conditional distributions allow us to explore the model posterior distribution by coding a plain Gibbs sampler [11]. Experimentation with this Gibbs sampler shows that, as pointed out by Escobar and West [12] in a similar context, we can improve mixing by resampling simultaneously the values of all $\lambda_{t}$ ’s inside the same cluster at the end of each iteration. Formally, let $(\lambda^{*}_{1},\dots,\lambda^{*}_{k})$ be the $k$ unique values among $(\lambda_{2},\dots,\lambda_{T})$ and define the number of occupants of cluster $j$ by $n_{j}=\sum_{t=2}^{T}\mathbb{I}_{\{\lambda^{*}_{j}\}}(\lambda_{t})$ . It follows that

[TABLE]

for $j=1,\ldots,k$ . After the $\lambda^{*}_{j}$ ’s are sampled from this distribution, we update the values of all $\lambda_{t}$ ’s inside each cluster by the corresponding $\lambda^{*}_{j}$ .

In the next section, we discuss how to choose the prior parameters for the DP-INAR(1) model.

6 Choice of prior parameters

Extending the original scheme proposed by Dorazio [13], we choose the parameters $a_{0}^{(\tau)}$ and $b_{0}^{(\tau)}$ of the $\tau$ prior by minimizing the Kullback-Leibler divergence between the prior distribution of the number of clusters $K$ and a uniform discrete distribution on a suitable range. Using the results in Section 4, the marginal probability function of $K$ can be computed as

[TABLE]

for $k=1,\dots,T-1$ , in which

[TABLE]

Using the information available about the phenomena under consideration to make a sensible choice for the integers $k_{\text{min}}$ and $k_{\text{max}}$ , and letting $q$ be the probability function of a uniform discrete distribution on $\{k_{\text{min}},\dots,k_{\text{max}}\}$ , that is

[TABLE]

we find, by numerical integration and optimization, the values of $a_{0}^{(\tau)}$ and $b_{0}^{(\tau)}$ that minimize the Kullback-Leibler divergence

[TABLE]

We choose the parameters $a_{0}^{(G_{0})}$ and $b_{0}^{(G_{0})}$ of the base probability density $g_{0}$ in a similar fashion, minimizing the Kullback-Leibler divergence between $g_{0}$ and a uniform distribution on a suitable range $[0,\lambda_{\text{max}}]$ , in which $\lambda_{\text{max}}$ is chosen by taking into consideration the available information on the studied phenomena. Letting $h$ be a uniform density on $[0,\lambda_{\text{max}}]$ , that is

[TABLE]

we find, by numerical optimization, the values of $a_{0}^{(G_{0})}$ and $b_{0}^{(G_{0})}$ that minimize the Kullback-Leibler divergence

[TABLE]

Choosing the parameters for the $\alpha$ prior is more straightforward, with $a_{0}^{(\alpha)}=b_{0}^{(\alpha)}=1$ being a natural choice.

7 Bayesian forecasting

The Gibbs sampler described in Section 5 yields, marginally, a sample $\{\alpha^{(n)},\lambda^{(n)}\}_{n=1}^{N}$ from the posterior distribution. Note that, for $n=1,\dots,N$ , we can obtain the number of clusters $k^{(n)}$ as the number of distinct entries in the vector $\lambda^{(n)}=(\lambda^{(n)}_{1},\dots,\lambda^{(n)}_{T})$ . Uncertainty about future counts is represented by the $h$ -steps-ahead posterior predictive distribution

[TABLE]

for some target $h\geq 1$ . In particular, a pointwise forecast is obtained as a suitable summary of this posterior predictive distribution.

Using the law of total probability, the product rule, and simplifying the conditional independences in the model, we can write the posterior predictive probability function as

[TABLE]

A nice property of the DP-INAR(1) model is that we can derive a simple analytical expression for the first factor in the integrand above.

Proposition 7.1.

The probability function of $Y_{t+h}$ , given $Y_{t}=y_{t}$ , $\alpha$ , and $(\lambda_{t+1},\dots,\lambda_{t+h})$ , can be writen as the convolution of a $\text{Bin}(y_{t},\alpha^{h})$ distribution and a $\text{Poisson}(\mu_{h}$ ) distribution,

[TABLE]

in which

[TABLE]

Proof.

We prove the result by induction. For $h=1$ , using a simplified notation, the conditional moment generating function is given by

[TABLE]

But since $\{Z_{t}\}_{t\geq 2}$ is a sequence of conditionally independent random variables, which is also conditionally independent of $\{B_{i}(t):i\geq 0,t\geq 2\}$ , we have that

[TABLE]

which is the product of the generating functions of a $\text{Binomial}(Y_{t},\alpha)$ random variable and a $\text{Poisson}(\lambda_{t+1})$ random variable. Now, suppose the result holds for an arbitrary $h\geq 2$ . Then,

[TABLE]

in which we defined $e^{u}=\alpha e^{s}+(1-\alpha)$ . Consequently, from the induction hypothesis, we have that

[TABLE]

in which $\mu_{h+1}=\alpha\mu_{h}+\lambda_{t+h+1}$ . Hence, the result holds for $h+1$ , completing the proof. ∎

Using the Pólya-Blackwell-MacQueen urn process repeatedly, for $n=1\dots,N$ , we draw a sample $\{\lambda_{T+1}^{(n)},\dots,\lambda_{T+h}^{(n)}\}_{n=1}^{N}$ from $\prod_{i=1}^{h}p(\lambda_{T+i}\mid\lambda_{2},\dots,\lambda_{T+i-1})$ sequentially as follows:

[TABLE]

Combining all these elements, we approximate the integral representation of the $h$ -steps-ahead posterior predictive probability function by the Monte Carlo average

[TABLE]

for $y_{T+h}\geq 0$ .

As a pointwise forecast $\hat{y}_{T+h}$ , we compute the generalized median of the $h$ -steps-ahead posterior predictive distribution, defined by

[TABLE]

We use a form of cross-validation to evaluate the forecasting performance of the model. For an observed time series $y_{1},\dots,y_{T}$ , we pick some $T^{*}<T$ , and treat the counts $y_{T^{*}},\dots,y_{T}$ as a holdout (test) sample. For $t\geq T^{*}$ , we train the model conditioning only on the values $y_{1},\dots,y_{t-1}$ and making an $h$ -steps-ahead prediction $\hat{y}_{t+h}$ . To score the forecast performance, we average the median deviations $|\hat{y}_{t+h}-y_{t+h}|$ over all predictions. This cross-validation procedure is depicted in Figure 2.

In the next section, we assess the forecasting performance of the DP-INAR(1) model, analyzing monthly time series of burglary occurrences in Pittsburgh, USA.

8 Pittsburgh crime data

In this section, we analyze monthly time series of burglary events in Pittsburgh, USA, from January $1990$ to December $2001$ [14]. In this dataset, each time series has a length of $144$ months and corresponds to a certain patrol area.

Figure 3 presents the time series for patrol area $58$ , which displays substantial time heterogeneity and variation in the monthly counts of burglary events. In what follows, we use this patrol area $58$ to exemplify the model training procedure.

To determine the hyperparameters $a_{0}^{(\tau)}$ and $b_{0}^{(\tau)}$ , the optimization procedure described in Section 6, with $k_{\min}=1$ and $k_{\max}=143$ , yields $a_{0}^{(\tau)}=0.519$ and $b_{0}^{(\tau)}=0.003$ .

Using the procedure discussed in Section 6, we control the support of $G_{0}$ by choosing the value of $\lambda_{\max}$ to be the maximum observed count. Figure 4 displays the level curves of $\text{KL}[g_{0}\;\|\;h]$ . The minimum is attained at $a_{0}^{(G_{0})}=1.778$ and $b_{0}^{(G_{0})}=0.096$ .

For the thinning parameter $\alpha$ , we adopt a uniform prior, choosing $a_{0}^{(\alpha)}=b_{0}^{(\alpha)}=1$ .

The marginal posterior distributions of parameters $\alpha$ , $\lambda_{3}$ , $\lambda_{18}$ , and $\lambda_{96}$ are displayed in Figure 5. The posterior distribution of the thinning parameter $\alpha$ is reasonably concentrated, with posterior mean $0.19$ , showing that the autoregressive component is not negligible. The posterior distributions of $\lambda_{3}$ , $\lambda_{18}$ and $\lambda_{96}$ are fairly concentrated as well, with posterior means equal to $6.50$ , $13.61$ and $32.01$ , respectively, showing that different regimes of innovation rates were captured in the learning process. The Markov chains in Figure 6 indicate that proper mixing is achieved by the Gibbs sampler.

Figure 7 shows both the prior and posterior distributions of the number of clusters $K$ . While the prior distribution is reasonably flat in the range $1$ to $143$ , the posterior distribution is concentrated around $7$ , the posterior mode.

With regard to the forecasting performance within this dataset, Tables LABEL:tab:mae_1, LABEL:tab:mae_2, and LABEL:tab:mae_3 present the mean absolute deviations for one, two and three-steps-ahead forecasts for both the DP-INAR(1) and the original INAR(1) model. In these tables, the mean absolute deviations are computed predicting the values of the last $42$ , $43$ , and $44$ months of each time series, according to the desired number of steps ahead, using the cross-validation procedure described in Section 7. The tables show that the DP-INAR(1) model outperforms the INAR(1) in the majority of the districts.

Acknowledgements

Helton Graziadei and Hedibert F. Lopes thank FAPESP for financial support through grants numbers 2017/10096-6 and 2017/22914-5.

Bibliography14

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] C. Weiß, An introduction to discrete-valued time series . John Wiley & Sons, 2018.
2[2] E. Mc Kenzie, “Some simple models for discrete variate time series,” Journal of the American Water Resources Association , vol. 21, no. 4, pp. 645–650, 1985.
3[3] M. Al-Osh and A. Alzaid, “First-order integer-valued autoregressive (INAR(1)) process: distributional and regression properties,” Statistica Neerlandica , vol. 42, pp. 53–61, 1988.
4[4] T. Ferguson, “A Bayesian analysis of some nonparametric problems,” The Annals of Statistics , vol. 1, no. 2, pp. 209–230, 1973.
5[5] M. Tanner and W. Wong, “The calculation of posterior distributions by data augmentation,” Journal of the American Statistical Association , vol. 82, no. 398, pp. 528–540, 1987.
6[6] D. Van Dyk and X.-L. Meng, “The art of data augmentation,” Journal of Computational and Graphical Statistics , vol. 10, no. 1, pp. 1–50, 2001.
7[7] D. Blackwell and J. Mac Queen, “Ferguson distributions via Pólya urn schemes,” The Annals of Statistics , vol. 1, no. 2, pp. 353–355, 1973.
8[8] C. Antoniak, “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems,” The Annals of Statistics , vol. 2, no. 6, pp. 1152–1174, 1974.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Learning a latent pattern of heterogeneity in the

Abstract

1 Introduction

2 Generalized INAR(1) model

3 Data augmentation

4 The Dirichlet process

5 DP-INAR(1) model

6 Choice of prior parameters

7 Bayesian forecasting

Proposition 7.1**.**

Proof.

8 Pittsburgh crime data

Acknowledgements

Proposition 7.1.