Supplementary Notes: Segment Parameter Labelling in MCMC Change Detection

Alireza Ahrabian

arXiv:1901.05452 · eess.SP · January 18, 2019


TL;DR

This paper introduces a Bayesian change point detection method that leverages segment parameter repetition using a Dirichlet process prior to improve segmentation accuracy in time series data.

Contribution

It proposes a novel Bayesian algorithm that incorporates segment class labels with a Dirichlet process prior to exploit parameter patterns across segments.

Findings

01. Enhanced change point detection accuracy.

02. Effective utilization of segment parameter repetition.

03. Demonstrated improvements over traditional methods.

Abstract

This work addresses the problem of segmentation in time series data with respect to a statistical parameter of interest in Bayesian models. It is common to assume that the parameters are distinct within each segment. As such, many Bayesian change point detection models do not exploit the segment parameter patterns, which can improve performance. This work proposes a Bayesian change point detection algorithm that makes use of repetition in segment parameters, by introducing segment class labels that utilise a Dirichlet process prior.

Equations

$$\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}=f_{d}(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}},\boldsymbol{\phi}_{i})+\boldsymbol{\text{n}}_{\tau_{i}+1:\tau_{i+1}}$$

$$f_{d}(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}},\boldsymbol{\phi}_{i})=\begin{bmatrix}1&0&0&\dots&0\\1&x_{\tau_{i}+1}&0&\dots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\1&x_{\tau_{i+1}-1}&x_{\tau_{i+1}-2}&\dots&x_{\tau_{i+1}-p+1}\end{bmatrix}\begin{bmatrix}\phi^{0}_{i}\\\vdots\\\phi^{0}_{D-1}\end{bmatrix}$$

$$p(K,\boldsymbol{\tau}_{K}|\mathbf{x})\propto\int p(\mathbf{x}|\boldsymbol{\Phi},K,\boldsymbol{\tau}_{K})\,p(\boldsymbol{\Phi},K,\boldsymbol{\tau}_{K})\,d\boldsymbol{\Phi}$$

$$p(\mathbf{x}|\boldsymbol{\Phi},K,\boldsymbol{\tau}_{K})=\prod_{i=0}^{K}p(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}|\boldsymbol{\phi}_{i})$$

$$p(\mathbf{x})=\sum_{v=1}^{V}\pi_{v}f(\mathbf{x}|\theta_{v})$$

$$x_{i}|c_{i},\boldsymbol{\theta}$$

$$c_{i}|\boldsymbol{\pi}$$

$$\theta_{v}$$

$$\boldsymbol{\pi}|\alpha$$

$$p(c_{1},...,c_{N}|\boldsymbol{\pi})=\prod_{v=1}^{V}\pi_{v}^{n_{v}}$$

$$p(c_{i}=v|\boldsymbol{\text{c}}_{-i},\alpha)=\frac{n_{-i,v}+\alpha/V}{N-1+\alpha}$$

$$p(c_{i}=v|\boldsymbol{\text{c}}_{-i},\alpha)=\frac{n_{-i,v}}{N-1+\alpha}$$

$$\sum_{v^{\prime}=1}^{\infty}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)=1$$

$$\sum_{v^{\prime}=1}^{V^{\prime}}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)+\sum_{v^{\prime}=V^{\prime}+1}^{\infty}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)=1$$

$$p(c_{i}\neq c_{l}\quad\text{for all}\quad i\neq l|\boldsymbol{\text{c}}_{-i})=1-\sum_{v^{\prime}=1}^{V^{\prime}}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)=\frac{\alpha}{N-1+\alpha}$$

$$p(c_{i}=v|\boldsymbol{\text{c}}_{-i},x_{i},\boldsymbol{\theta})\propto\frac{n_{-i,v}}{N-1+\alpha}L(x_{i}|\theta_{v})$$

$$p(c_{i}\neq c_{l}\ \dots$$

$$x_{i}|\theta_{i}$$

$$\theta_{i}$$

$$G$$

$$p(\boldsymbol{\Phi}|\boldsymbol{\Phi}^{c},\boldsymbol{\Sigma}^{c},\boldsymbol{\pi})=\sum_{v=1}^{V}\pi_{v}\mathcal{MN}(\boldsymbol{\Phi}|\boldsymbol{\phi}^{c}_{v},\Sigma^{c}_{v})$$

$$\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}|\boldsymbol{\phi}^{c}_{i},\sigma_{i}^{2}$$

$$\sigma_{i}^{2}$$

$$\boldsymbol{\phi}_{i}|\boldsymbol{\phi}^{c}_{i},\Sigma^{c}_{i}$$

$$(\boldsymbol{\phi}^{c}_{i},\Sigma^{c}_{i})$$

$$G$$

$$\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}|\boldsymbol{\tau}_{K},\boldsymbol{\phi}_{i}$$

$$\boldsymbol{\tau}_{K},K$$

$$p(\boldsymbol{\phi}^{c}_{v}|\boldsymbol{\text{c}}_{K},\boldsymbol{\tau}_{K},K,\Sigma^{c}_{v},\mathbf{x})\sim\mathcal{MN}(\boldsymbol{\mu}^{\phi}_{v},\Sigma^{\phi}_{v})\quad v=1,...,V$$

$$p(\Sigma^{c}_{v}|\boldsymbol{\text{c}}_{K},\boldsymbol{\tau}_{K},K,\boldsymbol{\phi}^{c}_{v},\mathbf{x})\sim\mathcal{IW}(\alpha^{\phi}_{v},\boldsymbol{B}^{\phi}_{v})\quad v=1,...,V$$

$$p(\sigma^{2}_{v}|\boldsymbol{\text{c}}_{K},\boldsymbol{\tau}_{K},K,\mathbf{x})\sim\mathcal{IW}(\nu+d_{v},\gamma+Y_{v}^{T}\boldsymbol{\text{P}}_{v}Y_{v})$$

$$p(\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},\boldsymbol{\tau}_{K},K,\lambda|\boldsymbol{\text{c}}_{K},\mathbf{x})\propto p(\mathbf{x}|\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},K,\boldsymbol{\tau}_{K},\boldsymbol{\text{c}}_{K})\times p(K,\boldsymbol{\tau}_{K}|\lambda)\,p(\lambda)\prod_{v=1}^{V}p(\boldsymbol{\phi}^{c}_{v}|\boldsymbol{\lambda}_{\phi},\delta)\,p(\sigma^{2}_{v}|\nu,\gamma)$$

$$p(\mathbf{x}|\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},K,\boldsymbol{\tau}_{K},\boldsymbol{\text{c}}_{K})=\prod_{v=1}^{V}\prod_{i:c_{i}=v}p(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}|\boldsymbol{\phi}^{c}_{v},\sigma^{2}_{v})$$

$$p(\boldsymbol{\tau}_{K},K|\boldsymbol{\text{c}}_{K},\mathbf{x})\propto\prod_{v=1}^{V}\frac{2^{\frac{\nu}{2}}}{\Gamma(\frac{\nu}{2})}\Gamma(K+1)\Gamma(N-K+1)\left(\frac{\gamma}{2}\right)^{\frac{\nu}{2}}\times\Gamma\left(\frac{d_{v}+\nu}{2}\right)\pi^{-\frac{d_{v}}{2}}\left[\gamma+Y_{v}^{T}\boldsymbol{\text{P}}_{v}Y_{v}\right]^{-\frac{d_{v}+\nu}{2}}|\boldsymbol{\text{M}}_{v}|^{-\frac{1}{2}}$$

$$r_{birth}=\frac{p(\boldsymbol{\tau}_{K+1},K+1|\boldsymbol{\text{c}}_{K+1},\mathbf{x})}{p(\boldsymbol{\tau}_{K},K|\boldsymbol{\text{c}}_{K},\mathbf{x})}\,\frac{q(\boldsymbol{\tau}_{K}|\boldsymbol{\tau}_{K+1})\,q(K|K+1)}{q(\boldsymbol{\tau}_{K+1}|\boldsymbol{\tau}_{K})\,q(K+1|K)}$$


Taxonomy

Topics: Bayesian Methods and Mixture Models · Time Series Analysis and Forecasting · Algorithms and Data Compression

Full text


Index Terms— Change Detection, Markov Chain Monte Carlo, Dirichlet Process, Autoregressive Models.

I Introduction

Change detection algorithms are an important tool in the exploratory analysis of real-world data. Such algorithms seek to partition time series data with respect to a statistical parameter of interest; examples include the mean, the variance, and autoregressive model weights. Change point detection algorithms have found applications in fields ranging from financial to biomedical data analysis.

Accordingly, many approaches have been proposed to address the problem of segmenting time series data. The work in [1] proposed an online method, based on the likelihood ratio test, that both detects and estimates a change in the statistical parameter of interest. More recently, the work in [2] proposed a computationally efficient algorithm that segments data by minimising a cost function using dynamic programming, while [3] utilises a classification algorithm (kernel-SVM) to estimate the change point locations. Bayesian approaches to time series segmentation have also proven to be useful. In particular, by deriving the posterior distribution of the parameters of interest (which include the number of segments as well as the change point locations), one can use suitable methodologies for evaluating the posterior distribution in order to both infer and predict. Examples of such work include the fully hierarchical Bayesian model proposed in [4], which utilised a Markov Chain Monte Carlo (MCMC) sampler to evaluate the posterior distribution of the target parameters, and the exact segmentation algorithm (that is, one that avoids the use of MCMC sampling algorithms) developed in [5], which evaluates the posterior distribution using a recursive algorithm.

The change point algorithms mentioned above assume that the statistical parameters of interest from different segments are distinct and thus independent. However, many real-world processes can be modelled by parameters generated from a fixed number of states, where such parameters can be re-assigned more than once (parameter repetition). In particular, hidden Markov models (HMM) and their extensions [6] assign to each data point a state label (that evolves according to a Markov chain) corresponding to a set of parameters that govern the emission probability of generating the data point. Mixture models [7][8] and their extensions (e.g. Dirichlet processes [7]) assume that each data point is generated from a probability distribution whose parameters belong to a state drawn from a discrete distribution (which captures the clustering of the data points with respect to the parameter of interest). However, it should be noted that such methods assign a parameter belonging to a particular state to each time point, rather than to a given segment.

This work is based on the method outlined in [9] that incorporates segment parameter repetition in the estimation process of a change point detection algorithm, when estimating changes in autoregressive processes. Namely, the work proposed to extend the Bayesian change point detection algorithm proposed in [4], by incorporating a parameter class variable that utilises a Dirichlet process prior for identifying the number of distinct segment parameters. By including the parameter class variable, segment parameter repetition is captured during the estimation process of the transition times (change point locations), resulting in more robust segmentation.

II Background

II-A MCMC Change Point Detection

In this work, we consider the change point detection algorithm proposed in [4]. That is, consider a set of transition times $\boldsymbol{\tau}_{K}=[\tau_{1},...,\tau_{K}]$ , where $\tau_{0}=1$ and $\tau_{K+1}=N$ , that partition a data set $\mathbf{x}$ into $K+1$ segments. For each segment (consisting of the data points between the time indices $\tau_{i}+1\leq\tau\leq\tau_{i+1}$ ) there exists the following functional relationship between the data points $\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}$ and the statistical parameter $\boldsymbol{\phi}_{i}\in\mathbb{R}^{D}$ , that is

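$$\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}=f_{d}(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}},\boldsymbol{\phi}_{i})+\boldsymbol{\text{n}}_{\tau_{i}+1:\tau_{i+1}}$$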

for segments indexed by $i=\{0,\dots,K\}$ , where $\boldsymbol{\text{n}}_{\tau_{i}+1:\tau_{i+1}}$ is a set of i.i.d. Gaussian noise samples with zero mean and variance $\sigma_{i}^{2}$ . In particular, an example of the functional relationship $f_{d}(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}},\boldsymbol{\phi}_{i})$ is given by the autoregressive model of order $D-1$ , that is

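$$f_{d}(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}},\boldsymbol{\phi}_{i})=\begin{bmatrix}1&0&0&\dots&0\\1&x_{\tau_{i}+1}&0&\dots&0\\\vdots&\vdots&\vdots&\ddots&\vdots\\1&x_{\tau_{i+1}-1}&x_{\tau_{i+1}-2}&\dots&x_{\tau_{i+1}-p+1}\end{bmatrix}\begin{bmatrix}\phi^{0}_{i}\\\vdots\\\phi^{0}_{D-1}\end{bmatrix}$$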

where $\boldsymbol{\phi}_{i}=[\phi^{0}_{i},...,\phi^{0}_{D-1}]^{T}$ . Accordingly, one can define the posterior distribution for the target parameters of interest; namely, the number of transition times $K$ and set of transition time points $\boldsymbol{\tau}_{K}$ , that is

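$$p(K,\boldsymbol{\tau}_{K}|\mathbf{x})\propto\int p(\mathbf{x}|\boldsymbol{\Phi},K,\boldsymbol{\tau}_{K})\,p(\boldsymbol{\Phi},K,\boldsymbol{\tau}_{K})\,d\boldsymbol{\Phi}$$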

where $\boldsymbol{\Phi}=[\boldsymbol{\phi}_{0},...,\boldsymbol{\phi}_{K}]$ corresponds to the vector of segment parameters that is treated as a nuisance parameter and thus integrated out of the posterior distribution. Finally, it should be noted that the likelihood function in the posterior distribution (2) assumes that the parameters $\boldsymbol{\phi}_{i}$ are distinct for each segment [4], that is

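$$p(\mathbf{x}|\boldsymbol{\Phi},K,\boldsymbol{\tau}_{K})=\prod_{i=0}^{K}p(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}|\boldsymbol{\phi}_{i})$$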
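The factorisation of the likelihood over segments can be illustrated with a minimal sketch for the autoregressive example. The function names, the 0-based half-open segment boundaries, and the choice to zero-pad lags that fall before the segment start are assumptions made for illustration; they are not taken from [4].

```python
import numpy as np

def ar_design(x_seg, order):
    """Design matrix with rows [1, x[t-1], ..., x[t-order]]; lags before the segment start are 0."""
    n = len(x_seg)
    G = np.zeros((n, order + 1))
    G[:, 0] = 1.0                                   # intercept column
    for lag in range(1, order + 1):
        G[lag:, lag] = x_seg[:max(n - lag, 0)]      # shift the segment down by `lag` rows
    return G

def segment_loglik(x_seg, phi, sigma2):
    """Gaussian log-likelihood of one segment under x = G phi + n, n ~ N(0, sigma2 I)."""
    x_seg = np.asarray(x_seg, dtype=float)
    phi = np.asarray(phi, dtype=float)
    resid = x_seg - ar_design(x_seg, len(phi) - 1) @ phi
    n = len(x_seg)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2

def total_loglik(x, taus, phis, sigma2s):
    """Sum of segment log-likelihoods, i.e. the log of the product over segments."""
    x = np.asarray(x, dtype=float)
    bounds = [0, *taus, len(x)]                     # assumed 0-based, half-open boundaries
    return sum(segment_loglik(x[a:b], phi, s2)
               for a, b, phi, s2 in zip(bounds[:-1], bounds[1:], phis, sigma2s))
```

Because each segment has its own parameter vector, moving a change point only alters the terms of the two segments adjacent to it.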

II-B Dirichlet Process Mixture Model

Consider a set of $N$ exchangeable data points $\mathbf{x}$ , such that the probability distribution of the data points $p(\mathbf{x})$ can be represented by a set of $V$ class distributions, that is

$$p(\mathbf{x})=\sum_{v=1}^{V}\pi_{v}f(\mathbf{x}|\theta_{v})$$

where $\pi_{v}$ is the mixing coefficient and $\theta_{v}$ corresponds to the class parameter/s of the probability distribution $f(.)$ . The Dirichlet process mixture model (DPMM) [7] can be seen as the limiting case of the mixture model (MM) specified in (4). This can be seen by first considering the re-formulation of the mixture model shown in (4), by introducing the class indicator random variable $c_{i}$ for the $i^{\text{th}}$ data point

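$$\begin{aligned}x_{i}|c_{i},\boldsymbol{\theta}&\sim f(\theta_{c_{i}})\\c_{i}|\boldsymbol{\pi}&\sim\text{Discrete}(\pi_{1},\dots,\pi_{V})\\\theta_{v}&\sim G_{0}\\\boldsymbol{\pi}|\alpha&\sim\text{Dirichlet}(\alpha/V,\dots,\alpha/V)\end{aligned}$$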

where $\boldsymbol{\theta}=[\theta_{1},\dots,\theta_{V}]$ , $\boldsymbol{\pi}=[\pi_{1},\dots,\pi_{V}]$ and $G_{0}$ denotes the prior distribution on the parameters $\theta_{v}$ . The probability of selecting a given class $c_{i}$ is determined by the mixing coefficients $\boldsymbol{\pi}$ . In particular, the joint distribution of the class indicator random variables is given by

$$p(c_{1},...,c_{N}|\boldsymbol{\pi})=\prod_{v=1}^{V}\pi_{v}^{n_{v}}$$

where $n_{v}$ corresponds to the number of data points assigned to class $v$ . The distribution of the $i^{\text{th}}$ class indicator variable given all other class variables $\boldsymbol{\text{c}}_{-i}$ (that is, excluding $c_{i}$ ) is given by the following

$$p(c_{i}=v|\boldsymbol{\text{c}}_{-i},\alpha)=\frac{n_{-i,v}+\alpha/V}{N-1+\alpha}$$

where $n_{-i,v}$ is the number of data points (excluding $x_{i}$ ) assigned to class $v$ and $p(\boldsymbol{\pi}|\alpha)$ is a symmetric Dirichlet distribution with parameter $\alpha/V$ .

Taking the number of classes $V\rightarrow\infty$ and assuming that there exists a finite number of represented classes $V^{\prime}$ , such that the number of data points assigned to each class is greater than zero; accordingly the conditional probability in (3) for represented classes is given by

$$p(c_{i}=v|\boldsymbol{\text{c}}_{-i},\alpha)=\frac{n_{-i,v}}{N-1+\alpha}$$

That is, (4) is the probability of assigning the class variable $c_{i}$ to the represented class $v$ . Furthermore, as $V\rightarrow\infty$ there exists a countably infinite number of classes that excludes the represented classes such that $c_{i}\neq c_{l}$ for all $l\neq i$ . In order to calculate the probability of assigning $c_{i}$ to a new class, consider the following

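$$\sum_{v^{\prime}=1}^{\infty}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)=\sum_{v^{\prime}=1}^{V^{\prime}}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)+\sum_{v^{\prime}=V^{\prime}+1}^{\infty}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)=1$$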

where $\sum_{v^{\prime}=1}^{V^{\prime}}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)$ corresponds to the sum of the probabilities of the represented classes (that is, classes to which at least one data point is assigned) and $\sum_{v^{\prime}=V^{\prime}+1}^{\infty}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)$ to the sum of the probabilities of all other classes. Accordingly, the probability of assigning $c_{i}$ to a new class (that is, a class other than the represented classes) is given by the following

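$$p(c_{i}\neq c_{l}\quad\text{for all}\quad i\neq l|\boldsymbol{\text{c}}_{-i})=1-\sum_{v^{\prime}=1}^{V^{\prime}}p(c_{i}=v^{\prime}|\boldsymbol{\text{c}}_{-i},\alpha)=\frac{\alpha}{N-1+\alpha}$$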
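The limiting prior over class assignments can be checked numerically. The sketch below computes, for a hypothetical label vector, the probability of each represented class and of a new class; `class_prior` is an illustrative name, not part of any library.

```python
import numpy as np

def class_prior(labels, i, alpha):
    """Prior for c_i given c_{-i} in the V -> infinity limit.

    Returns the represented class ids, their probabilities n_{-i,v} / (N - 1 + alpha),
    and the new-class probability alpha / (N - 1 + alpha).
    """
    others = np.delete(np.asarray(labels), i)       # c_{-i}
    classes, counts = np.unique(others, return_counts=True)
    denom = len(labels) - 1 + alpha                 # N - 1 + alpha
    return classes, counts / denom, alpha / denom

# Example: four labels, query the last one with alpha = 1.
classes, probs, p_new = class_prior([0, 0, 1, 2], i=3, alpha=1.0)
```

In this example the represented classes 0 and 1 receive probabilities 0.5 and 0.25, and a new class receives 0.25, so the probabilities sum to one as required.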

The conditional posterior distribution of the class variable $c_{i}$ , given (4) and (6), can then be determined as follows

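$$p(c_{i}=v|\boldsymbol{\text{c}}_{-i},x_{i},\boldsymbol{\theta})\propto\frac{n_{-i,v}}{N-1+\alpha}L(x_{i}|\theta_{v})$$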

for a class where $n_{-i,v}>0$ , where $L(x_{i}|\theta_{v})$ corresponds to the likelihood. The conditional posterior probability for assigning a data point to a new class is given by

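$$p(c_{i}\neq c_{l}\quad\text{for all}\quad i\neq l|\boldsymbol{\text{c}}_{-i},x_{i})\propto\frac{\alpha}{N-1+\alpha}\int L(x_{i}|\theta)\,dG_{0}(\theta)$$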

Finally, it should be noted that the Dirichlet process mixture model can be written as follows

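$$\begin{aligned}x_{i}|\theta_{i}&\sim f(\theta_{i})\\\theta_{i}&\sim G\\G&\sim\text{DP}(G_{0},\alpha)\end{aligned}$$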

where $G$ is drawn from the Dirichlet process with base measure $G_{0}$ .
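A single Gibbs update of a class indicator combines the prior weights with the likelihood. The sketch below assumes, purely for illustration, a univariate Gaussian likelihood with known variance `s2` and a Gaussian base measure $G_{0}=\mathcal{N}(m_{0},t_{2})$ , for which the new-class integral has the closed form $\mathcal{N}(x_{i}|m_{0},s_{2}+t_{2})$ . These distributional choices and all function names are assumptions, not part of the model in this work.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def class_posterior(x_i, counts, thetas, alpha, s2=1.0, m0=0.0, t2=10.0):
    """Normalised probabilities over [represented classes..., new class].

    counts[v] is n_{-i,v}; thetas[v] is the mean of represented class v.
    """
    counts = np.asarray(counts, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    denom = counts.sum() + alpha                                  # equals N - 1 + alpha
    w_old = counts / denom * normal_pdf(x_i, thetas, s2)          # represented classes
    w_new = alpha / denom * normal_pdf(x_i, m0, s2 + t2)          # likelihood integrated over G_0
    w = np.append(w_old, w_new)
    return w / w.sum()

def sample_class(x_i, counts, thetas, alpha, rng, **kw):
    """Draw a class index; the value len(counts) means 'open a new class'."""
    p = class_posterior(x_i, counts, thetas, alpha, **kw)
    return int(rng.choice(len(p), p=p))
```

A full sampler would sweep this update over all indicators and then resample the class parameters, as in the algorithms described in [7].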

III Proposed Method

In this work we propose to extend the model developed in [4], by assigning a class variable $c_{i}$ for the parameters $\phi_{i}$ in each segment (for $i=0,...,K$ ) thereby capturing the dependencies between segment parameters for improved change point estimation. This performance improvement is achieved by concatenating data points of segments with the same class labels; thereby providing more degrees of freedom when assessing if a change point exists. In the sections that follow, we will provide a detailed description of the proposed method.
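The pooling of data points across segments that share a class label can be sketched as follows. The helper name `pool_by_class` and the 0-based half-open boundary convention are assumptions made for illustration only.

```python
import numpy as np

def pool_by_class(x, taus, labels):
    """Concatenate the data of all segments sharing a class label.

    taus are interior change points (0-based, half-open), labels has one entry per segment.
    Returns {label: pooled data}; the length of each pooled array plays the role of d_v.
    """
    x = np.asarray(x)
    bounds = [0, *taus, len(x)]
    pooled = {}
    for a, b, c in zip(bounds[:-1], bounds[1:], labels):
        pooled.setdefault(c, []).append(x[a:b])
    return {c: np.concatenate(parts) for c, parts in pooled.items()}

# Three segments where the first and third share class 0.
pooled = pool_by_class(np.arange(10), taus=[3, 6], labels=[0, 1, 0])
```

Here class 0 pools seven data points from two separated segments, whereas treating the segments as distinct would leave three and four points respectively.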

III-A Bayesian Model

In particular, we modify the set of distinct segment parameters, $\boldsymbol{\Phi}=[\boldsymbol{\phi}_{0},...,\boldsymbol{\phi}_{K}]$ , by introducing the class variable $c_{i}$ , such that, $\boldsymbol{\Phi}=[\boldsymbol{\phi}_{c_{0}},...,\boldsymbol{\phi}_{c_{K}}]$ ; where the parameters in each segment are effectively then being drawn from a set of class parameters, $\boldsymbol{\Phi}^{c}=[\boldsymbol{\phi}^{c}_{1},...,\boldsymbol{\phi}^{c}_{V}]$ , where $V\leq K+1$ . That is, each segment parameter can be formulated as a multivariate Gaussian mixture model (the Gaussian assumption enables tractable posterior distributions) of the class parameters $\boldsymbol{\phi}^{c}_{v}$

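$$p(\boldsymbol{\Phi}|\boldsymbol{\Phi}^{c},\boldsymbol{\Sigma}^{c},\boldsymbol{\pi})=\sum_{v=1}^{V}\pi_{v}\mathcal{MN}(\boldsymbol{\Phi}|\boldsymbol{\phi}^{c}_{v},\Sigma^{c}_{v})$$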

where $\boldsymbol{\Sigma}^{c}=[\Sigma^{c}_{1},\dots,\Sigma^{c}_{V}]$ and $\Sigma^{c}_{v}\in\mathbb{R}^{D\times D}$ . Accordingly, we present a change point estimation model that incorporates the class variable $c_{i}$ and a Dirichlet process prior on the likelihood on the class probabilities, that is

[TABLE]

for $i=0,...,K$ . Furthermore, $\text{Bin}(.)$ corresponds to a Binomial distribution, $G_{0}$ is the joint prior distribution of both the class parameter $\boldsymbol{\phi}^{c}_{i}$ and the variance of the class parameter $\Sigma^{c}_{i}$ , the prior distribution of the variance of the data points with the same class label is given by $G_{\sigma^{2}}$ and $f_{j}(.)$ corresponds to the joint Normal distribution.

The posterior distribution of the model in (11) consists of the following parameters, $\{\boldsymbol{\tau}_{K},K,\boldsymbol{\text{c}}_{K},\hat{\boldsymbol{\Phi}^{c}},\hat{\boldsymbol{\Sigma}}^{c},\boldsymbol{\sigma}^{2}\}$ , where $\boldsymbol{\sigma}^{2}=[\sigma_{1}^{2},\dots,\sigma_{V}^{2}]$ . Inference of the parameters is carried out using a Metropolis-Hastings-within-Gibbs sampling scheme. The Gibbs moves are performed on each parameter in the set $\{\boldsymbol{\text{c}}_{K},\hat{\boldsymbol{\Phi}^{c}},\hat{\boldsymbol{\Sigma}}^{c},\boldsymbol{\sigma}^{2}\}$ , while a variation of the Metropolis-Hastings algorithm is used to obtain samples for the parameters $\{\boldsymbol{\tau}_{K},K\}$ . The marginal posterior distributions of the parameters $\{\boldsymbol{\text{c}}_{K},\hat{\boldsymbol{\Phi}^{c}},\hat{\boldsymbol{\Sigma}}^{c},\boldsymbol{\sigma}^{2}\}$ are given as follows. The marginal posterior distribution of the class parameters $\boldsymbol{\phi}^{c}_{v}$ , assuming the conjugate prior distribution $p(\boldsymbol{\phi}^{c}_{v}|\boldsymbol{\lambda}_{\phi},\delta)\sim\mathcal{N}(\boldsymbol{\lambda}_{\phi},\delta\sigma^{2}_{v}\text{I}_{D})$ , where $\boldsymbol{\lambda}_{\phi}\in\mathbb{R}^{D}$ and $\text{I}_{D}$ is the identity matrix of dimension $D$ , is given by

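$$p(\boldsymbol{\phi}^{c}_{v}|\boldsymbol{\text{c}}_{K},\boldsymbol{\tau}_{K},K,\Sigma^{c}_{v},\mathbf{x})\sim\mathcal{MN}(\boldsymbol{\mu}^{\phi}_{v},\Sigma^{\phi}_{v})\quad v=1,...,V$$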

where $\boldsymbol{\mu}^{\phi}_{v}=\Sigma_{\phi}^{-1}\left(n_{v}\bar{\boldsymbol{\phi}}^{c}_{v}(\Sigma^{c})^{-1}+\boldsymbol{\lambda}_{\phi}^{T}\delta^{-1}\sigma^{-2}_{v}\text{I}_{D}\right)$ and $\Sigma^{\phi}_{v}=(n_{v}(\Sigma^{c})^{-1}+\delta^{-1}\sigma^{-2}_{v}\text{I}_{D})^{-1}$ for $\bar{\boldsymbol{\phi}}^{c}_{v}=\frac{1}{n_{v}}\sum_{i:c_{i}=v}\boldsymbol{\phi}_{i}^{T}$ , where $n_{v}$ corresponds to the number of segment parameters $\boldsymbol{\phi}_{i}$ assigned to class $v$ . The marginal posterior distribution (Inverse Wishart distributed) of the class covariance matrix $\Sigma^{c}_{v}$ , given the conjugate prior distribution $p(\Sigma^{c}_{v}|\beta,\boldsymbol{\Omega})\sim\mathcal{IW}(\beta,\boldsymbol{\Omega})$ where $\beta\in\mathbb{R}$ and $\boldsymbol{\Omega}\in\mathbb{R}^{D\times D}$ , is given by the following

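$$p(\Sigma^{c}_{v}|\boldsymbol{\text{c}}_{K},\boldsymbol{\tau}_{K},K,\boldsymbol{\phi}^{c}_{v},\mathbf{x})\sim\mathcal{IW}(\alpha^{\phi}_{v},\boldsymbol{B}^{\phi}_{v})\quad v=1,...,V$$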

where $\alpha^{\phi}_{v}=n_{v}+\beta$ and $\boldsymbol{B}^{\phi}_{v}=\beta\boldsymbol{\Omega}+\sum_{i:c_{i}=v}(\boldsymbol{\phi}_{i}-\boldsymbol{\phi}^{c}_{i})(\boldsymbol{\phi}_{i}-\boldsymbol{\phi}^{c}_{i})^{T}$ . Furthermore, the posterior distribution of the variance $\sigma^{2}_{v}$ for the data points from segments with the same class label $v$ , along with the inverse Gamma prior distribution, $p(\sigma^{2}_{v}|\nu,\gamma)\sim\mathcal{IG}(\nu,\gamma)$ , is given by (for $v=1,...,V$ )

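$$p(\sigma^{2}_{v}|\boldsymbol{\text{c}}_{K},\boldsymbol{\tau}_{K},K,\mathbf{x})\sim\mathcal{IW}(\nu+d_{v},\gamma+Y_{v}^{T}\boldsymbol{\text{P}}_{v}Y_{v})$$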

where $d_{v}$ is the number of data points with label $v$ , $Y_{v}$ is the concatenated vector of all data points with the same segment label $v$ , $\boldsymbol{\text{P}}_{v}=\left(\mathbf{I}_{d_{v}}-\boldsymbol{\text{G}}_{v}\boldsymbol{\text{M}}_{v}\boldsymbol{\text{G}}_{v}^{T}\right)$ , with $\boldsymbol{\text{M}}_{v}=(\boldsymbol{\text{G}}_{v}^{T}\boldsymbol{\text{G}}_{v}+\delta^{-1}\text{I}_{D})^{-\frac{1}{2}}$ , and $\boldsymbol{\text{G}}_{v}$ is the concatenation of input data points such that $Y_{v}=\boldsymbol{\text{G}}_{v}\boldsymbol{\phi}^{c}_{v}$ is satisfied. Finally, the marginal posterior distribution for the class labels $c_{i}$ is given by $p(c_{i}=v|\boldsymbol{\text{c}}_{-i},\boldsymbol{\phi}_{i},\boldsymbol{\phi}^{c}_{v},\Sigma_{v}^{c})$ for $n_{-i,v}>0$ (shown in (7)) with likelihood $L(\boldsymbol{\phi}_{i}|\boldsymbol{\phi}^{c}_{v},\Sigma_{v}^{c})\sim\mathcal{MN}(\boldsymbol{\phi}_{i}|\boldsymbol{\phi}^{c}_{v},\Sigma_{v}^{c})$ , while the posterior probability for a new class, $p(c_{i}\neq c_{l}\quad\text{for all}\quad i\neq l|\boldsymbol{\text{c}}_{-i},\boldsymbol{\phi}_{i})$ , is given by (8) (see [] for more details).
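The Inverse Wishart update for the class covariance, with $\alpha^{\phi}_{v}=n_{v}+\beta$ and $\boldsymbol{B}^{\phi}_{v}$ equal to $\beta\boldsymbol{\Omega}$ plus the scatter of the segment parameters about the class parameter, can be sketched as below. The function name is hypothetical, and the sketch only computes the posterior parameters; drawing from the resulting distribution is left to whichever sampler is available.

```python
import numpy as np

def iw_posterior_params(phis, labels, v, phi_c_v, beta, Omega):
    """Posterior Inverse Wishart parameters (alpha_v, B_v) for class v.

    phis: (K+1, D) array of segment parameters; labels: class label per segment;
    phi_c_v: (D,) class parameter; beta, Omega: prior degrees of freedom and scale.
    """
    phis = np.asarray(phis, dtype=float)
    mask = np.asarray(labels) == v
    diffs = phis[mask] - np.asarray(phi_c_v, dtype=float)   # rows are phi_i - phi^c_v
    alpha_v = int(mask.sum()) + beta                        # n_v + beta
    B_v = beta * np.asarray(Omega, dtype=float) + diffs.T @ diffs   # beta*Omega + sum of outer products
    return alpha_v, B_v

alpha_v, B_v = iw_posterior_params(
    phis=[[1.0, 0.0], [3.0, 0.0], [0.0, 5.0]], labels=[0, 0, 1], v=0,
    phi_c_v=[2.0, 0.0], beta=3.0, Omega=np.eye(2))
```

With two segments assigned to class 0, the degrees of freedom grow from 3 to 5, and only the first diagonal entry of the scale matrix increases because the two segment parameters differ from the class parameter only in their first component.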

The conditional posterior distribution of the parameters $\{\boldsymbol{\tau}_{K},K\}$ can be obtained by first considering the following marginal posterior distribution, $p(\boldsymbol{\tau}_{K},K|\lambda,\boldsymbol{\text{c}}_{K},\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},\boldsymbol{\text{x}})$ , where $\lambda$ is a hyperparameter of the following prior distribution, $p(\boldsymbol{\tau}_{K},K|\lambda)=\lambda^{K}(1-\lambda)^{T-K-1}$ . Having selected the appropriate conjugate priors, we can integrate out the nuisance parameters $\{\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},\lambda\}$ , thereby significantly reducing the number of parameters required to specify the posterior distribution for $\{\boldsymbol{\tau}_{K},K\}$ . To this end, we first obtain the following posterior distribution (that incorporates the prior distributions of the nuisance parameters)

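$$p(\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},\boldsymbol{\tau}_{K},K,\lambda|\boldsymbol{\text{c}}_{K},\mathbf{x})\propto p(\mathbf{x}|\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},K,\boldsymbol{\tau}_{K},\boldsymbol{\text{c}}_{K})\times p(K,\boldsymbol{\tau}_{K}|\lambda)\,p(\lambda)\prod_{v=1}^{V}p(\boldsymbol{\phi}^{c}_{v}|\boldsymbol{\lambda}_{\phi},\delta)\,p(\sigma^{2}_{v}|\nu,\gamma)$$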

where $p(\lambda)$ has uniform probability over the interval $[0,1]$ and the likelihood function is given by

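$$p(\mathbf{x}|\boldsymbol{\Phi}^{c},\boldsymbol{\sigma}^{2},K,\boldsymbol{\tau}_{K},\boldsymbol{\text{c}}_{K})=\prod_{v=1}^{V}\prod_{i:c_{i}=v}p(\boldsymbol{\text{x}}_{\tau_{i}+1:\tau_{i+1}}|\boldsymbol{\phi}^{c}_{v},\sigma^{2}_{v})$$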

where, by combining data points from the same segment class label $v$ , we can potentially obtain more accurate parameter estimates owing to the increased number of data points available for estimating $\{\boldsymbol{\tau}_{K},K\}$ . Integration of (15) with respect to the parameters $\{\boldsymbol{\phi}^{c}_{v},\sigma_{v}^{2},\lambda\}$ results in the following expression for the conditional posterior distribution of the parameters $\{\boldsymbol{\tau}_{K},K\}$

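$$p(\boldsymbol{\tau}_{K},K|\boldsymbol{\text{c}}_{K},\mathbf{x})\propto\prod_{v=1}^{V}\frac{2^{\frac{\nu}{2}}}{\Gamma(\frac{\nu}{2})}\Gamma(K+1)\Gamma(N-K+1)\left(\frac{\gamma}{2}\right)^{\frac{\nu}{2}}\times\Gamma\left(\frac{d_{v}+\nu}{2}\right)\pi^{-\frac{d_{v}}{2}}\left[\gamma+Y_{v}^{T}\boldsymbol{\text{P}}_{v}Y_{v}\right]^{-\frac{d_{v}+\nu}{2}}|\boldsymbol{\text{M}}_{v}|^{-\frac{1}{2}}$$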

Finally, we note that there are some challenges in drawing samples from (16) due to the dependence on $\boldsymbol{\text{c}}_{K}$ , which we address in the next section.

III-B Gibbs Sampling

A summary of the Gibbs sampling scheme for drawing samples for the parameters $\{\boldsymbol{\tau}_{K},K,\boldsymbol{\text{c}}_{K},\hat{\boldsymbol{\Phi}^{c}},\hat{\boldsymbol{\Sigma}}^{c},\boldsymbol{\sigma}^{2}\}$ is provided in Algorithm 1. Observe that the parameters $\{\boldsymbol{\tau}_{K},K\}$ are dependent on the segment class variables $\{\boldsymbol{\text{c}}_{K}\}$ , and in turn, the segment class variables are dependent on the parameters $\{\boldsymbol{\tau}_{K},K,\hat{\boldsymbol{\Phi}^{c}},\hat{\boldsymbol{\Sigma}}^{c},\boldsymbol{\sigma}^{2}\}$ . Furthermore, the marginal posterior distributions for both $\{\boldsymbol{\tau}_{K},K\}$ and $\{\boldsymbol{\text{c}}_{K}\}$ are intractable and therefore require sampling schemes. In particular, a nested Gibbs sampling scheme was used to draw samples for the class variables $\{\boldsymbol{\text{c}}_{K}\}$ , due to the dependence on the parameters $\{\hat{\boldsymbol{\Phi}^{c}},\hat{\boldsymbol{\Sigma}}^{c},\boldsymbol{\sigma}^{2}\}$ (as shown in Algorithm 1). A modification of the Metropolis-Hastings algorithm outlined in [4] (having integrated out the nuisance parameters) was used to draw samples from the conditional posterior distribution, $p(\boldsymbol{\tau}_{K},K|\boldsymbol{\text{c}}_{K},\mathbf{x})$ ; specifically, a variation was developed that incorporates the segment labels $\boldsymbol{\text{c}}_{K}$ .

Given the $j^{th}$ samples, $\{\boldsymbol{\tau}_{K},K\}_{j}$ , we first select, with a certain probability, one of the following:

• $K\rightarrow K+1$ : create a new change point (birth), with probability $b$

• $K\rightarrow K-1$ : remove an existing change point (death), with probability $d$

• $K\rightarrow K$ : update the change point positions, with probability $u$

where $b=d=u$ for $0<K<K_{max}$ , and $b+d+u=1$ for $0\leq K\leq K_{max}$ . Furthermore, for $K=0$ , $d=0$ and $b=u$ , while for $K=K_{max}$ , $b=0$ and $d=u$ .
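The move-selection rule above can be written as a small helper. The function names are illustrative, and the requirement $K_{max}\geq 1$ is an added assumption, since the text does not cover the degenerate case $K_{max}=0$ .

```python
import numpy as np

def move_probabilities(K, K_max):
    """Return (b, d, u): probabilities of a birth, death, and update move."""
    if K_max < 1 or not 0 <= K <= K_max:
        raise ValueError("require K_max >= 1 and 0 <= K <= K_max")
    if K == 0:                 # nothing to remove: d = 0 and b = u
        return 0.5, 0.0, 0.5
    if K == K_max:             # no room to add: b = 0 and d = u
        return 0.0, 0.5, 0.5
    return 1 / 3, 1 / 3, 1 / 3  # interior case: b = d = u

def choose_move(K, K_max, rng):
    """Draw one of 'birth', 'death', 'update' according to (b, d, u)."""
    b, d, u = move_probabilities(K, K_max)
    return str(rng.choice(["birth", "death", "update"], p=[b, d, u]))
```

For example, `choose_move(0, 5, np.random.default_rng(1))` can only return `"birth"` or `"update"`, since a death move has probability zero when there are no change points.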

A birth move consists of proposing a new transition time $\tau_{prop}$ , with the following proposal distribution, $q(\boldsymbol{\tau}_{K+1}|\boldsymbol{\tau}_{K})=q(\tau_{prop}|\boldsymbol{\tau}_{K})=\{\frac{1}{N-K-2}\quad\text{for}\quad\tau_{prop}\in S_{prop}\}$ , where $\boldsymbol{\tau}_{K+1}$ corresponds to the set of change points that includes both $\tau_{prop}$ and $\boldsymbol{\tau}_{K}$ , while $S_{prop}$ corresponds to the set of time indices $[2,N-1]$ excluding the time points $\boldsymbol{\tau}_{K}$ . Conversely, the proposal distribution for removing $\tau_{prop}$ from $\boldsymbol{\tau}_{K+1}$ , is given by $q(\boldsymbol{\tau}_{K}|\boldsymbol{\tau}_{K+1})=\{\frac{1}{K+1}\quad\text{for}\quad\tau_{prop}\in\boldsymbol{\tau}_{K+1}\}$ . Accordingly, the proposed transition time $\tau_{prop}$ is accepted with the following probability, $\alpha_{birth}=\text{min}\{1,r_{birth}\}$ ,

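$$r_{birth}=\frac{p(\boldsymbol{\tau}_{K+1},K+1|\boldsymbol{\text{c}}_{K+1},\mathbf{x})}{p(\boldsymbol{\tau}_{K},K|\boldsymbol{\text{c}}_{K},\mathbf{x})}\,\frac{q(\boldsymbol{\tau}_{K}|\boldsymbol{\tau}_{K+1})\,q(K|K+1)}{q(\boldsymbol{\tau}_{K+1}|\boldsymbol{\tau}_{K})\,q(K+1|K)}$$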

with $q(K+1|K)=b$ and $q(K|K+1)=d$ , corresponding to the proposal distributions for the unit increment and decrement (respectively) of the parameter $K$ . It should be noted that in order to determine the acceptance ratio $r_{birth}$ , we need to evaluate $p(\boldsymbol{\tau}_{K+1},K+1|\boldsymbol{\text{c}}_{K+1},\mathbf{x})$ , where there is now a dependence on $\boldsymbol{\text{c}}_{K+1}$ . This dependence arises due to the proposed transition time $\tau_{prop}$ , splitting the segment between the time indices $\{\tau_{i},\tau_{i+1}\}$ into $\{\tau_{i},\tau_{prop},\tau_{i+1}\}$ , as well as, splitting the segment class variable $\{c_{i}\}$ , into two new class variables $\{\hat{c_{i}},\hat{c}_{i+1}\}$ . As we have not yet inferred the new class variables from the conditional class posterior distributions, we assume that the two classes $\{\hat{c_{i}},\hat{c}_{i+1}\}$ are distinct (that is, $\hat{c}_{j}\neq c_{k}$ for all $j\neq k$ and $j=1,2$ ) and thus independent from all other segments, to circumvent the lack of information we have for assignment to an existing class (please refer to Figure ).
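The birth acceptance ratio is most safely evaluated in the log domain. The sketch below takes the two log posterior values as inputs, since evaluating $p(\boldsymbol{\tau}_{K},K|\boldsymbol{\text{c}}_{K},\mathbf{x})$ itself is outside its scope. The function names are hypothetical; `b` is the birth probability at $K$ and `d` is the death probability at $K+1$ .

```python
import math

def log_birth_ratio(log_post_new, log_post_old, N, K, b, d):
    """log r_birth for a proposed move from K to K+1 change points.

    log_post_new: log p(tau_{K+1}, K+1 | c_{K+1}, x); log_post_old: log p(tau_K, K | c_K, x).
    """
    log_q_forward = math.log(b) - math.log(N - K - 2)   # q(K+1|K) * q(tau_{K+1}|tau_K)
    log_q_reverse = math.log(d) - math.log(K + 1)       # q(K|K+1) * q(tau_K|tau_{K+1})
    return (log_post_new - log_post_old) + (log_q_reverse - log_q_forward)

def accept_birth(log_r, u):
    """Accept with probability min(1, r_birth); u is a Uniform(0, 1) draw with u > 0."""
    return math.log(u) < min(0.0, log_r)

# Equal posterior values, N = 12, K = 1, b = d = 1/3: the ratio reduces to (N-K-2)/(K+1) = 4.5.
log_r = log_birth_ratio(0.0, 0.0, N=12, K=1, b=1 / 3, d=1 / 3)
```

The death move can reuse the same quantity with the sign flipped, since its acceptance probability is $\text{min}\{1,r_{birth}^{-1}\}$ .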

The death move proposes to remove a transition time $\tau_{prop}$ , by choosing with uniform probability from the set $\boldsymbol{\tau}_{K}$ ; where the removal of $\tau_{prop}$ is accepted with probability $\alpha_{death}=\text{min}\{1,r_{birth}^{-1}\}$ . As in the previous case (birth move), we need to determine $r_{birth}$ , however, now we need to evaluate $p(\boldsymbol{\tau}_{K-1},K-1|\boldsymbol{\text{c}}_{K-1},\mathbf{x})$ . That is, the segments between the transition times, $\{\tau_{i},\tau_{prop}\}$ and $\{\tau_{prop},\tau_{i+2}\}$ where $\tau_{prop}=\tau_{i+1}$ , are combined into one segment $\{\tau_{i},\tau_{i+2}\}$ , along with the segment class variables $\{c_{i},c_{i+1}\}$ being combined into one segment with a new class variable $\{\hat{c}_{i}\}$ . Using the argument utilised for the birth of a change point we assign a distinct value to the new class variable, that is, $\hat{c}_{i}\neq c_{j}$ for all $j\neq i$ .

The update of the transition times is carried out by first removing the $j^{th}$ transition time index $\tau_{j}$ from $\boldsymbol{\tau}_{K}$ and then proposing a new change point at some new location, for all $j=\{1,\dots,K\}$ . That is, the death move is first applied, followed by a birth move, for all transition times in $\boldsymbol{\tau}_{K}$ .

Bibliography

[1] A. Brandt, “Detecting and estimating parameter jumps using ladder algorithms and likelihood ratio tests,” IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1017–1020, 1983.
[2] R. Killick, P. Fearnhead, and I. A. Eckley, “Optimal detection of changepoints with a linear computational cost,” Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012.
[3] F. Desobry, M. Davy, and C. Doncarli, “An online kernel change detection algorithm,” IEEE Transactions on Signal Processing, vol. 53, no. 8, pp. 2961–2974, 2005.
[4] E. Punskaya, C. Andrieu, A. Doucet, and W. Fitzgerald, “Bayesian curve fitting using MCMC with applications to signal segmentation,” IEEE Transactions on Signal Processing, vol. 50, no. 3, pp. 747–758, 2002.
[5] P. Fearnhead, “Exact Bayesian curve fitting and signal segmentation,” IEEE Transactions on Signal Processing, vol. 53, no. 6, pp. 2160–2166, 2005.
[6] I. Rezek and S. Roberts, “Ensemble hidden Markov models with extended observation densities for biosignal analysis,” Springer London, pp. 419–450, 2005.
[7] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” Journal of Computational and Graphical Statistics, vol. 9, no. 2, pp. 249–265, 2000.
[8] C. E. Rasmussen, “The infinite Gaussian mixture model,” Advances in Neural Information Processing Systems 12, pp. 554–560, 2000.