Infinite Mixture Model of Markov Chains

Jan Reubold; Thorsten Strufe; Ulf Brefeld

arXiv:1706.06178·stat.ML·June 21, 2017

TL;DR

This paper introduces a Bayesian nonparametric mixture model for categorical time series that captures multiple underlying patterns, improving segmentation and prediction with interpretable results.

Contribution

It extends hierarchical hidden Markov models by incorporating structural information and offers an efficient inference scheme for better pattern detection.

Findings

01. Model effectively identifies underlying patterns in data.

02. Achieves superior segmentation and prediction performance.

03. Results are interpretable and applicable to real-world data.

Abstract

We propose a Bayesian nonparametric mixture model for prediction and information-extraction tasks with an efficient inference scheme. It models categorical-valued time series that exhibit dynamics from multiple underlying patterns (e.g. user behavior traces). We simplify the idea of capturing these patterns by hierarchical hidden Markov models (HHMMs) and extend the existing approaches by the additional representation of structural information. Our empirical results are based on both synthetic and real-world data. They indicate that the results are easily interpretable, and that the model excels at segmentation and prediction performance: it successfully identifies the generating patterns and can be used for effective prediction of future observations.

Equations (43)

$$\beta_{i}^{\prime}\sim\mbox{Beta}(1,\gamma),\qquad \beta_{i}=\beta_{i}^{\prime}\prod_{k=1}^{i-1}(1-\beta_{k}^{\prime}),\qquad i=1,2,\dots$$

$$G=\sum_{k=1}^{\infty}\beta_{k}\,\delta_{\tilde{\theta}_{k}}.$$

$$\pi_{j,\cdot}\sim\mbox{SBP}_{2}\left(\alpha+\kappa,\frac{\alpha\beta+\kappa\delta_{j}}{\alpha+\kappa}\right),$$

$$m_{T+1,T}(i)=1,\qquad m_{t,t-1}(i)=\begin{cases}m_{t+1,t}(i)&\text{if }r_{t}=B;\\ m_{t+1,t}(i)\cdot\beta_{i}\cdot\theta_{i,p_{t},B}&\text{if }y_{t}=B;\\ \Omega_{t,i}&\text{otherwise}.\end{cases}$$

$$L_{t,i}^{intra}=\theta_{i,p_{t},y_{t}}\cdot\psi_{i,p_{t}},$$

$$L_{t,i}^{inter}=[\beta_{i}\cdot\psi_{i,B}\cdot\pi_{i,j}]\cdot[\psi_{i,r_{t}}\cdot\theta_{i,r_{t},B}\cdot\theta_{j,B,y_{t}}].$$

$$\Omega_{t,i}\propto\sum_{j=1}^{L}\left(L_{i,j}^{intra}\cdot\mathbb{I}(i=j)+L_{i,j}^{inter}\right)\cdot m_{t+1,t}(j).$$

$$\omega_{t}\sim\mbox{Ber}\left(\frac{L_{t}^{intra}}{\sum_{j=1}L_{t,j}^{inter}+L_{t}^{intra}}\right),$$

$$p(z_{t}\mid\bullet)\propto\begin{cases}\mathbb{I}(z_{t}=z_{t-1})&\mbox{if }y_{t}=B\mbox{ or }\omega_{t-1}=0\\ \rho_{t}\cdot m_{t+1,t}(z_{t})&\mbox{if }p_{t}=B\mbox{ or }\omega_{t-1}=1\end{cases}$$

$$z_{t}\sim\mbox{Mu}\left(\frac{\sum_{i=1}^{L}p(z_{t}=i\mid\bullet)\,\mathbb{I}(z_{t}=i)}{\sum_{i=1}^{L}p(z_{t}=i\mid\bullet)}\right).$$

Inline symbols: $\pi_{ji}^{\prime}$, $\pi_{ji}$, $\beta$, $\pi_{j,\cdot}$, $\theta_{i,\cdot,\cdot}$, $\omega_{t}$, $p_{t}$, $\psi_{i}$, $L_{t}^{intra}$, $L_{t,j}^{inter}$, $p(z_{t}\mid\bullet)$, $\rho_{t}$, $\pi_{i}$, $\theta_{i,k}$

Taxonomy

Topics: Bayesian Methods and Mixture Models · Mathematical Dynamics and Fractals · Algorithms and Data Compression

Full text

Infinite Mixture Model of Markov Chains

Jan Reubold

Thorsten Strufe

Ulf Brefeld

TU Dresden

Leuphana University

1 Introduction

Assume that the behavior of users follows intentions. In the context of web interaction, for example, users may want to look for information on a specific topic or check their e-mails. In order to fulfill an intention, they must complete a number of actions, such as requesting a certain web page, often in a certain order. If we assume that similar sequences of actions belong to the same or a similar intention, we should be able to recognize an intention given a sequence of actions. Furthermore, given an entire data set of such sequences, we should be able to identify the intentions themselves by recognizing recurring patterns within the data. Generalizing from this idea, one can think of a two-level hierarchy of dynamics: one level represents the sequence of intentions and exhibits so-called high-level dynamics, while the other represents the sequence of actions performed while fulfilling a specific intention and displays so-called low-level dynamics.

Data that exhibits these complex dynamics following different patterns (intentions) can be observed in various domains. These patterns are commonly referred to as super states (Johnson and Willsky, 2013). In categorical-valued time-series – series of discrete values where the only known relation between different values is the temporal relation – these super states are observed as sub-sequences, called segments. Each time-series can be generated by multiple underlying super states. Therefore, consecutive observations within a segment possess low-level dynamics while transitions between super states, meaning transitions between segments, exhibit so-called high-level dynamics.

Modeling such data, with tasks such as identifying the number of super states and their dynamics within a dataset, is a challenging problem. The models need to be very flexible and thus become extremely complex very quickly. Bayesian nonparametric models successfully capture data exhibiting complex low-level dynamics (Fox et al., 2011; Beal and Krishnamurthy, 2012). The general idea is, again, to identify the underlying super states by grouping similar segments. Approaches that aim at capturing dynamics on different levels struggle with either efficiency (Fine et al., 1998) or flexibility. Nonetheless, such models are crucial to capture natural processes that possess both low- and high-level dynamics, like navigation strategies of users searching for information on the Web (West and Leskovec, 2012) or on Facebook (Paul et al., 2011), human activities of daily living (Duong et al., 2005), natural language (Lee et al., 2013), or motion recognition (Heller et al., 2009).

The goal of this paper is to develop an approach for the segmentation of categorical-valued time-series data that can be used for prediction and information-extraction tasks. Regarding the model, our requirements are as follows: (i) the algorithm should perform a multi-level analysis, covering at least two levels of the dynamics (e.g. number of intentions and their manifestations); (ii) the number of super states should be unbounded (e.g. one cannot set a bound on the number of intentions); (iii) it should focus on categorical-valued time-series data (sequences of arbitrary length); (iv) it should possess some predictive capabilities; and (v) it should yield results that are easy to interpret. The first three requirements relate to the segmentation task; the last two are equally important requirements for user understanding.

Requirement (ii) suggests using a Bayesian nonparametric treatment. Markov chains (MCs) address (iii) and guarantee a certain amount of predictive power (iv) as well as easily interpretable results (v) and a simple inference scheme. Additionally, combining both concepts allows us to perform a two-level analysis of the dynamics of the data (i). In this paper, we hence propose a Bayesian nonparametric mixture model where each mixture component is represented by an MC. The model thus learns two-level dynamics in an unsupervised fashion and represents each identified super state, encoding (relatively) stable low-level dynamics, by an MC.

The main goal of our research is to enhance both the prediction of future behavior and the understanding of the dynamics in the context of categorical-valued behavioral data by means of segmentation. Therefore, we evaluate the segmentation performance of our model on synthetic data, to understand its effectiveness and test it for extreme cases. Further, we apply our model to a novel task of user understanding, where we segment behavior traces of users on Facebook to understand their behavior and predict their next moves. Our empirical findings indicate that our model successfully identifies underlying patterns and can effectively be turned into a predictor for future observations.

2 Related Work

Two models that can naturally capture dynamics caused by multiple underlying super states are the standard and the infinite hierarchical hidden Markov models, jointly abbreviated (i)HHMM (Fine et al., 1998; Murphy and Paskin, 2002; Heller et al., 2009). Each hierarchy of an (i)HHMM is a separate hidden Markov model (HMM) with all observations situated in the leaves, called production states. Whereas the HHMM requires an a priori fixed number of levels for its hierarchy, the iHHMM allows for a potentially unbounded number that can grow with the data. Due to the unbounded depth of the hierarchy of HMMs, these models are highly flexible. Nonetheless, they are rather simple with respect to the structural information used: each hierarchy consists of HMMs without any further structural information incorporated. To the best of our knowledge, there exists no extension that incorporates additional structural information, due to the complicated and expensive inference in these models. The inference scheme of the classical model rendered the (i)HHMM inapplicable to real-world problems (Fine et al., 1998; Heller et al., 2009) until Wakabayashi and Miura (2012) developed a more efficient one. Because studies suggest that two-level analyses of dynamics are sufficient in many real-world applications (Oliver et al., 2004; Nguyen et al., 2005; Xie et al., 2003), related work simplifies the iHHMM by restricting the depth of the hierarchy while integrating additional structural information.

Stepleton et al. (2009) propose a model where the infinite HMM (iHMM) (Beal et al., 2001) is combined with a block-diagonal prior. The model assumes that the transition matrix of the iHMM has a nearly block-diagonal structure. It groups subsets of hidden states into blocks, generating an unbounded number of blocks. By modifying the Dirichlet process prior over the transitions, the model increases the transition probability between states within a block. Each block can be interpreted as a super state. However, the model cannot handle super states with overlapping categorical-valued state spaces. A similar idea, a bias towards self-transitions within a mixture component of the hierarchical Dirichlet process HMM (HDP-HMM) (Teh et al., 2006), is an essential part of the sticky HDP-HMM proposed by Fox et al. (2011). As in the block-diagonal iHMM, successive hidden states in this model tend to belong to the same state. Further, by augmenting the hidden states with an additional layer of states, the sticky HDP-HMM allows the conditional distribution of observations given the states to be treated nonparametrically. While the model is able to partition sequences into segments, it is not applicable to categorical-valued time-series, whose values only stand in temporal relations to each other. Furthermore, the model cannot capture any dynamics within a super state.

Studies by Johnson (2014) and Saeedi et al. (2016) explore the benefits of incorporating an explicit state-duration distribution instead of defining some bias towards specific transitions (Fox et al., 2011; Stepleton et al., 2009). Both approaches are Bayesian nonparametric models that apply a two-level analysis of the dynamics within the data. Whereas the model proposed by Johnson (2014) learns a distribution expressing the overall duration of a state, the segmented iHMM (siHMM) (Saeedi et al., 2016) models a state-duration distribution which expresses the probability of changing the current state, conditioned on the current observation and hidden state. Similar to the sticky HDP-HMM, both models cannot capture the dynamics within a super state. In general, none of the existing approaches fulfills all requirements, and only the (i)HHMM satisfies our requirements for segmentation (i-iii) without further adaptation.

Finally, Cadez et al. (2000) propose a finite mixture model of Markov chains (FMMC). While, due to its parametric nature, it is not flexible enough for segmentation, the concept behind this algorithm is similar to ours, i.e. a mixture of Markov chains.

Our model combines aspects of both concepts, i.e. it incorporates a bias towards self-transitions as well as a natural state-duration model by identifying the distribution over the start and end states of each super state. It features a simple inference scheme and fulfills the requirements, e.g. the obtained model inherently supports prediction tasks.

3 An Infinite Mixture Model of Markov Chains

In this section we present our main contribution: the infinite mixture model of Markov chains (IMMC). The model applies a two-level analysis to the dynamics of the data. Compared to the HHMM our approach contains a more detailed state transition model for the super states. The augmentation of both the observation- and the latent state layer results in a natural state duration model with state durations based on the structural information of the dynamics within a super state. Note that while this paper focuses on the intended use for categorical-valued time-series, such as user traces on online platforms, it is not restricted to these.

We now give a more formal description of the IMMC. Let $\Sigma$ denote a finite observation space and $\Sigma^{*}$ the set of all finite sequences over $\Sigma$. Then, $y^{(s)}$ denotes a finite sequence of observations from $\Sigma^{*}$, with $s$ as its index. To avoid cluttering the notation, we assume that a set $\mathbf{Y}$ of $S$ sequences of arbitrary length $T_{s}$ is given as a single concatenated sequence $\mathbf{y}$, in which the sequences from $\mathbf{Y}$ are separated by an auxiliary boundary symbol $B$. Therefore, the model can handle sequences of arbitrary length.
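
The concatenation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the example page names, and the choice to also place a boundary symbol before the first sequence are assumptions made here.

```python
from typing import List, Sequence, Tuple

BOUNDARY = "B"  # auxiliary boundary symbol; assumed not to occur in the data


def concatenate(sequences: Sequence[Sequence[str]]) -> List[str]:
    """Join all sequences into one stream, with BOUNDARY between and around them."""
    stream: List[str] = [BOUNDARY]  # leading boundary is an assumption of this sketch
    for seq in sequences:
        stream.extend(seq)
        stream.append(BOUNDARY)
    return stream


def transitions(stream: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair every symbol with its predecessor, giving (p_t, y_t) tuples."""
    return list(zip(stream[:-1], stream[1:]))


# Hypothetical toy data: two short sessions.
Y = [["login", "newsfeed", "like"], ["login", "logout"]]
y = concatenate(Y)
pairs = transitions(y)
```

Each observation then appears once as the end state and once as the start state of a transition, which matches the transition-based view of the observed layers used later in the model.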

The model comprises three key parts: (i) the underlying sequence of hidden states assigning an observation to a specific super state is modeled by an HDP-HMM (Fig. 1), the equivalent to the iHMM; (ii) the prior information that successive hidden states are more likely to originate from the same super state is expressed by a self-transition bias (as in Fox et al. (2011)); (iii) finally, to capture the MCs, we augment the layer of the observed states to not represent a single observed state, but transitions between successive observed states. The MCs represent the super states and generate successive sub-states which represent the segments within the sequence of observations. The entire graphical model is depicted in Figure 1 (right).

The HDP-HMM is an HMM combined with a nonparametric prior that is based on a two-level hierarchy of Dirichlet processes (DPs). A DP is a distribution over distributions. A sample from it, DP($\gamma,H$), can be generated by the ‘stick-breaking process’ of Sethuraman (1994). Here, $\gamma$ is called the concentration parameter and $H$ denotes the base measure. The ‘stick-breaking process’ simulates repeatedly breaking off a portion from the end of a stick. Thinking of the stick as the unit interval, repeatedly breaking off a portion generates a partitioning of the interval, resulting in an infinite set of sub-intervals. Given a positive $\gamma$, the process $\mbox{SBP}_{1}(\gamma)$ is defined as follows:

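$$\beta_{i}^{\prime}\sim\mbox{Beta}(1,\gamma),\qquad \beta_{i}=\beta_{i}^{\prime}\prod_{k=1}^{i-1}(1-\beta_{k}^{\prime}),\qquad i=1,2,\dots,\tag{1}$$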

where $\mbox{Beta}(\cdot)$ denotes the Beta distribution, $\beta_{i}^{\prime}$ is the fraction of the remaining stick to break off, and $\beta_{i}$ is the length of the resulting piece. $\tilde{\theta}_{i}$ denotes a realization of an i.i.d. draw from the base measure, $\tilde{\theta}_{i}\sim H$. A sample from a DP can then be obtained by

$$G=\sum_{k=1}^{\infty}\beta_{k}\,\delta_{\tilde{\theta}_{k}}.\tag{2}$$
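
As a rough illustration of the stick-breaking process $\mbox{SBP}_{1}(\gamma)$, the following sketch draws the first few weights $\beta_{i}$ using only the Python standard library. The truncation to a finite number of sticks, the function name, and the parameter values are assumptions of this example; the process itself is infinite.

```python
import random
from typing import List


def stick_breaking(gamma: float, n_sticks: int, rng: random.Random) -> List[float]:
    """Return the first n_sticks weights beta_i of SBP_1(gamma)."""
    weights: List[float] = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(n_sticks):
        frac = rng.betavariate(1.0, gamma)  # beta'_i ~ Beta(1, gamma)
        weights.append(frac * remaining)    # beta_i = beta'_i * prod_{k<i} (1 - beta'_k)
        remaining *= 1.0 - frac
    return weights


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
beta = stick_breaking(gamma=2.0, n_sticks=50, rng=rng)
```

The weights are non-negative and their sum approaches one as the number of sticks grows; a smaller $\gamma$ concentrates the mass on the first few pieces.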

In a hierarchical Dirichlet process (HDP), which consists of a two-level hierarchy of DPs, the realization of one DP $G$ is used as the base measure for all its subordinate DPs, DP($\alpha,G$). Therefore, these DPs represent distributions over distributions over the same categorical, finite space. Instead of applying Equations 1 and 2 recursively to sample realizations for both the base DP and its subordinates, Teh et al. (2006) propose an equivalent scheme that directly takes the sub-intervals $\beta_{i}$ as inputs for the ‘stick-breaking process’ of the subordinates. The modified process $\mbox{SBP}_{2}(\alpha,\beta)$ is given by

[TABLE]

Thus, Equations 1 and 3 are sufficient to realize samples from an HDP. By replacing the set of conditional finite mixture models of the HMM with an HDP, we obtain a nonparametric HMM with an unbounded state space. In the HDP-HMM, fast switching between redundant states slows mixing rates and may decrease predictive performance (Fox et al., 2011). To address this problem, we make use of the mechanism proposed by Fox et al. (2011): Equation 3 is slightly modified to incorporate a bias towards self-transitions of states,

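$$\pi_{j,\cdot}\sim\mbox{SBP}_{2}\left(\alpha+\kappa,\frac{\alpha\beta+\kappa\delta_{j}}{\alpha+\kappa}\right),$$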

where $\kappa>0$ is the amount added to the $j$th component and $\bm{\beta}\sim\mbox{SBP}_{1}(\gamma)$.

The model consists of four layers of states: the hidden states $\mathbf{z}$ and $\mathbf{\omega}$, the observed sub-states $\mathbf{p}$, and the partly observed sub-states $\mathbf{y}$. The hidden state $z_{t}$ represents the active super state at time-step $1\leq t\leq T$. The hidden state $\omega_{t}$ is either $0$ or $1$ and signals the continuation or end of a segment, respectively. Finally, the two sub-state layers, $p_{t}$ and $y_{t}$, represent the transition from $p_{t}$, the sub-state of the previous time-step, to the current sub-state $y_{t}$. The reason for modeling the transition of sub-states is to identify the dynamics within each super state. State $y_{t}$ is defined as partly observed because we assume that information about the end of a segment is missing in the data. Given the goal of segmentation, this assumption is necessary.

The resulting generative process is then as follows

[TABLE]

where $\mbox{Mu}(\cdot)$ denotes the Multinomial distribution, $\mbox{Ber}(\cdot)$ the Bernoulli distribution, and $\Sigma$ the finite, categorical-valued sub-state space with cardinality $|\Sigma|$.

Note that, due to the interpretation of the observed layers, we process each observation not once but twice, i.e. once as the start state and once as the end state of a transition.

The resulting graphical model is depicted in Figure 1.

A Blocked Gibbs Sampler

In this section we present a truncated blocked Markov chain Monte Carlo (MCMC) HDP sampling algorithm, similar to the one Fox et al. (2011) propose, to optimize the parameters of our model.

Fox et al. (2011) show that a truncated blocked Gibbs sampler allows the hidden states to be sampled jointly and exploits the Markovian structure. The joint mechanism obtains faster mixing rates than, for instance, a direct assignment sampler. To sample distributions of theoretically infinite cardinality, we make use of the degree-$L$ weak limit approximation (Ishwaran and Zarepour, 2002), where $L$ denotes the maximum cardinality of the approximated distribution. It follows that, in practice, $L$ needs to exceed the number of true mixture components. Thus, a DP is approximated by a Dirichlet distribution, $\mbox{Dir}(\alpha/L,\dots,\alpha/L)$. Note that this approximation is commonly used because it allows a simple and more efficient computation (see Fox et al., 2011). Kurihara et al. (2007) found little to no practical difference from an inference scheme using no truncation.
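
To make the weak limit approximation more concrete, the sketch below draws a truncated global weight vector and sticky transition rows with the Python standard library. It follows the finite Dirichlet form commonly used for the sticky HDP-HMM, $\bm{\beta}\sim\mbox{Dir}(\gamma/L,\dots,\gamma/L)$ and $\pi_{j,\cdot}\sim\mbox{Dir}(\alpha\bm{\beta}+\kappa\delta_{j})$; that this matches the initialization used in this paper is an assumption, and all function names and parameter values are hypothetical.

```python
import random
from typing import List, Tuple


def dirichlet(alphas: List[float], rng: random.Random) -> List[float]:
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) variates."""
    # The small floor keeps the shape parameter strictly positive.
    draws = [rng.gammavariate(max(a, 1e-12), 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_weak_limit(L: int, gamma: float, alpha: float, kappa: float,
                      rng: random.Random) -> Tuple[List[float], List[List[float]]]:
    """Degree-L approximation: global weights beta and sticky rows pi_j."""
    beta = dirichlet([gamma / L] * L, rng)
    pi = []
    for j in range(L):
        # kappa is added to the j-th component only (self-transition bias).
        row = [alpha * beta[i] + (kappa if i == j else 0.0) for i in range(L)]
        pi.append(dirichlet(row, rng))
    return beta, pi


rng = random.Random(1)  # arbitrary seed
beta, pi = sample_weak_limit(L=5, gamma=2.0, alpha=1.0, kappa=5.0, rng=rng)
```

With a large $\kappa$ relative to $\alpha$, the diagonal entries of the sampled rows tend to dominate, which is the intended bias towards staying in the same super state.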

The prior distributions $\bm{\beta}$, $\bm{\pi}$, $\bm{\psi}$, and $\bm{\theta}$ are initialized by

[TABLE]

where $1\leq i\leq L$, $K=|\Sigma|+1$, and $1\leq k\leq K$.

To update the prior distributions after each iteration, we have to keep track of both state and sub-state transitions. Therefore, $d_{i}$ stores the number of sub-states assigned to each super state $i$, and $G_{i,k_{1},k_{2}}$ records the number of transitions within super state $i$, where $k_{1}$ and $k_{2}$ represent the row and column of the transition matrix, i.e. the sub-states of the previous and current time-step, respectively. Finally, $n_{i_{1},i_{2}}$ keeps track of the transitions between super states $i_{1}$ and $i_{2}$. For each iteration, the auxiliary variables document the assignment step.
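
The bookkeeping for $d_{i}$, $G_{i,k_{1},k_{2}}$, and $n_{i_{1},i_{2}}$ could look roughly as follows. This is a simplified sketch: it attributes every transition $(p_{t},y_{t})$ to the super state $z_{t}$ and counts every consecutive pair of super states, which may differ from the exact treatment of segment boundaries in the authors' sampler. All names and the toy assignment are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def count_statistics(z: List[int], pairs: List[Tuple[str, str]]):
    """Collect d[i], G[(i, k1, k2)], and n[(i1, i2)] from one assignment of z."""
    d: Counter = Counter()  # sub-states assigned to each super state
    G: Counter = Counter()  # sub-state transitions within a super state
    n: Counter = Counter()  # transitions between consecutive super states
    for t, (state, (prev_sym, cur_sym)) in enumerate(zip(z, pairs)):
        d[state] += 1
        G[(state, prev_sym, cur_sym)] += 1
        if t > 0:
            n[(z[t - 1], state)] += 1
    return d, G, n


def column_count(G: Counter, i: int, j: str) -> int:
    """G_{i, ., j}: how often sub-state j was entered within super state i."""
    return sum(c for (state, _, cur), c in G.items() if state == i and cur == j)


# Hypothetical assignment of four transitions to two super states.
z = [0, 0, 1, 1]
pairs = [("B", "a"), ("a", "b"), ("b", "c"), ("c", "B")]
d, G, n = count_statistics(z, pairs)
```

Using `Counter` keeps unseen combinations at zero without pre-allocating an $L\times K\times K$ array, which is convenient for a sketch but slower than dense arrays for large data.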

Sampling $z_{t}$

We obtain a realization of the hidden states $z_{t}$ by adapting the Baum-Welch algorithm. For the first pass, applying the algorithm backward in time, from the last to the first observation of the input sequence, we obtain the backward probabilities $m_{t,t-1}$:

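$$m_{T+1,T}(i)=1,\qquad m_{t,t-1}(i)=\begin{cases}m_{t+1,t}(i)&\text{if }r_{t}=B;\\ m_{t+1,t}(i)\cdot\beta_{i}\cdot\theta_{i,p_{t},B}&\text{if }y_{t}=B;\\ \Omega_{t,i}&\text{otherwise}.\end{cases}$$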

At the beginning and end of a new sequence ($r_{t}=B$ and $y_{t}=B$) the message of the successive time-step is passed backward. In case of the latter it is weighted with the likelihood of seeing the beginning of a segment instantiated by super state $i$ given $p_{t}$, the observed state of the previous time-step. Within a sequence, the algorithm has to account for both intra- and inter-transitions. Here, intra-transitions account for sub-state transitions within a super state, and inter-transitions for state transitions between super states. Therefore, $\Omega_{t,i}$ computes the likelihood of an intra-transition,

$$L_{t,i}^{intra}=\theta_{i,p_{t},y_{t}}\cdot\psi_{i,p_{t}},$$

as well as the probability of an inter-transition,

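$$L_{t,i}^{inter}=[\beta_{i}\cdot\psi_{i,B}\cdot\pi_{i,j}]\cdot[\psi_{i,r_{t}}\cdot\theta_{i,r_{t},B}\cdot\theta_{j,B,y_{t}}].$$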

Here, the first part, $\beta_{i}\cdot\psi_{i,B}\cdot\pi_{i,j}$, represents the prior probability of observing an inter-transition from super state $i$ to $j$. The likelihood of the inter-transition is then expressed by $\psi_{i,r_{t}}\cdot\theta_{i,r_{t},B}\cdot\theta_{j,B,y_{t}}$.

Given both probabilities, $\Omega_{t,i}$ is computed as follows,

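$$\Omega_{t,i}\propto\sum_{j=1}^{L}\left(L_{i,j}^{intra}\cdot\mathbb{I}(i=j)+L_{i,j}^{inter}\right)\cdot m_{t+1,t}(j).$$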
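
The following sketch shows how one within-sequence backward message $\Omega_{t,i}$ might be assembled from the intra- and inter-transition terms described above. Several points are assumptions of this illustration: $r_{t}$ is taken to be the previous sub-state $p_{t}$, the proportionality constant is fixed by normalizing over $i$, and the parameters are stored as nested dictionaries with hypothetical names and toy values.

```python
from typing import Dict, List

B = "B"  # boundary symbol


def intra(i: int, p_t: str, y_t: str, theta, psi) -> float:
    """Likelihood of moving from p_t to y_t inside super state i."""
    return theta[i][p_t][y_t] * psi[i][p_t]


def inter(i: int, j: int, r_t: str, y_t: str, beta, pi, theta, psi) -> float:
    """Prior times likelihood of leaving super state i at r_t and entering j at y_t."""
    prior = beta[i] * psi[i][B] * pi[i][j]
    likelihood = psi[i][r_t] * theta[i][r_t][B] * theta[j][B][y_t]
    return prior * likelihood


def omega_message(p_t: str, y_t: str, m_next: List[float],
                  beta, pi, theta, psi) -> List[float]:
    """Combine both transition types with the message of the next time-step."""
    L = len(beta)
    raw = []
    for i in range(L):
        total = 0.0
        for j in range(L):
            term = inter(i, j, p_t, y_t, beta, pi, theta, psi)
            if i == j:  # intra-transitions only apply when the super state is kept
                term += intra(i, p_t, y_t, theta, psi)
            total += term * m_next[j]
        raw.append(total)
    norm = sum(raw)
    return [v / norm for v in raw]


# Toy parameters: two super states with uniform chains over {a, b, B}.
symbols = ["a", "b", B]
theta = [{p: {c: 1.0 / 3 for c in symbols} for p in symbols} for _ in range(2)]
psi = [{s: 1.0 / 3 for s in symbols} for _ in range(2)]
beta = [0.5, 0.5]
pi = [[0.9, 0.1], [0.1, 0.9]]
omega = omega_message("a", "b", [1.0, 1.0], beta, pi, theta, psi)
```

Because the toy parameters are symmetric in the two super states, both entries of `omega` come out equal; with learned, distinct chains the message would favor the super state whose dynamics explain the transition better.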

In the forward pass of the Baum-Welch algorithm, we have to compute the state-probability at each time-step $t$ conditioned on the hidden state of the previous time-step $z_{t-1}$, the state transition indicator $\omega_{t-1}$, and the backward probabilities $\bm{m_{t+1,t}}$. Therefore, we first have to compute

$$\omega_{t}\sim\mbox{Ber}\left(\frac{L_{t}^{intra}}{\sum_{j=1}L_{t,j}^{inter}+L_{t}^{intra}}\right),$$

where

[TABLE]

If $y_{t}$ or $p_{t}$ is the boundary state, $\omega_{t}$ is set to $0$ or $1$, respectively.

Given a realization of $\omega_{t}$, we can compute the probability distribution over the latent states at time-step $t$ by

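$$p(z_{t}\mid\bullet)\propto\begin{cases}\mathbb{I}(z_{t}=z_{t-1})&\mbox{if }y_{t}=B\mbox{ or }\omega_{t-1}=0\\ \rho_{t}\cdot m_{t+1,t}(z_{t})&\mbox{if }p_{t}=B\mbox{ or }\omega_{t-1}=1\end{cases}$$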

with

[TABLE]

Finally, the assignments are sampled from the computed probability distribution for $z_{t}$,

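$$z_{t}\sim\mbox{Mu}\left(\frac{\sum_{i=1}^{L}p(z_{t}=i\mid\bullet)\,\mathbb{I}(z_{t}=i)}{\sum_{i=1}^{L}p(z_{t}=i\mid\bullet)}\right).$$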
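
The final sampling step amounts to a draw from a normalized categorical distribution. The sketch below illustrates the two cases of the forward pass: the super state is kept when the segment continues, and otherwise a new one is drawn with weights proportional to $\rho_{t}\cdot m_{t+1,t}(i)$. Passing $\rho_{t}$ in as a ready-made vector, the flag name `continues`, and the function name are assumptions of this example.

```python
import random
from typing import List


def sample_z(z_prev: int, continues: bool, rho: List[float],
             m_next: List[float], rng: random.Random) -> int:
    """Draw the super state z_t given the previous state and the segment indicator."""
    if continues:
        # Indicator case: all probability mass sits on the previous super state.
        return z_prev
    weights = [r * m for r, m in zip(rho, m_next)]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(weights)), weights=weights)[0]


rng = random.Random(42)  # arbitrary seed
kept = sample_z(3, True, [0.2, 0.8], [1.0, 1.0], rng)
forced = sample_z(0, False, [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], rng)
```

Leaving the normalization to `random.choices` avoids an explicit division and therefore also avoids dividing by very small sums when the messages are poorly scaled.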

During the sampling process, the auxiliary variables keep track of the sufficient statistics to update the prior distributions afterwards. Given a realization of $\bm{z}$, the prior distributions of the parameters are updated accordingly,

[TABLE]

where $G_{i,\cdot,j}$ denotes the count of element $j\in\Sigma$ in super state $i$, $G_{i,\cdot,j}=\sum_{k=1}^{K}G_{i,k,j}$.

Algorithm 1 summarizes the entire blocked Gibbs sampler.

4 Experiments

4.1 Synthetic Data

We evaluate the segmentation performance of our model to understand its effectiveness and test it for extreme cases. (Footnote 1: The source code is available at: <anonymized>.) Therefore, we apply our model to three synthetic test cases which consist of generating processes, each emulating a different super state.

These test cases differ primarily in their level of difficulty of identifying the processes (super states) correctly, with test case I being the least difficult and test case III the most difficult. Specifically, test case I is comprised of processes with no overlapping state spaces, meaning each state belongs to exactly one super state, while test case III features processes of completely overlapping state spaces that only differ in their inner dynamics, with test case II presenting both scenarios. Given the synthetic nature of our test we can accurately evaluate the segmentation performance of our approach.

For each test case, we generate three synthetic data sets to assess the performance of the algorithm with different amounts of data. Figure 2 shows the generative processes of test case III where states are indexed by hexadecimal numbers. Realizations of these processes are sampled as segments and combined into sequences. The data sets are comprised of a set of these sequences which sum up to a total amount of $2,500$, $25,000$, and $250,000$ observations, respectively.

Segmentation performance.

In order to assess the segmentation performance of our algorithm, we evaluate the precision of identifying the processes and observation assignments of the synthetic data sets. As a baseline, we compare our results against those of the HHMM (Wakabayashi and Miura, 2012). HHMMs are a logical choice as they fulfill our requirements for the segmentation (i-iii). Additionally, we also consider FMMC (Cadez et al., 2000) as a baseline due to its close proximity in concept to IMMC. This approach represents a parametric interpretation of mixture models of Markov chains. Due to its lack of flexibility, it is unable to actually segment sequences; it rather clusters them. Thus, we provide information on segment boundaries to this baseline. We ran the algorithm ten times with varying cluster initializations for each recorded result of our algorithm and the HHMM, and only selected the best result of FMMC for comparison. For HHMM we performed a grid-search to determine the optimal size of the state space. Each HHMM model was trained for $1000$ iterations. Notably, we achieved the best results with a larger state space than the actual one. For IMMC we report results based on $250$ iterations with a burn-in phase of $250$ iterations. All results are reported as the average of $10$ recorded runs.

Table 1 depicts error rates for the segmentation task. Even though FMMC has additional information, our approach outperforms it in both test case I and III. It seems that the provided segment boundary information is even more vital in test case II than in the other ones, as the first super state (see Figure 3) consists of two loosely connected sub-graphs.

IMMC performs equally well over all data sets of any specific test case. Its performance seems unaffected by the amount of data provided.

For test case I, whose purpose is to evaluate the basic segmentation ability, the HHMM achieves a perfect result, closely followed by our algorithm. While scoring the perfect result on both the small- and mid-sized data set, the HHMM struggles with the large data set where its performance drops drastically.

Test case II demands a segmentation based on not only the distribution over sub-state spaces, but also on the dynamics within a super state. While the performance of our algorithm only slightly decreases (error rates of $3.35\%$ / $3.17\%$ / $3.21\%$), the HHMM struggles with the more detailed segmentation task, with error rates of $8.99\%$ / $14.28\%$ / $11.13\%$, respectively. A reason that our algorithm performs worse than the FMMC here (this is also IMMC's poorest performance over all data sets) seems to be the design of the generating process (Figure 3, first from left), which contains two loosely connected sub-graphs.

Test case III heavily focuses on segmentation based on the inherent dynamics of the super states. Again, our algorithm outperforms the HHMM, scoring almost perfect accuracy results. In general, the results confirm the ability of HHMMs to perform basic segmentation tasks. Nonetheless, the algorithm seems to struggle with an increased complexity in the data induced by increasing the amount of data, as well as with segmentation tasks demanding distinction by both sub-state space distributions and dynamics within super states.

Our algorithm performs consistently well over all test cases and data set sizes. On the data sets of test case III, it significantly outperforms both the HHMM and the FMMC. Of further note is the insensitivity of our model to the choice of hyper-parameters, meaning rule-of-thumb adjustments should suffice. For the evaluations we used the same hyper-parameter values for all test cases over all data sets.

4.2 User Navigation on Facebook

The data set for the next evaluation contains user navigation data from Facebook (Paul et al., 2011). For each user, the invoked pages are recorded and grouped into sessions. Examples for such invoked pages are ‘Login’, ‘Newsfeed’, ‘Load more news’, ‘Like’, etc. The dataset contains $152$ unique invoked pages, $49,479$ sessions of $2,749$ users, and $8,197,308$ observations. Every session is interpreted as a sequence of observations.

Prediction performance.

To show the applicability of IMMC in the context of real-world applications, we measure its prediction performance on the Facebook data set. Therefore, we split the Facebook data into a training and a test set, using $90\%$ of the data for training and $10\%$ for testing. Furthermore, we cut each sequence of the test set at a randomly sampled position $c$ and use the sub-sequence $y^{(s)}_{1:c}$ as input to the model. The ground truth for the prediction is the observation at position $c+1$. This situation simulates the prediction of future observations in a sequence given only past and present observations. For the prediction process, we learn a model of the underlying super states given the training set. Conditioned on the observed sub-sequence, $y^{(s)}_{1:c}$, we compute the MAP estimate of the next state of the sequence based on the likelihoods of all super states and the transition probabilities within each super state from the most recent state to all possible future states.
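
One way to read the prediction step is as a weighted vote of the learned Markov chains. The sketch below scores every candidate next state by summing, over super states, the weight of the super state times its transition probability from the most recent state, and returns the maximizer. How the super state weights are obtained from the observed prefix is left open here and passed in as an input; the function name, the page names, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def predict_next(last: str, weights: List[float],
                 theta: List[Dict[str, Dict[str, float]]],
                 symbols: List[str]) -> Tuple[str, Dict[str, float]]:
    """Return the highest-scoring next state and the scores of all candidates."""
    scores = {
        s: sum(w * chain[last].get(s, 0.0) for w, chain in zip(weights, theta))
        for s in symbols
    }
    best = max(scores, key=scores.get)
    return best, scores


# Two hypothetical super states with different dynamics after 'login'.
theta = [
    {"login": {"newsfeed": 0.7, "like": 0.3}},
    {"login": {"newsfeed": 0.1, "like": 0.9}},
]
weights = [0.8, 0.2]  # assumed weights of the super states given the prefix
prediction, scores = predict_next("login", weights, theta, ["newsfeed", "like"])
```

In this toy setting the first super state dominates the weights, so its preferred successor wins even though the second chain strongly prefers the other page.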

The evaluation is performed three times: once on the entire data set, once on $10\%$ of the data, and once on $1\%$ of it. To show the influence of the detailed partition our algorithm applies to the data, we compare it to FMMC and to global Markov models of different orders. For FMMC we performed a grid-search to find the optimal size of the state space, i.e. the optimal number of MCs.

Whereas the MMs (order $\leq 9$) achieved an accuracy of $\approx 1.0\%$ on the entire data set, FMMC predicts $9.84\%$ of the cases correctly. Our algorithm, which can be seen as a more flexible version of FMMC, significantly outperforms both. It yields a model with a prediction accuracy of $61.41\%$ on the entire data set and slightly lower accuracies on the smaller data sets, namely $57.81\%$ and $54.34\%$ on $10\%$ and $1\%$ of the data, respectively.

Runtime.

Another important advantage is the computational efficiency of our algorithm compared to HHMMs. When evaluating the prediction performance of both algorithms on the Facebook data set, we noticed the significantly higher runtime of the HHMM. The evaluations were performed on a PC with an Intel Core i5-6600K CPU @ 3.50GHz with 4 cores, 32GB of RAM, an SSD, and a $64$-bit system. While both algorithms had almost identical computation times of less than a second for a single iteration on the mid-sized synthetic data sets, we terminated the computation of a single HHMM iteration on the entire Facebook data set after several days. IMMC computed an iteration on the same data set in $\approx 5,379$ s. Computation times per iteration on $1\%$ of the Facebook data were $\approx 2,985$ s and $\approx 22$ s for the HHMM and our algorithm, respectively.

Given the unreliability of HHMMs with a low number of iterations and the impractically high runtime of a sufficient number of iterations ($1\%$ of the Facebook data at 1000 iterations: $>30$ days), we were unable to even match the prediction performance of simple MMs with the HHMM.
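The runtime figures can be cross-checked with a short calculation, assuming that the per-iteration times on $1\%$ of the Facebook data reported above stay roughly constant across iterations (an assumption made here for illustration only):

```latex
\[
\frac{2{,}985\ \text{s}}{22\ \text{s}} \approx 136,
\qquad
1000 \times 2{,}985\ \text{s} = 2{,}985{,}000\ \text{s}
= \frac{2{,}985{,}000}{86{,}400}\ \text{days} \approx 34.5\ \text{days}.
\]
```

Under this assumption, the HHMM is roughly two orders of magnitude slower per iteration on this subset, and 1000 iterations would take about 34.5 days, which is consistent with the stated bound of more than 30 days.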

In addition to the fast computation times of our algorithm, we also observe fast convergence. On the synthetic data set, the results had largely converged after only $40$ iterations (plus $40$ burn-in iterations), while the learning process could be terminated on the Facebook data set after only $20$ iterations (plus $20$ burn-in iterations). The code of the HHMM was provided by Wakabayashi and Miura (2012).

Interpretability.

Finally, we demonstrate how the model can be applied to information extraction tasks. This is especially useful for tasks that come with little or no prior knowledge. Being a nonparametric model that adjusts its complexity to the data, our approach is a promising candidate for such tasks. Additionally, representing clusters by Markov models makes it easy to interpret the resulting segments. Figure 4 depicts three frequently observed behavioral patterns of users on Facebook. (1) shows a user checking for updates on the newsfeed or waiting for new messages. The user activates the Facebook tab and, without any further activity, deactivates it shortly after. (2) represents users communicating with each other. (3) shows users who are interested in updates from their friends. After activating the Facebook tab, scrolling the newsfeed, and visiting specific newsfeed entries, users deactivate the tab again. These types of segments represent user behavior focused on specific tasks. Our results give detailed insight into how users interact on Facebook.

5 Conclusion

We presented a Bayesian nonparametric approach to perform a two-level analysis of the dynamics in categorical-valued time series. By interpreting the two levels as the hidden states of an unbounded mixture model and the super states represented by Markov chains as its mixture components, our model showed significant improvements over related approaches when analyzing categorical-valued time series. We obtained a natural state-duration model by augmenting both the hidden and the observed layer of states. The resulting increase in detail allowed us to capture state durations based on the dynamics of the super states. Furthermore, by representing each super state by a Markov chain, we obtained a model that yields easily interpretable low-level dynamics of the super states and achieves high prediction accuracy. Thus, the model is inherently applicable to prediction and information extraction tasks.

Bibliography

1. Beal and Krishnamurthy (2012). Matthew Beal and Praveen Krishnamurthy. Gene expression time course clustering with countably infinite hidden Markov models. arXiv preprint arXiv:1206.6824, 2012.
2. Beal et al. (2001). Matthew J Beal, Zoubin Ghahramani, and Carl E Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pages 577–584, 2001.
3. Cadez et al. (2000). Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, and Steven White. Visualization of navigation patterns on a web site using model-based clustering. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 280–284, 2000.
4. Duong et al. (2005). Thi V. Duong, Hung H. Bui, Dinh Q. Phung, and Svetha Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-Markov model. In Computer Vision and Pattern Recognition, 2005, volume 1, pages 838–845. IEEE, 2005.
5. Fine et al. (1998). Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1):41–62, 1998.
6. Fox et al. (2011). Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics, pages 1020–1056, 2011.
7. Heller et al. (2009). Katherine A. Heller, Yee W. Teh, and Dilan Görür. Infinite hierarchical hidden Markov models. In International Conference on Artificial Intelligence and Statistics, pages 224–231, 2009.
8. Ishwaran and Zarepour (2002). Hemant Ishwaran and Mahmoud Zarepour. Exact and approximate sum representations for the Dirichlet process. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pages 269–283, 2002.

Sampling $z_{t}$