Hierarchical Continuous Time Hidden Markov Model, with Application in Zero-Inflated Accelerometer Data

Zekun Xu; Eric B. Laber; Ana-Maria Staicu

arXiv:1812.01162·stat.CO·June 12, 2020

TL;DR

This paper introduces a hierarchical continuous-time hidden Markov model tailored for high-dimensional, zero-inflated accelerometer data, enabling extraction of meaningful activity patterns to inform health-related decisions.

Contribution

It presents a novel flexible model with an efficient estimation algorithm and bootstrap-based interval estimation for analyzing complex accelerometer data.

Findings

01

Successfully applied to NHANES data sets

02

Effectively captures activity patterns with zero-inflation

03

Provides reliable interval estimates

Abstract

Wearable devices including accelerometers are increasingly being used to collect high-frequency human activity data in situ. There is tremendous potential to use such data to inform medical decision making and public health policies. However, modeling such data is challenging as they are high-dimensional, heterogeneous, and subject to informative missingness, e.g., zero readings when the device is removed by the participant. We propose a flexible and extensible continuous-time hidden Markov model to extract meaningful activity patterns from human accelerometer data. To facilitate estimation with massive data we derive an efficient learning algorithm that exploits the hierarchical structure of the parameters indexing the proposed model. We also propose a bootstrap procedure for interval estimation. The proposed methods are illustrated using data from the 2003 - 2004 and 2005 - 2006…

Tables

Table 1: The bias and standard error for the estimated population and subgroup-specific parameters in HCTHMM. The metric is Euclidean norm for population parameters, and Frobenius norm for subgroup-specific parameter vectors.

Parameter	n=20, Bias (s.e.)	n=200, Bias (s.e.)
Population parameters:
Slope for State 1 zero odds	.0021 (.0422)	.0032 (.0134)
Slope for State 1 Poisson mean	.0001 (.0034)	.0001 (.0011)
Slope for State 2 Poisson mean	.0016 (.0053)	.0019 (.0018)
Slope for State 3 Poisson mean	.0036 (.0120)	.0046 (.0044)
Subgroup-specific parameters:
Initial probabilities (Male)	.0169 (.0022)	.0165 (.0008)
Initial probabilities (Female)	.0152 (.0069)	.0284 (.0038)
Transition rates (Male)	.0038 (.0029)	.0011 (.0006)
Transition rates (Female)	.0099 (.0089)	.0016 (.0012)

Table 2: Comparison of the mean coverage probability of the 95% and 99% bootstrap confidence intervals for the mean proportion of time in each latent state between the subject-specific HMM and the HCTHMM (n = 200).

State	Subject-specific HMM (Male)	Subject-specific HMM (Female)	HCTHMM (Male)	HCTHMM (Female)
95% C.I.:
1	.942	.942	.954	.952
2	.944	.938	.960	.950
3	.946	.936	.944	.962
99% C.I.:
1	.982	.988	.986	.994
2	.986	.984	.986	.988
3	.978	.980	.994	.992

Table 3: The definition of a missing interval, in consecutive minutes of zero counts, used in the literature on human activity.

Literature	Definition of missing
Cradock et al. (2004)	30 minutes
Catellier et al. (2005)	20 minutes
Troiano et al. (2008)	60 minutes
Robertson et al. (2010)	20 minutes
Evenson (2011)	20 minutes
Schmid et al. (2015)	60 minutes
Lee and Gill (2016)	20 minutes

Table 4: Summary of BIC from model selection. In type I models, all parameters are subject-specific. In type II models, all parameters are subject-specific except the slopes, which are population parameters. In type III models, the intercepts are subject-specific; the slopes are population parameters; the initial probabilities and transition rates are subgroup-specific. In type IV models, all parameters are subgroup-specific except the intercepts, which are subject-specific.

Model specifications	BIC
5 states, type I	248,009,082
6 states, type I	202,804,081
7 states, type I	203,261,150
6 states, type II	200,457,217
6 states, type III	198,808,738
6 states, type IV	198,807,080

Equations

$$g_{i,1}(y;\mathbf{x})=\delta_{i}(\mathbf{x})\,1_{y=0}+\{1-\delta_{i}(\mathbf{x})\}\frac{\{\lambda_{i,1}(\mathbf{x})\}^{y}\exp\{-\lambda_{i,1}(\mathbf{x})\}}{y!}.$$

$$g_{i,m}(y;\mathbf{x})=\frac{\{\lambda_{i,m}(\mathbf{x})\}^{y}\exp\{-\lambda_{i,m}(\mathbf{x})\}}{y!}.$$

$$\log\left\{\frac{\delta_{i}(\mathbf{x})}{1-\delta_{i}(\mathbf{x})}\right\}=b_{i,0,0}+\mathbf{b}_{i,0,1}^{\prime}\mathbf{x},\qquad \log\{\lambda_{i,m}(\mathbf{x})\}=b_{i,m,0}+\mathbf{b}_{i,m,1}^{\prime}\mathbf{x}.$$

$$\alpha_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i})\triangleq P\big\{S_{i}(T_{k})=m,\,Y_{i}(T_{1})=y_{T_{1}},\ldots,Y_{i}(T_{k})=y_{T_{k}}\;\big|\;\mathbf{X}_{i}(T_{1})=\mathbf{x}_{T_{1}},\ldots,\mathbf{X}_{i}(T_{k})=\mathbf{x}_{T_{k}}\big\}.$$

$$\alpha_{i}^{T_{1}}(m;\boldsymbol{\theta}_{i})=\pi_{i}(m)\,g_{i,m}(y_{T_{1}};\mathbf{x}_{T_{1}},\boldsymbol{\theta}_{i}),$$

$$\alpha_{i}^{T_{k+1}}(m;\boldsymbol{\theta}_{i})=\Big\{\sum_{\ell=1}^{M}\alpha_{i}^{T_{k}}(\ell;\boldsymbol{\theta}_{i})\,P_{i}^{T_{k+1}-T_{k}}(\ell,m)\Big\}\,g_{i,m}(y_{T_{k+1}};\mathbf{x}_{T_{k+1}},\boldsymbol{\theta}_{i}),$$

$$\beta_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i})\triangleq P\big\{Y_{i}(T_{k+1})=y_{T_{k+1}},\ldots,Y_{i}(T_{K})=y_{T_{K}}\;\big|\;S_{i}(T_{k})=m,\,\mathbf{X}_{i}(T_{k+1})=\mathbf{x}_{T_{k+1}},\ldots,\mathbf{X}_{i}(T_{K})=\mathbf{x}_{T_{K}}\big\}.$$

$$\beta_{i}^{T_{K}}(m;\boldsymbol{\theta}_{i})=1,$$

$$\beta_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i})=\sum_{\ell=1}^{M}P_{i}^{T_{k+1}-T_{k}}(m,\ell)\,g_{i,\ell}(y_{T_{k+1}};\mathbf{x}_{T_{k+1}},\boldsymbol{\theta}_{i})\,\beta_{i}^{T_{k+1}}(\ell;\boldsymbol{\theta}_{i}),$$

$$\gamma_{i}^{t}(m;\boldsymbol{\theta}_{i})=\frac{\alpha_{i}^{t}(m;\boldsymbol{\theta}_{i})\,\beta_{i}^{t}(m;\boldsymbol{\theta}_{i})}{\sum_{\ell=1}^{M}\alpha_{i}^{t}(\ell;\boldsymbol{\theta}_{i})\,\beta_{i}^{t}(\ell;\boldsymbol{\theta}_{i})},$$

$$\phi_{j}(m;\boldsymbol{\theta})=\frac{1}{n_{j}}\sum_{i:\,G_{i}=j}\eta_{i}(m;\boldsymbol{\theta}_{i}),\qquad\text{where}\quad \eta_{i}(m;\boldsymbol{\theta}_{i})=\frac{1}{K}\sum_{k=1}^{K}\gamma_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i}).$$

$$\min_{\boldsymbol{\theta}}\;f(\boldsymbol{\theta})\quad\text{s.t.}\quad\mathbf{D}\boldsymbol{\theta}=0.$$

$$\min_{\boldsymbol{\theta},\mathbf{z}}\;f(\boldsymbol{\theta})\quad\text{s.t.}\quad\mathbf{A}_{i}\boldsymbol{\theta}_{i}=\mathbf{B}_{i}\mathbf{z},\quad i=1,\ldots,n,$$

$$L_{\rho}(\boldsymbol{\theta},\mathbf{z},\boldsymbol{\xi})=\sum_{i=1}^{n}\Big\{f_{i}(\boldsymbol{\theta}_{i})+\boldsymbol{\xi}_{i}^{\prime}(\mathbf{A}_{i}\boldsymbol{\theta}_{i}-\mathbf{B}_{i}\mathbf{z})+\frac{\rho}{2}\|\mathbf{A}_{i}\boldsymbol{\theta}_{i}-\mathbf{B}_{i}\mathbf{z}\|_{2}^{2}\Big\},$$

$$\boldsymbol{\xi}=[\boldsymbol{\xi}_{1}^{\prime},\ldots,\boldsymbol{\xi}_{n}^{\prime}]^{\prime},\qquad \mathbf{A}=\begin{bmatrix}\mathbf{A}_{1}&0&\cdots&0\\0&\mathbf{A}_{2}&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\mathbf{A}_{n}\end{bmatrix},\qquad \mathbf{B}=\begin{bmatrix}\mathbf{B}_{1}\\\mathbf{B}_{2}\\\vdots\\\mathbf{B}_{n}\end{bmatrix}.$$

$$\boldsymbol{\theta}\text{-update:}\quad \nabla f(\tilde{\boldsymbol{\theta}}_{n}^{(v+1)})+\mathbf{A}^{\prime}\tilde{\boldsymbol{\xi}}_{n}^{(v)}+\rho\,\mathbf{A}^{\prime}(\mathbf{A}\tilde{\boldsymbol{\theta}}_{n}^{(v+1)}-\mathbf{B}\tilde{\mathbf{z}}_{n}^{(v)})=0,$$

$$\mathbf{z}\text{-update:}\quad \mathbf{B}^{\prime}\tilde{\boldsymbol{\xi}}_{n}^{(v)}+\rho\,\mathbf{B}^{\prime}(\mathbf{A}\tilde{\boldsymbol{\theta}}_{n}^{(v+1)}-\mathbf{B}\tilde{\mathbf{z}}_{n}^{(v+1)})=0,$$

$$\boldsymbol{\xi}\text{-update:}\quad \tilde{\boldsymbol{\xi}}_{n}^{(v+1)}-\tilde{\boldsymbol{\xi}}_{n}^{(v)}-\rho\,(\mathbf{A}\tilde{\boldsymbol{\theta}}_{n}^{(v+1)}-\mathbf{B}\tilde{\mathbf{z}}_{n}^{(v+1)})=0,$$

$$\boldsymbol{\theta}_{i}\text{-update:}\quad \nabla f_{i}(\tilde{\boldsymbol{\theta}}_{i,n}^{(v+1)})+\mathbf{A}_{i}^{\prime}\tilde{\boldsymbol{\xi}}_{i,n}^{(v)}+\rho\,\mathbf{A}_{i}^{\prime}(\mathbf{A}_{i}\tilde{\boldsymbol{\theta}}_{i,n}^{(v+1)}-\mathbf{B}_{i}\tilde{\mathbf{z}}_{n}^{(v)})=0.$$

$$\log\left\{\frac{\delta_{i}(\mathbf{x})}{1-\delta_{i}(\mathbf{x})}\right\},\qquad \log\{\lambda_{i,1}(\mathbf{x})\},\qquad \log\{\lambda_{i,2}(\mathbf{x})\},\qquad \log\{\lambda_{i,3}(\mathbf{x})\}$$

$$\begin{bmatrix}-U_{5}-U_{6}&U_{7}&U_{9}\\U_{5}&-U_{7}-U_{8}&U_{10}\\U_{6}&U_{8}&-U_{9}-U_{10}\end{bmatrix},$$

$$\begin{bmatrix}-U_{11}-U_{12}&U_{13}&U_{15}\\U_{11}&-U_{13}-U_{14}&U_{16}\\U_{12}&U_{14}&-U_{15}-U_{16}\end{bmatrix},$$

Full text

Hierarchical Continuous Time Hidden Markov Model, with Application in Zero-Inflated Accelerometer Data

Zekun Xu

Department of Statistics, North Carolina State University

Eric B. Laber

Department of Statistics, North Carolina State University

Ana-Maria Staicu

Department of Statistics, North Carolina State University

Abstract

Wearable devices including accelerometers are increasingly being used to collect high-frequency human activity data in situ. There is tremendous potential to use such data to inform medical decision making and public health policies. However, modeling such data is challenging as they are high-dimensional, heterogeneous, and subject to informative missingness, e.g., zero readings when the device is removed by the participant. We propose a flexible and extensible continuous-time hidden Markov model to extract meaningful activity patterns from human accelerometer data. To facilitate estimation with massive data we derive an efficient learning algorithm that exploits the hierarchical structure of the parameters indexing the proposed model. We also propose a bootstrap procedure for interval estimation. The proposed methods are illustrated using data from the 2003 - 2004 and 2005 - 2006 National Health and Nutrition Examination Survey.

Keywords: Continuous-time hidden Markov model; Consensus optimization; Accelerometer data.

1 Introduction

The development of wearable technology has given rise to a variety of sensing devices and modalities. Some of these devices, e.g., Fitbit or Apple Watch, can be worn continuously and thereby produce huge volumes of high-frequency human activity data. Because these data place little burden on the wearer to collect and provide rich information on the in situ behavior of the wearer, they have tremendous potential to inform decision making in healthcare. Examples of remote sensing data in healthcare include elder care, remote monitoring of chronic disease, and addiction management (Bartalesi et al., 2005; Marshall et al., 2007; Hansen et al., 2012). Accelerometers are among the most commonly used and most widely studied types of wearable devices; they have been used both in randomized clinical trials to evaluate treatment effect on activity-related impairment (Napolitano et al., 2010; Kanai et al., 2018) and in observational studies to characterize activity patterns in a free-living environment (Morris et al., 2006; Hansen et al., 2012; Xiao et al., 2014). However, despite rapidly growing interest and investment in wearable devices for the study of human activity data, a general and extensible class of models for analysis of the resulting data is lacking.

We propose a continuous-time hidden Markov model for modeling human accelerometer data that aligns with scientific (conceptual) models of human activity data; in the proposed model, latent states correspond to latent (unobserved) activities, e.g., resting, running, jumping, etc., that are shared across the population, yet the accelerometer signatures within these activities are allowed to vary across subjects. Furthermore, differences across subgroups, e.g., defined by sex, age, or the presence or absence of a comorbid condition, can be identified by aggregating individual-level effects across these groups. This work is motivated in part by the physical activity data set from the 2003 - 2004 National Health and Nutrition Examination Survey (NHANES). In this study, human activity patterns were measured at one-minute intervals for up to seven days using the ActiGraph Model 7164 accelerometer (Troiano et al., 2008; Metzger et al., 2008; Schmid et al., 2015). Activity for each minute was recorded as an integer-valued intensity level commonly referred to as an activity count. In the study, subjects were instructed to remove the device during sleep or while washing (to keep it dry). Therefore, the observed data comprise high-frequency, integer-valued activity counts for each subject with intervals of missing values corresponding to when the device was removed.

The goal of this paper is to use the observed data to characterize the activity patterns of each subject, of subjects within pre-defined subgroups, and of the population as a whole. This is important because the estimated physical activity model can serve three purposes. On the subject level, the estimated activity model can be used both for the prediction of future activities and the imputation of missing activity readings. On the subgroup level, the estimated activity patterns provide useful insights into clustering people based on their activity profiles. On the population level, public policies can be designed based on the estimated activity model so as to encourage everyday exercise and a healthy lifestyle.

Prior work on modeling activity counts has focused on aggregation and other smoothing techniques. One common approach is to average the activity counts over time for each subject and then compare the group means using two-sample t-tests (Troiano et al., 2008), analysis of covariance (Hansen et al., 2012), or linear mixed effects models (Cradock et al., 2004). However, in these approaches, averaging focuses on overall activity levels and may obscure trends in activity type, activity duration, and transitions between activities. Another approach is to use functional data analysis methods wherein the integer activity counts are first log-transformed to reduce the right skewness of the distribution, and the transformed activity is then modeled as a function of time of day and other covariates (Morris et al., 2006; Xiao et al., 2014; Gruen et al., 2017). These approaches are best suited for the identification of smooth, cyclical patterns in the data, whereas the observed accelerometer data are characterized by abrupt (i.e., non-smooth) changes in activity levels.

Discrete-time hidden Markov models are another common approach to the analysis of mobility data measured by wearable devices (He et al., 2007; Nickel et al., 2011; Ronao and Cho, 2014; Witowski et al., 2014). In these models, activity is partitioned into different latent behavioral states and the observed activity count is dependent on the unobserved latent activity. The latent states evolve according to a discrete-time Markov process and a primary goal is the correct classification of the latent activity. Constructing and validating these models requires training data that are labeled by latent activity. However, the NHANES data, like many accelerometer studies, are not labeled by activity. Furthermore, our goal is to identify the dynamics of a patient's evolution through these latent activities, including activity duration, activity intensity, and transitions between activities. Discrete-time hidden Markov models have been used to model latent health states and subsequently conduct inference for activity patterns within each state (Scott et al., 2005; Altman, 2007; Shirley et al., 2010), but the time scales in these applications are rather coarse (daily or weekly). By contrast, the physical activity in the NHANES data is measured for each minute; this results in a much larger data volume and the ability to provide a more complete picture of activity dynamics.

A technical limitation of the discrete-time approach is that it assumes that the observations are equally spaced in time. Continuous-time hidden Markov models (CTHMM) have been used to analyze irregularly sampled temporal measurements (Nodelman et al., 2012; Wang et al., 2014; Liu et al., 2015). The flexibility of the CTHMM comes at the expense of increased computational cost, which makes it infeasible for large datasets without modification. Liu et al. (2015) developed an efficient learning algorithm for parameter estimation in CTHMMs. However, this algorithm is only suitable for either the completely pooled or the completely unpooled case, wherein all subjects are assumed to be either completely homogeneous, so that they share the same parameters, or completely heterogeneous, so that all parameters are subject-specific. Moreover, the algorithm cannot estimate the effects of subject covariates and environmental factors on activity counts.

We propose to model minute-by-minute accelerometer data using a hierarchical continuous-time hidden Markov model (HCTHMM). This model is aligned with scientific models of activity as the latent states represent different types of unobserved physical activities. The continuous-time Markov process for the latent state evolution avoids having to impute missing data and allows for the possibility that the temporal measurements are irregularly spaced. Furthermore, the proposed model can incorporate both baseline subject covariates and time-varying environmental factors. The proposed model is hierarchical in that it is parameterized by: (i) subject-specific parameters to account for variability between subjects; (ii) subgroup-specific parameters to account for similarity in activity patterns within groups; (iii) population parameters that are common across all subjects. This specification allows us to pool information on some parameters while retaining between-group and between-subject variability. We propose an estimator of these parameters that is based on consensus optimization using the alternating direction method of multipliers (ADMM). There is a vast literature on the convergence properties of ADMM (Boyd et al., 2011; Shi et al., 2014; Hong and Luo, 2017) which can be readily ported to the proposed algorithm. Finally, we use the nonparametric bootstrap (Efron, 1992) to estimate the sampling distributions of parameter estimators and to conduct statistical inference.

2 Model Framework

We assume that the observed data are of the form $\{{\bm{\mathbf{{W}}}}_{i},Y_{i}({\bm{\mathbf{{T}}}}_{i}),{\bm{\mathbf{{X}}}}_{i}({\bm{\mathbf{{T}}}}_{i})\}_{i=1}^{n}$ , which comprise $n$ independent copies of the trajectory $\{{\bm{\mathbf{{W}}}},Y({\bm{\mathbf{{T}}}}),{\bm{\mathbf{{X}}}}({\bm{\mathbf{{T}}}})\}$ , where: ${\bm{\mathbf{{W}}}}\in\mathbb{R}^{p}$ are baseline subject characteristics; $Y({\bm{\mathbf{{T}}}})=\{Y(T_{1}),\ldots,Y(T_{K})\}$ are the non-negative integer activity counts at times ${\bm{\mathbf{{T}}}}=(T_{1},\ldots,T_{K})\in[0,1]^{K}$ ; and $\mathbf{X}({\bm{\mathbf{{T}}}})=\{{\bm{\mathbf{{X}}}}(T_{1}),\ldots,{\bm{\mathbf{{X}}}}(T_{K})\}$ are concurrent environmental factors such that ${\bm{\mathbf{{X}}}}(\cdot)\in\mathbb{R}^{q}$ . Both ${\bm{\mathbf{{T}}}}$ and $K$ are treated as random variables as the number and timing of observations vary across subjects. We model the evolution of the observed data using a hierarchical continuous-time hidden Markov model (HCTHMM), which we will develop over the remainder of this section.

Let $S_{i}(t)\in\{1,\ldots,M\}$ denote the unobserved latent state for subject $i=1,\ldots,n$ at time $t\in[0,1]$ . The latent state evolves according to a Markov process indexed by: (i) an initial state distribution $\pi_{i}(m)\triangleq P\{S_{i}(0)=m\}$ for $m=1,\ldots,M$ such that $\sum_{m=1}^{M}\pi_{i}(m)=1$ ; (ii) a transition rate matrix ${\bm{\mathbf{{Q}}}}_{i}=\{q_{i}(m,\ell)\}_{m,\ell=1,\ldots,M}$ such that $q_{i}(m,m)=-\sum_{\ell\neq m}q_{i}(m,\ell)$ . The transition rate matrix, also known as the infinitesimal generator matrix, describes the rate of movements between states in a continuous-time Markov chain (Pyke, 1961a, b; Albert, 1962); the transition probabilities are derived from the transition rates through a matrix exponential, so that for $t>u$ and $m,\ell=1,\ldots,M$ , $P_{i}^{t-u}(m,\ell)\triangleq P\{S_{i}(t)=\ell|S_{i}(u)=m\}=\{e^{(t-u){\bm{\mathbf{{Q}}}}_{i}}\}_{m,\ell}$ .
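To make the matrix-exponential step concrete, the following is a minimal sketch that approximates $e^{(t-u)\mathbf{Q}}$ for a small generator matrix using only the Python standard library. The function names, the two-state generator `Q`, and the scaling-and-squaring Taylor scheme are illustrative assumptions, not the implementation used in the paper; in practice a library routine for the matrix exponential would normally be preferred.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probabilities(q, dt, terms=20):
    """Approximate exp(dt * Q) by scaling and squaring a truncated Taylor series."""
    n = len(q)
    # Halve dt * Q until its largest absolute row sum is at most 0.5.
    norm = max(sum(abs(dt * v) for v in row) for row in q)
    squarings = 0
    while norm > 0.5:
        norm /= 2.0
        squarings += 1
    a = [[dt * v / 2 ** squarings for v in row] for row in q]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # After this step, term holds A^k / k!.
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    # Undo the scaling: exp(A * 2^s) = exp(A)^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

# Hypothetical two-state generator: leave state 1 at rate 2, state 2 at rate 1.
Q = [[-2.0, 2.0], [1.0, -1.0]]
P = transition_probabilities(Q, 0.5)

# For a two-state chain the (1, 1) entry has a closed form to compare against.
closed_form = 1.0 / 3.0 + (2.0 / 3.0) * math.exp(-1.5)
```

Each row of `P` is a probability distribution over the state occupied after an elapsed time of 0.5, which is what allows the model to handle unequal gaps between consecutive observations.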

We assume that the conditional distribution of the activity counts is homogeneous in time given the current latent state and environmental factors (to streamline notation, we include baseline characteristics in the time-varying environmental factors). For $i=1,\ldots,n$ and $m=1,\ldots,M$ define $g_{i,m}(y;\mathbf{x})\triangleq P\{Y_{i}(t)=y|S_{i}(t)=m,\mathbf{X}_{i}(t)=\mathbf{x}\}$ . Because longitudinal activity count data are zero-inflated, we set $g_{i,1}(y;\mathbf{x})$ to be the probability mass function for a zero-inflated Poisson distribution with structural zero proportion $\delta_{i}$ and mean $\lambda_{i,1}$ for state 1 such that

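$$g_{i,1}(y;\mathbf{x})=\delta_{i}(\mathbf{x})\,1_{y=0}+\{1-\delta_{i}(\mathbf{x})\}\frac{\{\lambda_{i,1}(\mathbf{x})\}^{y}\exp\{-\lambda_{i,1}(\mathbf{x})\}}{y!}.$$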

For $m=2,\ldots,M$ we model the activity counts using a Poisson regression model so that

$$g_{i,m}(y;\mathbf{x})=\frac{\{\lambda_{i,m}(\mathbf{x})\}^{y}\exp\{-\lambda_{i,m}(\mathbf{x})\}}{y!}.$$

For each subject $i$ , and latent state $m$ , we assume that functions $\delta_{i}(\mathbf{x})$ and $\lambda_{i,m}(\mathbf{x})$ are of the form

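$$\log\left\{\frac{\delta_{i}(\mathbf{x})}{1-\delta_{i}(\mathbf{x})}\right\}=b_{i,0,0}+\mathbf{b}_{i,0,1}^{\prime}\mathbf{x},\qquad \log\{\lambda_{i,m}(\mathbf{x})\}=b_{i,m,0}+\mathbf{b}_{i,m,1}^{\prime}\mathbf{x},\quad m=1,\ldots,M,$$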

where $b_{i,0,0},\ldots,b_{i,M,0}$ and $\mathbf{b}_{i,0,1},\ldots,\mathbf{b}_{i,M,1}$ are unknown coefficients.
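As an illustration of the state-dependent distributions, the sketch below evaluates the zero-inflated Poisson and Poisson probability mass functions and maps covariates to $\delta_{i}(\mathbf{x})$ and $\lambda_{i,m}(\mathbf{x})$. It assumes a logit link for the structural zero proportion and a log link for the Poisson means, with index 0 of the coefficient lists reserved for the zero-inflation component; the function names and the numeric coefficients are hypothetical.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability mass at count y, computed on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def zip_pmf(y, delta, lam):
    """Zero-inflated Poisson: structural zero w.p. delta, otherwise Poisson(lam)."""
    point_mass = 1.0 if y == 0 else 0.0
    return delta * point_mass + (1.0 - delta) * poisson_pmf(y, lam)

def linear_predictor(intercept, slopes, x):
    """Intercept plus inner product of slopes and covariates."""
    return intercept + sum(b * v for b, v in zip(slopes, x))

def state_parameters(b0, b1, x):
    """Return (delta, [lambda_1, ..., lambda_M]) for covariate vector x.

    Assumed layout: b0[0], b1[0] index the logit of the zero proportion;
    b0[m], b1[m] index the log Poisson mean of state m = 1, ..., M.
    """
    delta = 1.0 / (1.0 + math.exp(-linear_predictor(b0[0], b1[0], x)))
    lams = [math.exp(linear_predictor(b0[m], b1[m], x)) for m in range(1, len(b0))]
    return delta, lams

# Hypothetical coefficients for M = 2 states and a single covariate.
b0 = [-1.0, 0.5, 1.5]
b1 = [[0.2], [0.1], [0.3]]
delta, lams = state_parameters(b0, b1, [1.0])
prob_zero_state1 = zip_pmf(0, delta, lams[0])
```

In state 1 a zero count can arise either structurally (for example when the device is removed) or from the Poisson component, which is why `prob_zero_state1` exceeds `delta`.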

In the foregoing model description, all parameters are subject-specific so that each subject’s trajectory can be modeled separately. However, in the HCTHMM, some of the parameters are shared among pre-defined subgroups of the subjects. We assume that subjects are partitioned into $J$ such subgroups based on their baseline characteristics $\mathbf{W}$ . For example, these groups might be determined by age and sex. Subjects within the same group are thought to behave more similarly to each other than to subjects in other groups. Let $G_{i}\in\{1,\ldots,J\}$ be the subgroup to which subject $i$ belongs, and let $n_{j}$ denote the number of subjects in group $j=1,\ldots,J$ .

The HCTHMM is a flexible multilevel model in that it allows for three levels of parameters: (i) subject-specific, (ii) subgroup-specific, (iii) population-level. For example, one might let the intercepts in the generalized linear models for state-dependent parameters be subject-specific to account for the between-subject variability; let the initial state probabilities and the transition rate parameters depend on group membership, i.e. ${\bm{\mathbf{{\pi}}}}_{i_{1}}={\bm{\mathbf{{\pi}}}}_{i_{2}}$ and ${\bm{\mathbf{{Q}}}}_{i_{1}}={\bm{\mathbf{{Q}}}}_{i_{2}}$ for all $i_{1},i_{2}$ such that $G_{i_{1}}=G_{i_{2}}$ ; and let the slope parameters in the generalized linear models for state-dependent parameters be common across all subjects.

If all the observed time points are equally spaced and all the parameters are subject-specific, the HCTHMM reduces to the subject-specific zero-inflated Poisson hidden Markov model. If there are no covariates and all parameters are common for all subjects, the HCTHMM reduces to a zero-inflated variant of the continuous-time hidden Markov model (Liu et al., 2015). The extension from the previous models to the HCTHMM better matches the scientific goals associated with analyzing the NHANES data but also requires new methods for estimation. Because of the hierarchical structure in the parameters, joint parameter estimation is no longer embarrassingly parallelizable as it would be in the case of its completely pooled or unpooled counterparts.

3 Parameter Estimation

3.1 Forward-Backward Algorithm

For subject $i=1,\ldots,n$ , let ${\bm{\mathbf{{a}}}}_{i}\in\mathbb{R}^{M-1}$ be the $M-1$ free parameters in the initial probabilities ${\bm{\mathbf{{\pi}}}}_{i}$ , and let ${\bm{\mathbf{{c}}}}_{i}\in\mathbb{R}^{M(M-1)}$ be the $M(M-1)$ free parameters in the transition rate matrix ${\bm{\mathbf{{Q}}}}_{i}$ . To simplify notation, write ${\bm{\mathbf{{b}}}}_{i,0}\triangleq[b_{i,0,0},\ldots,b_{i,M,0}]\in\mathbb{R}^{M+1}$ and ${\bm{\mathbf{{b}}}}_{i,1}\triangleq[{\bm{\mathbf{{b}}}}_{i,0,1}^{\prime},\ldots,{\bm{\mathbf{{b}}}}_{i,M,1}^{\prime}]\in\mathbb{R}^{q(M+1)}$ to denote the parameters indexing the generalized linear models for the activity counts in each state. Define the entire vector of parameters for subject $i$ to be ${\bm{\mathbf{{\theta}}}}_{i}\triangleq[{\bm{\mathbf{{a}}}}_{i}^{\prime},{\bm{\mathbf{{c}}}}_{i}^{\prime},{\bm{\mathbf{{b}}}}_{i,0}^{\prime},{\bm{\mathbf{{b}}}}_{i,1}^{\prime}]\in\mathbb{R}^{(M+1)(M+q)}$ . The likelihood function for subject $i$ is computed using the forward-backward algorithm (Rabiner, 1989) as follows. For subject $i=1,\ldots,n$ , define the forward variables for $k=1,\ldots,K$ , and $m=1,\ldots,M$ ,

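$$\alpha_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i})\triangleq P\big\{S_{i}(T_{k})=m,\,Y_{i}(T_{1})=y_{T_{1}},\ldots,Y_{i}(T_{k})=y_{T_{k}}\;\big|\;\mathbf{X}_{i}(T_{1})=\mathbf{x}_{T_{1}},\ldots,\mathbf{X}_{i}(T_{k})=\mathbf{x}_{T_{k}}\big\}.$$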

The initialization and recursion formulas are defined as

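$$\alpha_{i}^{T_{1}}(m;\boldsymbol{\theta}_{i})=\pi_{i}(m)\,g_{i,m}(y_{T_{1}};\mathbf{x}_{T_{1}},\boldsymbol{\theta}_{i}),$$

$$\alpha_{i}^{T_{k+1}}(m;\boldsymbol{\theta}_{i})=\Big\{\sum_{\ell=1}^{M}\alpha_{i}^{T_{k}}(\ell;\boldsymbol{\theta}_{i})\,P_{i}^{T_{k+1}-T_{k}}(\ell,m)\Big\}\,g_{i,m}(y_{T_{k+1}};\mathbf{x}_{T_{k+1}},\boldsymbol{\theta}_{i}),$$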

where $m=1,\ldots,M$ and $k=1,\ldots,K-1$ . The negative log-likelihood for ${\bm{\mathbf{{\theta}}}}_{i}$ is therefore $f_{i}({\bm{\mathbf{{\theta}}}}_{i})=-\log\left\{\sum_{m=1}^{M}\alpha_{i}^{T_{K}}(m;{\bm{\mathbf{{\theta}}}}_{i})\right\}$ . Define the joint negative log-likelihood for ${\bm{\mathbf{{\theta}}}}=({\bm{\mathbf{{\theta}}}}_{1},\ldots,{\bm{\mathbf{{\theta}}}}_{n})$ to be $f({\bm{\mathbf{{\theta}}}})=\sum_{i=1}^{n}f_{i}({\bm{\mathbf{{\theta}}}}_{i})$ .
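The forward recursion and the resulting negative log-likelihood for one subject can be sketched as follows. This is an illustrative standard-library implementation under stated assumptions: the emission probabilities are supplied precomputed as `emis[k][m]`, the helper `expm` is a simple truncated-series matrix exponential, and the forward variables are rescaled at every step to avoid underflow (the rescaling is a common numerical device and is not described in the text). All names and numbers are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def expm(q, dt, terms=20):
    """exp(dt * Q) via scaling and squaring of a truncated Taylor series."""
    n, s = len(q), 0
    while max(sum(abs(dt * v) for v in row) for row in q) / 2 ** s > 0.5:
        s += 1
    a = [[dt * v / 2 ** s for v in row] for row in q]
    out = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in out]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, a)]
        out = [[o + t for o, t in zip(ro, rt)] for ro, rt in zip(out, term)]
    for _ in range(s):
        out = matmul(out, out)
    return out

def neg_log_lik(times, emis, pi, q):
    """Scaled forward recursion; emis[k][m] is the emission probability at time k in state m."""
    m_states = len(pi)
    alpha = [p * e for p, e in zip(pi, emis[0])]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for k in range(1, len(times)):
        # Transition probabilities over the (possibly irregular) gap T_k - T_{k-1}.
        p = expm(q, times[k] - times[k - 1])
        alpha = [sum(alpha[l] * p[l][j] for l in range(m_states)) * emis[k][j]
                 for j in range(m_states)]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return -loglik

# Hypothetical subject with three irregularly spaced observations and two states.
times = [0.0, 0.1, 0.4]
emis = [[0.5, 0.1], [0.2, 0.7], [0.3, 0.3]]
pi = [0.6, 0.4]
q = [[-2.0, 2.0], [1.0, -1.0]]
nll = neg_log_lik(times, emis, pi, q)
```

The product of the scaling constants equals $\sum_{m}\alpha_{i}^{T_{K}}(m;\boldsymbol{\theta}_{i})$, so accumulating their logarithms yields $f_{i}(\boldsymbol{\theta}_{i})$ without ever forming the unscaled forward variables.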

To compute the conditional state probabilities in the HCTHMM, we need to generate a set of auxiliary backward variables analogous to the forward variables defined previously. For subject $i=1,\ldots,n$ , define the backward variables for $k=1,\ldots,K$ , and $m=1,\ldots,M$ ,

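$$\beta_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i})\triangleq P\big\{Y_{i}(T_{k+1})=y_{T_{k+1}},\ldots,Y_{i}(T_{K})=y_{T_{K}}\;\big|\;S_{i}(T_{k})=m,\,\mathbf{X}_{i}(T_{k+1})=\mathbf{x}_{T_{k+1}},\ldots,\mathbf{X}_{i}(T_{K})=\mathbf{x}_{T_{K}}\big\}.$$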

The initialization and recursion formulas are

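$$\beta_{i}^{T_{K}}(m;\boldsymbol{\theta}_{i})=1,$$

$$\beta_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i})=\sum_{\ell=1}^{M}P_{i}^{T_{k+1}-T_{k}}(m,\ell)\,g_{i,\ell}(y_{T_{k+1}};\mathbf{x}_{T_{k+1}},\boldsymbol{\theta}_{i})\,\beta_{i}^{T_{k+1}}(\ell;\boldsymbol{\theta}_{i}),\quad k=1,\ldots,K-1,$$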

where $m=1,\ldots,M$ . The probability of state $m$ for subject $i$ at time $t$ is

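$$\gamma_{i}^{t}(m;\boldsymbol{\theta}_{i})=\frac{\alpha_{i}^{t}(m;\boldsymbol{\theta}_{i})\,\beta_{i}^{t}(m;\boldsymbol{\theta}_{i})}{\sum_{\ell=1}^{M}\alpha_{i}^{t}(\ell;\boldsymbol{\theta}_{i})\,\beta_{i}^{t}(\ell;\boldsymbol{\theta}_{i})},$$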

where $t=T_{1},\ldots,T_{K}$ , $m=1,\ldots,M$ , and $i=1,\ldots,n$ . The mean probability of state $m$ among subjects in group $j$ is thus

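$$\phi_{j}(m;\boldsymbol{\theta})=\frac{1}{n_{j}}\sum_{i:\,G_{i}=j}\eta_{i}(m;\boldsymbol{\theta}_{i}),\qquad\text{where}\quad \eta_{i}(m;\boldsymbol{\theta}_{i})=\frac{1}{K}\sum_{k=1}^{K}\gamma_{i}^{T_{k}}(m;\boldsymbol{\theta}_{i}).$$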

The mean state probabilities $\phi_{j}(m;{\bm{\mathbf{{\theta}}}})$ can be interpreted as the mean proportion of time spent in latent state $m$ for subjects in group $j$ , whereas $\eta_{i}(m;{\bm{\mathbf{{\theta}}}}_{i})$ represents the mean proportion of time spent in state $m$ for subject $i$ .
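The following sketch shows how the conditional state probabilities and the per-subject and per-group proportions of time could be computed. It assumes that the transition probability matrices between consecutive observation times have already been evaluated (`trans[k]` covers the gap from observation `k` to `k + 1`), that emission probabilities are supplied as `emis[k][m]`, and that $\eta_{i}$ is the simple average of $\gamma_{i}^{T_{k}}$ over the observed time points; that last choice, like the function names, is an assumption made for illustration.

```python
def forward_backward(pi, trans, emis):
    """Return gamma[k][m] = P(state m at observation k | all observations)."""
    n_obs, m_states = len(emis), len(pi)
    states = range(m_states)
    # Forward pass, normalised at each step for numerical stability.
    alpha = [[pi[m] * emis[0][m] for m in states]]
    alpha[0] = [v / sum(alpha[0]) for v in alpha[0]]
    for k in range(1, n_obs):
        a = [sum(alpha[k - 1][l] * trans[k - 1][l][m] for l in states) * emis[k][m]
             for m in states]
        alpha.append([v / sum(a) for v in a])
    # Backward pass, also normalised; constants cancel when gamma is normalised.
    beta = [[1.0] * m_states for _ in range(n_obs)]
    for k in range(n_obs - 2, -1, -1):
        b = [sum(trans[k][m][l] * emis[k + 1][l] * beta[k + 1][l] for l in states)
             for m in states]
        beta[k] = [v / sum(b) for v in b]
    gamma = []
    for k in range(n_obs):
        w = [alpha[k][m] * beta[k][m] for m in states]
        gamma.append([v / sum(w) for v in w])
    return gamma

def state_proportions(gamma):
    """eta_i(m): average of gamma over the observed time points (assumed form)."""
    return [sum(g[m] for g in gamma) / len(gamma) for m in range(len(gamma[0]))]

def group_mean(etas):
    """phi_j(m): average of eta_i(m) over the subjects in one group."""
    return [sum(e[m] for e in etas) / len(etas) for m in range(len(etas[0]))]

# Hypothetical subject with two observations and two states.
pi = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]]
emis = [[0.5, 0.1], [0.2, 0.7]]
gamma = forward_backward(pi, trans, emis)
eta = state_proportions(gamma)
phi = group_mean([eta, [0.5, 0.5]])
```

Because both passes are normalised independently, only the ratios within each time point are meaningful, which is all that the definition of $\gamma_{i}^{t}$ requires.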

3.2 Consensus Optimization

If all sets of parameters are subject-specific, then the maximum likelihood estimates for the parameters can be obtained by minimizing $f({\bm{\mathbf{{\theta}}}})$ using gradient-based methods, which can be parallelized across subjects. However, in the general setting where parameters are shared across subgroups of subjects, such parallelization is no longer possible. Instead, we use the consensus optimization approach to obtain the maximum likelihood estimates in the HCTHMM, which is performed via the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). We use the Bayesian Information Criterion (BIC) to select the number of latent states $M$ .
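As a rough illustration of BIC-based model selection, the sketch below counts free parameters for a specification with subject-specific intercepts, subgroup-specific initial probabilities and transition rates, and population-level slopes (the "type III" layout described in the caption of Table 4), and then evaluates the usual criterion $2f(\hat{\boldsymbol{\theta}})+p\log N$. Treating $N$ as the total number of observations is an assumption, as are the function names and the numbers in the example.

```python
import math

def n_free_parameters(m_states, q, n_subjects, n_groups):
    """Parameter count: subject intercepts, group-level pi and Q, population slopes."""
    intercepts = n_subjects * (m_states + 1)
    group_level = n_groups * ((m_states - 1) + m_states * (m_states - 1))
    slopes = q * (m_states + 1)
    return intercepts + group_level + slopes

def bic(neg_log_lik, n_params, n_obs):
    """Bayesian Information Criterion from a minimised negative log-likelihood."""
    return 2.0 * neg_log_lik + n_params * math.log(n_obs)

# Hypothetical values: 3 states, 2 covariates, 10 subjects in 2 groups.
p = n_free_parameters(3, 2, 10, 2)
score = bic(100.0, p, 1000)
```

Candidate values of $M$ would each be fitted separately and the specification with the smallest criterion retained.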

Let ${\bm{\mathbf{{D}}}}$ denote a contrast matrix such that ${\bm{\mathbf{{D}}}}{\bm{\mathbf{{\theta}}}}=0$ corresponds to equality of subgroup-specific parameters within each subgroup and equality of all population-level parameters across all subjects. The maximum likelihood estimator solves

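$$\min_{\boldsymbol{\theta}}\;f(\boldsymbol{\theta})\quad\text{s.t.}\quad\mathbf{D}\boldsymbol{\theta}=0.$$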

For the purpose of illustration, suppose that: (i) the intercepts in the generalized linear models for state-dependent parameters are subject-specific; (ii) the initial state probabilities and the transition rate parameters are subgroup-specific; (iii) the slope parameters in the generalized linear models for state-dependent parameters are common across all subjects. Then ${\bm{\mathbf{{D}}}}{\bm{\mathbf{{\theta}}}}=0$ is the same as restricting (i) ${\bm{\mathbf{{a}}}}_{i_{1}}={\bm{\mathbf{{a}}}}_{i_{2}}$ for all $G_{i_{1}}=G_{i_{2}}$ ; (ii) ${\bm{\mathbf{{c}}}}_{i_{1}}={\bm{\mathbf{{c}}}}_{i_{2}}$ for all $G_{i_{1}}=G_{i_{2}}$ ; (iii) ${\bm{\mathbf{{b}}}}_{i,1}={\bm{\mathbf{{b}}}}_{1}$ .
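One way to encode these equality constraints is to collect every shared parameter in a single vector $\mathbf{z}$ and to use 0/1 selection matrices $\mathbf{A}_{i}$ and $\mathbf{B}_{i}$ so that $\mathbf{A}_{i}\boldsymbol{\theta}_{i}=\mathbf{B}_{i}\mathbf{z}$ holds exactly when subject $i$ agrees with its subgroup and with the population. The sketch below builds such matrices for the illustrative example. The orderings of $\boldsymbol{\theta}_{i}$ and $\mathbf{z}$, the zero-based group index, and all function names are hypothetical choices made for this example.

```python
def selection_matrix(picked, n_cols):
    """0/1 matrix whose r-th row selects column picked[r]."""
    return [[1.0 if c == col else 0.0 for c in range(n_cols)] for col in picked]

def consensus_matrices(m_states, q, n_groups, group):
    """Return (A_i, B_i) for a subject in `group` (0-based).

    Assumed layouts:
      theta_i = [a_i, c_i, b_{i,0}, b_{i,1}]   (intercepts b_{i,0} stay subject-specific)
      z       = [a_1..a_J, c_1..c_J, b_1]      (subgroup blocks, then population slopes)
    """
    n_a, n_c = m_states - 1, m_states * (m_states - 1)
    n_b0, n_b1 = m_states + 1, q * (m_states + 1)
    dim_theta = n_a + n_c + n_b0 + n_b1
    dim_z = n_groups * (n_a + n_c) + n_b1
    # Shared coordinates of theta_i: everything except the intercept block.
    theta_idx = list(range(n_a + n_c)) + list(range(n_a + n_c + n_b0, dim_theta))
    # Matching coordinates of z for this subject's group.
    z_idx = (list(range(group * n_a, (group + 1) * n_a))
             + list(range(n_groups * n_a + group * n_c,
                          n_groups * n_a + (group + 1) * n_c))
             + list(range(n_groups * (n_a + n_c), dim_z)))
    return selection_matrix(theta_idx, dim_theta), selection_matrix(z_idx, dim_z)

def mat_vec(a, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

# Hypothetical setting: M = 3 states, q = 2 covariates, J = 2 groups.
A_i, B_i = consensus_matrices(3, 2, 2, group=1)
z = [float(v) for v in range(24)]
# A subject in group 1 that satisfies the constraints, with arbitrary intercepts.
theta_i = z[2:4] + z[10:16] + [9.0, 9.0, 9.0, 9.0] + z[16:24]
residual = [s - t for s, t in zip(mat_vec(A_i, theta_i), mat_vec(B_i, z))]
```

With this encoding the constraint residual is zero for any choice of the subject-specific intercepts, since those coordinates are never selected by $\mathbf{A}_{i}$.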

In our illustrative example, the maximum likelihood estimator solves

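$$\min_{\boldsymbol{\theta},\mathbf{z}}\;f(\boldsymbol{\theta})\quad\text{s.t.}\quad\mathbf{A}_{i}\boldsymbol{\theta}_{i}=\mathbf{B}_{i}\mathbf{z},\quad i=1,\ldots,n,$$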

where ${\bm{\mathbf{{z}}}}$ represents the set of all subgroup-specific and common parameters in ${\bm{\mathbf{{\theta}}}}$ so the linear constraint ${\bm{\mathbf{{A}}}}_{i}{\bm{\mathbf{{\theta}}}}_{i}={\bm{\mathbf{{B}}}}_{i}{\bm{\mathbf{{z}}}}$ is equivalent to ${\bm{\mathbf{{D}}}}{\bm{\mathbf{{\theta}}}}=0$ . The corresponding augmented Lagrangian is

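$$L_{\rho}(\boldsymbol{\theta},\mathbf{z},\boldsymbol{\xi})=\sum_{i=1}^{n}\Big\{f_{i}(\boldsymbol{\theta}_{i})+\boldsymbol{\xi}_{i}^{\prime}(\mathbf{A}_{i}\boldsymbol{\theta}_{i}-\mathbf{B}_{i}\mathbf{z})+\frac{\rho}{2}\|\mathbf{A}_{i}\boldsymbol{\theta}_{i}-\mathbf{B}_{i}\mathbf{z}\|_{2}^{2}\Big\},$$

$$\boldsymbol{\xi}=[\boldsymbol{\xi}_{1}^{\prime},\ldots,\boldsymbol{\xi}_{n}^{\prime}]^{\prime},\qquad \mathbf{A}=\begin{bmatrix}\mathbf{A}_{1}&0&\cdots&0\\0&\mathbf{A}_{2}&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\mathbf{A}_{n}\end{bmatrix},\qquad \mathbf{B}=\begin{bmatrix}\mathbf{B}_{1}\\\mathbf{B}_{2}\\\vdots\\\mathbf{B}_{n}\end{bmatrix}.$$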

Here ${\bm{\mathbf{{\xi}}}}$ are the Lagrange multipliers and $\rho$ is a pre-specified positive penalty parameter. Let $\tilde{{\bm{\mathbf{{\theta}}}}}^{(v)}_{n},\tilde{{\bm{\mathbf{{z}}}}}^{(v)}_{n},\tilde{{\bm{\mathbf{{\xi}}}}}^{(v)}_{n}$ be the $v^{th}$ iterates of ${\bm{\mathbf{{\theta}}}},{\bm{\mathbf{{z}}}},{\bm{\mathbf{{\xi}}}}$ . Then, at iteration $v+1$ , the ADMM updates are

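$$\boldsymbol{\theta}\text{-update:}\quad \nabla f(\tilde{\boldsymbol{\theta}}_{n}^{(v+1)})+\mathbf{A}^{\prime}\tilde{\boldsymbol{\xi}}_{n}^{(v)}+\rho\,\mathbf{A}^{\prime}(\mathbf{A}\tilde{\boldsymbol{\theta}}_{n}^{(v+1)}-\mathbf{B}\tilde{\mathbf{z}}_{n}^{(v)})=0,$$

$$\mathbf{z}\text{-update:}\quad \mathbf{B}^{\prime}\tilde{\boldsymbol{\xi}}_{n}^{(v)}+\rho\,\mathbf{B}^{\prime}(\mathbf{A}\tilde{\boldsymbol{\theta}}_{n}^{(v+1)}-\mathbf{B}\tilde{\mathbf{z}}_{n}^{(v+1)})=0,$$

$$\boldsymbol{\xi}\text{-update:}\quad \tilde{\boldsymbol{\xi}}_{n}^{(v+1)}-\tilde{\boldsymbol{\xi}}_{n}^{(v)}-\rho\,(\mathbf{A}\tilde{\boldsymbol{\theta}}_{n}^{(v+1)}-\mathbf{B}\tilde{\mathbf{z}}_{n}^{(v+1)})=0,$$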

where the most computationally expensive ${\bm{\mathbf{{\theta}}}}$ -update can be programmed in parallel across each $i=1,\ldots,n$ as

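$$\boldsymbol{\theta}_{i}\text{-update:}\quad \nabla f_{i}(\tilde{\boldsymbol{\theta}}_{i,n}^{(v+1)})+\mathbf{A}_{i}^{\prime}\tilde{\boldsymbol{\xi}}_{i,n}^{(v)}+\rho\,\mathbf{A}_{i}^{\prime}(\mathbf{A}_{i}\tilde{\boldsymbol{\theta}}_{i,n}^{(v+1)}-\mathbf{B}_{i}\tilde{\mathbf{z}}_{n}^{(v)})=0.$$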

The gradients $\nabla f_{i}(\cdot)$ for subject $i$ ’s HMM parameters can be computed using Fisher’s identity (Cappé et al., 2005) based on the efficient EM algorithm proposed in Liu et al. (2015). The details are included in the Supplementary Materials. Gradients are needed in our model because of both the ADMM update and the covariate structure. Even when there are no covariates and all parameters are shared across subjects, the gradient method is still faster than the EM algorithm because the M-step is expensive for the zero-inflated Poisson distribution.
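The structure of the three ADMM updates can be seen on a toy consensus problem in which each $f_{i}$ is a scalar quadratic, so that every update has a closed form. This sketch replaces the HMM negative log-likelihood with $f_{i}(\theta_{i})=\tfrac{1}{2}(\theta_{i}-y_{i})^{2}$ and takes $\mathbf{A}_{i}=\mathbf{B}_{i}=1$; it is meant only to illustrate the order and form of the updates, not the estimation procedure for the HCTHMM. The function name and data are hypothetical.

```python
def consensus_admm(y, rho=1.0, iters=200):
    """ADMM for min sum_i 0.5 * (theta_i - y_i)^2 subject to theta_i = z for all i."""
    n = len(y)
    theta, z, xi = list(y), 0.0, [0.0] * n
    for _ in range(iters):
        # theta_i-update (parallel over i): (theta_i - y_i) + xi_i + rho * (theta_i - z) = 0.
        theta = [(y_i - xi_i + rho * z) / (1.0 + rho) for y_i, xi_i in zip(y, xi)]
        # z-update: sum_i {xi_i + rho * (theta_i - z)} = 0, using the new theta.
        z = sum(t + x / rho for t, x in zip(theta, xi)) / n
        # xi-update: xi_i <- xi_i + rho * (theta_i - z), using the new theta and z.
        xi = [x + rho * (t - z) for x, t in zip(xi, theta)]
    return theta, z, xi

# Hypothetical data; the constrained minimiser is the sample mean, 3.0.
theta, z, xi = consensus_admm([1.0, 2.0, 6.0])
```

Only the $\theta_{i}$-update touches the subject-level objective, so it is the step that can be distributed across subjects; the $z$- and $\xi$-updates are cheap aggregations.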

3.3 Theoretical Properties

Define $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$ to be the maximum likelihood estimator for ${\bm{\mathbf{{\theta}}}}$ and let ${\bm{\mathbf{{\theta}}}}^{\star}$ denote its population-level analog. The following are the sufficient conditions to ensure: (i) almost sure convergence of $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$ to ${\bm{\mathbf{{\theta}}}}^{\star}$ as $n\to\infty$ , and (ii) numerical convergence of $\tilde{{\bm{\mathbf{{\theta}}}}}_{n}^{(v)}$ to $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$ as $v\to\infty$ .

(A0) The true parameter vector ${\bm{\mathbf{{\theta}}}}^{\star}$ for the unconstrained optimization problem $\underset{{\bm{\mathbf{{\theta}}}}}{\min}f({\bm{\mathbf{{\theta}}}})$ is an interior point of $\mathbf{\Theta}$ , where $\mathbf{\Theta}$ is a compact subset of $\mathbb{R}^{\dim\,{\bm{\mathbf{{\theta}}}}}$ .

(A1) The constraint set $\mathcal{C}\triangleq\{{\bm{\mathbf{{\theta}}}}\in\mathbf{\Theta};{\bm{\mathbf{{D}}}}{\bm{\mathbf{{\theta}}}}=0\}$ is nonempty and, for some $r\in\mathbb{R}$ , the set $\{{\bm{\mathbf{{\theta}}}}\in\mathcal{C};f({\bm{\mathbf{{\theta}}}})\leq r\}$ is nonempty and compact.

(A2) The observed time process $(T_{k}:k\in\mathbb{N})$ is independent of the generative hidden Markov process: the likelihood for the observed times does not share parameters with ${\bm{\mathbf{{\theta}}}}$ .

(A3) There exist real numbers $0<\kappa^{-}\leq\kappa^{+}<1$ such that, for all subjects $i=1,\ldots,n$ , $\kappa^{-}\leq P_{i}^{T_{k+1}-T_{k}}(m,\ell)\leq\kappa^{+}$ for $k=1,\ldots,K-1$ and $m,\ell=1,\ldots,M$ , almost surely, and $g_{i,m}(y;{\bm{\mathbf{{x}}}})>0$ for all $y\in\textrm{ supp }Y$ for some $m=1,\ldots,M$ .

(A4) For each ${\bm{\mathbf{{\theta}}}}\in\mathbf{\Theta}$ , the transition kernel indexed by ${\bm{\mathbf{{\theta}}}}$ is Harris recurrent and aperiodic. The transition kernel is continuous in $\boldsymbol{\theta}$ in an open neighborhood of $\boldsymbol{\theta}^{\star}$ .

(A5) The hidden Markov model is identifiable up to label switching of the latent states.

(A6) The negative log-likelihood function $f({\bm{\mathbf{{\theta}}}})$ is twice differentiable with respect to ${\bm{\mathbf{{\theta}}}}$ with bounded, continuous derivatives. Let $e_{min}({\bm{\mathbf{{\theta}}}})$ and $e_{max}({\bm{\mathbf{{\theta}}}})$ denote the smallest and largest eigenvalues of $\nabla^{2}f({\bm{\mathbf{{\theta}}}})$ ; then there exist real numbers $0<\varrho_{-}\leq\varrho_{+}<\infty$ such that $e_{min}({\bm{\mathbf{{\theta}}}})\geq\varrho_{-}$ and $e_{max}({\bm{\mathbf{{\theta}}}})\leq\varrho_{+}$ for all ${\bm{\mathbf{{\theta}}}}\in{\bm{\mathbf{{\Theta}}}}$ .

Assumptions (A0) - (A2) are mild regularity conditions whereas (A3) - (A5) are standard in latent state space models; together they ensure that the model is well-defined. Assumption (A4) avoids non-standard asymptotic behavior associated with non-smooth functionals. Assumption (A6) can be used to show Lipschitz continuity of the gradient and strong local convexity, which are used to establish the numerical convergence of the ADMM algorithm (Shi et al., 2014).

Theorem 3.1.

Under assumptions (A0) - (A6), as $T_{i}\to\infty$ for $i=1,\ldots,n$ ,

(i) $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$ converges to ${\bm{\mathbf{{\theta}}}}^{\star}$ almost surely as $n\to\infty$ ,

(ii) $\tilde{{\bm{\mathbf{{\theta}}}}}_{n}^{(v)}$ converges numerically to $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$ in an open neighborhood of ${\bm{\mathbf{{\theta}}}}^{\star}$ as $v\to\infty$ .

The first part of Theorem 3.1 states the almost sure convergence of the constrained maximum likelihood estimator $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$ to the true parameter value ${\bm{\mathbf{{\theta}}}}^{\star}$. This can be shown using the uniform convergence results for the log likelihood (Douc et al., 2004) for each subject-specific hidden Markov model, along with the feasibility assumption (A1) and the identifiability assumption (A5). The second part of Theorem 3.1 states the numerical convergence of the ADMM algorithm. This is anticipated by Boyd et al. (2011), who identify general conditions for the numerical convergence of the residual, the dual variable, and the objective function. Shi et al. (2014) extended the convergence to the primal variable by adding Lipschitz continuity and strong convexity assumptions. The details of those assumptions, as well as the proof of Theorem 3.1, are included in the Supplementary Materials.
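The kind of convergence described here can be seen in a toy problem. The following is a minimal sketch of scaled-form consensus ADMM in the style of Boyd et al. (2011), applied to a hypothetical objective, $\sum_{i}\tfrac{1}{2}(x_{i}-a_{i})^{2}$ subject to $x_{i}=z$, chosen only because it is strongly convex with a Lipschitz gradient. It is not the HCTHMM likelihood or the paper's implementation; the function name, the data `a`, and the penalty `rho` are illustrative assumptions.

```python
def consensus_admm(a, rho=1.0, iters=200):
    """Scaled-form consensus ADMM for a toy problem (illustrative only):
    minimize sum_i 0.5 * (x_i - a_i)**2  subject to  x_i = z for all i."""
    n = len(a)
    x = [0.0] * n   # local (primal) variables, one per "subject"
    u = [0.0] * n   # scaled dual variables
    z = 0.0         # shared consensus variable
    history = []
    for _ in range(iters):
        # Local step: closed-form minimizer of 0.5*(x - a_i)^2 + (rho/2)*(x - z + u_i)^2.
        x = [(a_i + rho * (z - u_i)) / (1.0 + rho) for a_i, u_i in zip(a, u)]
        z_old = z
        # Consensus step: average the local estimates plus scaled duals.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Dual step: accumulate the consensus violation.
        u = [u_i + x_i - z for u_i, x_i in zip(u, x)]
        primal_res = max(abs(x_i - z) for x_i in x)   # how far x_i is from z
        dual_res = rho * abs(z - z_old)               # how much z is still moving
        history.append((primal_res, dual_res))
    return x, z, u, history

# Hypothetical data: the constrained minimizer is the average of a, i.e. 3.0.
x, z, u, history = consensus_admm([1.0, 2.0, 6.0])
```

In this toy case the primal residual, the dual residual, and the primal variables `x` all converge, which mirrors the type of statement made in part (ii) of the theorem.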

For $j=1,\ldots,J$ and $m=1,\ldots,M$, define the estimator for the mean proportion of time in state $m$ in group $j$ as $\hat{\phi}_{j,n}(m)$, and, for $i=1,\ldots,n$, the estimator for the mean proportion of time in state $m$ for subject $i$ as $\hat{\eta}_{i,k_{i}}(m)$, where $k_{i}$ is the number of observed time points for subject $i$. The following result characterizes the limiting behavior of the estimated time in each state.

Theorem 3.2.

Under (A0) - (A6), as $k_{i}\to\infty$ for $i=1,\ldots,n$ , $n_{j}\to\infty$ for $j=1,\ldots,J$ ,

(i) $\hat{\phi}_{j,n}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})$ converges to $\mu_{j}(m;{\bm{\mathbf{{\theta}}}}^{\star})$ almost surely,

(ii) $\frac{\hat{\phi}_{j,n}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})-\mu_{j}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})}{\sqrt{\sigma_{j}^{2}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})/n_{j}}}$ converges in distribution to a standard normal random variable, where $\mu_{j}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})=E[\hat{\eta}_{i,k_{i}}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})]$ and $\sigma_{j}^{2}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})=Var[\hat{\eta}_{i,k_{i}}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})]$ for all $i$ such that $G_{i}=j$.

A proof of the preceding result is given in the Supplementary Materials; it follows from the almost sure convergence of a bounded continuous function and the central limit theorem. In principle, $\mu_{j}(m;\hat{{\bm{\mathbf{{\theta}}}}}_{n})$ can be obtained from the limiting distribution of a stationary continuous-time Markov chain, which is determined by the transition rates as a function of $\hat{{\bm{\mathbf{{\theta}}}}}_{n}$. However, it is generally not easy to compute the standard error of the estimated mean state probabilities analytically. Instead, we use a stratified nonparametric bootstrap (Efron, 1992) in which we resample subjects with replacement from each subgroup.
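The resampling scheme can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it treats the per-subject estimated proportions of time in one state as fixed inputs and resamples them within each subgroup, whereas a full implementation might refit the model on every bootstrap sample. The percentile interval, the function name, and the numbers in `props` are all assumptions made for the example.

```python
import random
from statistics import mean

def stratified_bootstrap_ci(props_by_group, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for each subgroup mean.

    props_by_group maps a subgroup label to a list of per-subject
    proportions of time spent in a given latent state. Subjects are
    resampled with replacement separately within each subgroup.
    """
    rng = random.Random(seed)
    alpha = 1.0 - level
    intervals = {}
    for group, props in props_by_group.items():
        # One bootstrap replicate = the mean of a same-size resample of subjects.
        boot_means = sorted(
            mean(rng.choices(props, k=len(props))) for _ in range(n_boot)
        )
        lower = boot_means[int((alpha / 2) * (n_boot - 1))]
        upper = boot_means[int((1 - alpha / 2) * (n_boot - 1))]
        intervals[group] = (lower, upper)
    return intervals

# Hypothetical per-subject proportions for two subgroups (not real data).
props = {
    "group_1": [0.30, 0.42, 0.35, 0.28, 0.39, 0.33, 0.36, 0.31],
    "group_2": [0.55, 0.48, 0.60, 0.52, 0.58, 0.50, 0.57, 0.53],
}
intervals = stratified_bootstrap_ci(props)
```

Resampling within subgroups keeps each subgroup size $n_{j}$ fixed across replicates, which matches the stratified design described above.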

4 Simulation experiments

We study the finite sample performance of the proposed estimator for the state probabilities using a suite of simulation experiments. We simulate minute-by-minute activity counts of length $T\sim\textrm{Uniform}(500,2500)$ for $n=20$ and $n=200$ subjects, where half of the subjects are male (Group 1) and the other half female (Group 2). The intervals between consecutive time points are independently drawn from $\{1,2,\ldots,10\}$ with equal probabilities. For each subject, we assume 2/7 of the observations are from weekends and 5/7 of the observations are from weekdays.

The activity counts are generated using a three-state continuous-time zero-inflated Poisson hidden Markov model. We assume that during the weekend the log mean activity decreases by 10%, 20%, and 30% in states 1, 2, and 3, respectively, while the log odds of zero in state 1 increases by 10%, so that

[TABLE]

where $b_{i,0,0}\overset{iid}{\sim}N(-1,0.1^{2}),b_{i,1,0}\overset{iid}{\sim}N\left\{\log(50),0.1^{2}\right\},b_{i,2,0}\overset{iid}{\sim}N\left\{\log(300),0.1^{2}\right\},b_{i,3,0}\overset{iid}{\sim}N\left\{\log(700),0.1^{2}\right\}$ are subject-specific intercepts; the weekend effect is assumed to be common across all subjects.

The initial probabilities for male subjects are $(U_{1},U_{2},1-U_{1}-U_{2})$, where $U_{1},U_{2}\overset{iid}{\sim}\mathrm{Uniform}(0.2,0.4)$; for female subjects, the initial probabilities are $(U_{3},U_{4},1-U_{3}-U_{4})$, where $U_{3}\sim\mathrm{Uniform}(0.6,0.8)$ and $U_{4}\sim\mathrm{Uniform}(0.1,0.2)$. The transition rate matrix for male subjects is

[TABLE]

where $U_{5},\ldots,U_{10}\overset{iid}{\sim}\mathrm{Uniform}(0.05,0.15)$; for female subjects, the transition rate matrix is

[TABLE]

where $U_{11},U_{12}\overset{iid}{\sim}\mathrm{Uniform}(0.05,0.1)$ , $U_{13},U_{15}\overset{iid}{\sim}\mathrm{Uniform}(0.3,0.4)$ , and $U_{14},U_{16}\overset{iid}{\sim}\mathrm{Uniform}(0.1,0.2)$ .
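To illustrate how a latent path can be generated at irregular observation times, the sketch below uses the standard fact that a continuous-time Markov chain with rate matrix $Q$ has transition probability matrix $\exp(Q\Delta)$ over a gap of length $\Delta$. The rate matrix `Q` (all off-diagonal rates set to 0.1) and the initial probabilities are hypothetical stand-ins, since the exact matrices are not reproduced here, and the small `mat_exp` helper is an assumption of this example rather than code from the paper.

```python
import random

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=20, squarings=10):
    """exp(q * t) via scaling and squaring with a truncated Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes a^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def simulate_states(q, init, n_obs, rng):
    """Latent states observed at times whose gaps are uniform on {1, ..., 10}."""
    trans = {gap: mat_exp(q, gap) for gap in range(1, 11)}  # P(gap) = exp(Q * gap)
    labels = list(range(len(q)))
    state = rng.choices(labels, weights=init)[0]
    t = 0
    times, path = [t], [state]
    for _ in range(n_obs - 1):
        gap = rng.randint(1, 10)
        t += gap
        state = rng.choices(labels, weights=trans[gap][state])[0]
        times.append(t)
        path.append(state)
    return times, path

# Hypothetical rate matrix: rows sum to zero, off-diagonal rates are 0.1.
Q = [[-0.2, 0.1, 0.1],
     [0.1, -0.2, 0.1],
     [0.1, 0.1, -0.2]]
rng = random.Random(1)
times, path = simulate_states(Q, [0.3, 0.3, 0.4], n_obs=1000, rng=rng)
P1 = mat_exp(Q, 1)  # one-minute transition probabilities
```

Because only ten distinct gaps occur, the transition matrices are computed once and reused, which avoids a matrix exponential per observation.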

Table 1 shows the bias and standard error of the estimators for the different hierarchies of parameters in the HCTHMM, based on 500 simulations. In both cases, the biases are small because each individual series is long. As the sample size increases, the standard errors become smaller, as expected. Figure 1 shows that the average runtime (in seconds) of each ADMM iteration scales linearly with the number of subjects. The algorithm generally takes 30 to 100 iterations to converge. Table 2 compares the mean coverage probability of the 95% and 99% bootstrap confidence intervals for the mean proportion of time in each latent state between a baseline subject-specific HMM and the proposed HCTHMM when the sample size is 200. The baseline subject-specific HMM suffers from undercoverage (coverage probability smaller than the nominal level) in some of the latent states, whereas the proposed HCTHMM attains the nominal level well in both the 95% and 99% cases.

5 Application

The motivating application is a human physical activity data set from the 2003 - 2004 National Health and Nutrition Examination Survey (NHANES), which is publicly available on the Centers for Disease Control and Prevention (CDC) website https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/PAXRAW_C.htm. There are 7,176 participants in the study, and for each participant we have minute-by-minute activity counts for up to seven days. As the subjects were supposed to remove the accelerometer when washing, there are prolonged intervals during the day when the accelerometer readings are zero. We further impose the following two inclusion / exclusion criteria:

• Subjects whose age is between 20 and 60 are included.

• Subjects with very few measurements are excluded.

The first criterion specifies the scope of inference. The second criterion excludes subjects with very few non-missing data available ($<$ 500 minutes out of 7 days). There are 2,467 subjects who satisfy both conditions; they constitute more than 95% of those whose age is between 20 and 60. Further, we split those subjects by their baseline characteristics (gender, age) into 4 subgroups. Subgroup 1 consists of 608 male subjects with age from 20 to 40; subgroup 2 consists of 557 male subjects with age from 40 to 60; subgroup 3 consists of 712 female subjects with age from 20 to 40; and subgroup 4 consists of 590 female subjects with age from 40 to 60.

Table 3 summarizes related work on how long a period of zero activity counts must be to be defined as missingness. In this paper, we define missingness as a sustained interval of at least 20 consecutive zero activity counts, which is the most commonly used criterion in the literature. Most missingness occurs between 10 pm and 8 am, which is the sleep time for most of the subjects. There is still sporadic missingness during other periods of the day, which may correspond to activities like swimming or bathing. The missingness periods are removed during data preprocessing. The average proportion of zeros after removing the missingness is around 25%, so zero-inflation is still an issue to be considered in the modeling. In the data preprocessing, activity counts greater than 1,500 ($<$ 5%) are truncated at 1,500 to ensure the numerical stability of the fitting algorithm.
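The two preprocessing rules above can be sketched in a few lines. This is an illustrative reading of the description, not the authors' code: it assumes the input is a single subject's sequence of consecutive minute-level readings, and the function and argument names are hypothetical.

```python
from itertools import groupby

def preprocess_counts(times, counts, min_zero_run=20, cap=1500):
    """Drop runs of at least `min_zero_run` consecutive zeros (treated as
    missing), then truncate the remaining counts at `cap`.

    Assumes consecutive entries are consecutive minutes, so a run of zeros
    in `counts` corresponds to a sustained interval of zero readings.
    """
    kept_times, kept_counts = [], []
    idx = 0
    for is_zero, run in groupby(counts, key=lambda c: c == 0):
        run = list(run)
        is_missing = is_zero and len(run) >= min_zero_run
        if not is_missing:
            for offset, c in enumerate(run):
                kept_times.append(times[idx + offset])
                kept_counts.append(min(c, cap))
        idx += len(run)
    return kept_times, kept_counts

# Hypothetical readings: a short zero run is kept, a 25-minute zero run is removed.
counts = [5, 0, 0, 2000] + [0] * 25 + [7, 0, 3]
times = list(range(len(counts)))
kept_times, kept_counts = preprocess_counts(times, counts)
# kept_counts == [5, 0, 0, 1500, 7, 0, 3]
# kept_times  == [0, 1, 2, 3, 29, 30, 31]
```

In the example, the retained time stamps jump from 3 to 29 where the long zero run was removed, so the remaining observations are no longer equally spaced, while the short zero runs that are kept still contribute to zero-inflation.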

To apply the HCTHMM to the activity count data, we need to select the number of latent states as well as the hierarchy for the different sets of parameters. The weekend effect is adjusted for in the Poisson and zero-inflated Poisson regressions on the activity counts in each latent state. By the minimum BIC criterion, as shown in Table 4, we select the type IV HCTHMM with 6 latent states, in which the intercepts in the state-dependent generalized linear models for the logit zero proportion in state 1 and the log Poisson means in all states are subject-specific, while the initial probabilities, transition rates, and slopes in the state-dependent generalized linear models are subgroup-specific. This final model indicates that the baseline zero proportion and mean activity counts in each latent state vary across subjects. For all the other parameters, the between-subgroup variability is more prominent than the within-subgroup variability.

Figure 2 shows the 99% confidence intervals for the estimated proportion of time spent in each latent activity state for each subgroup in the 2003 - 2004 NHANES. There are several interesting findings. First, younger men spend less time in the low intensity activity states (states 1 and 2) than older men and women. Second, men spend less time than women in the medium intensity activity states (states 3 and 4). Third, men spend more time than women in the high intensity activity states (states 5 and 6). Figure 3 plots the estimated quantiles against the observed quantiles of the accelerometer data; the points align well along the 45-degree line, indicating no lack of fit. To validate the results, we apply the HCTHMM methodology to the 2005 - 2006 NHANES, which has the same study setup and data structure as the 2003 - 2004 NHANES. Figure 4 shows the 99% confidence intervals for the estimated proportion of time spent in each latent activity state for each subgroup in the 2005 - 2006 NHANES, which exhibit a pattern similar to that seen in the 2003 - 2004 NHANES.

6 Conclusions

We propose the HCTHMM as a valid inference strategy for longitudinal activity data. Within this framework, we can estimate the mean state probabilities for different subgroups of subjects as well as quantify the uncertainty. Our findings are consistent with the previous literature on human physical activity (Metzger et al., 2008; Troiano et al., 2008; Hansen et al., 2012; Xiao et al., 2014), which indicated that physical activity can be classified into different categories by intensity, and that the activity level decreases as a result of aging. Moreover, women tend to spend more time in lighter intensity activity, whereas younger men tend to have periods of higher intensity activity.

In the future, the HCTHMM framework can be extended to controlled clinical studies to estimate treatment effects in a specific cohort of patients. We can also allow for time-varying covariates in the transition rates. Moreover, when some model parameters are truly subject-specific or subgroup-specific, it may be more powerful to model them as random so that tests based on variance components can be constructed to test their effects. Another modification is to extend the latent continuous-time Markov process to a semi-Markov process. This would be scientifically interesting because it is reasonable to assume that the current latent state depends not only on the most recent past state but also on the history of the state trajectory. However, all of these changes are computationally expensive, especially on such large-scale high-frequency data. Corresponding estimation methods have to be developed before the application becomes feasible.

Bibliography

Albert, A. (1962). Estimating the infinitesimal generator of a continuous time, finite state Markov process. The Annals of Mathematical Statistics, 727–753.
Altman, R. M. (2007). Mixed hidden Markov models: an extension of the hidden Markov model to the longitudinal data setting. Journal of the American Statistical Association 102 (477), 201–210.
Bartalesi, R., F. Lorussi, M. Tesconi, A. Tognetti, G. Zupone, and D. De Rossi (2005). Wearable kinesthetic system for capturing and classifying upper limb gesture. In Eurohaptics Conference, 2005 and Symposium on Haptic Interfaces for Virtual Environment and Teleoperator Systems, 2005. World Haptics 2005. First Joint, pp. 535–536. IEEE.
Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), 1–122.
Cappé, O., E. Moulines, and T. Rydén (2005). Inference in hidden Markov models. In Springer Series in Statistics.
Catellier, D. J., P. J. Hannan, D. M. Murray, C. L. Addy, T. L. Conway, S. Yang, and J. C. Rice (2005). Imputation of missing data when measuring physical activity by accelerometry. Medicine and Science in Sports and Exercise 37 (11 Suppl), S555.
Cradock, A. L., J. L. Wiecha, K. E. Peterson, A. M. Sobol, G. A. Colditz, and S. L. Gortmaker (2004). Youth recall and tritrac accelerometer estimates of physical activity levels. Medicine and Science in Sports and Exercise 36 (3), 525–532.
Douc, R., E. Moulines, T. Rydén, et al. (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. The Annals of Statistics 32 (5), 2254–2304.
