Asymptotic Properties of the Maximum Likelihood Estimator in Regime Switching Econometric Models

Hiroyuki Kasahara; Katsumi Shimotsu

arXiv:1705.10445·math.ST·June 29, 2018

TL;DR

This paper proves the consistency and asymptotic normality of the maximum likelihood estimator in Markov regime switching models whose transition probability matrix contains zeros, including models in which the regime-specific density depends on both the current and the lagged regimes.

Contribution

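It establishes the asymptotic properties of the MLE for a broad class of Markov regime switching models for which no rigorous proof was previously available, extending the results of Douc et al. (2004) to transition densities that are not bounded away from zero.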

Findings

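1. The MLE is consistent and asymptotically normal in these models.
2. The estimator of the asymptotic covariance matrix is consistent.
3. The results apply to models with zero transition probabilities and to models whose regime-specific density depends on both the current and the lagged regimes.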

Abstract

Markov regime switching models have been widely used in numerous empirical applications in economics and finance. However, the asymptotic distribution of the maximum likelihood estimator (MLE) has not been proven for some empirically popular Markov regime switching models. In particular, the asymptotic distribution of the MLE has been unknown for models in which some elements of the transition probability matrix have the value of zero, as is commonly assumed in empirical applications with models with more than two regimes. This also includes models in which the regime-specific density depends on both the current and the lagged regimes such as the seminal model of Hamilton (1989) and switching ARCH model of Hamilton and Susmel (1994). This paper shows the asymptotic normality of the MLE and consistency of the asymptotic covariance matrix estimate of these models.

Tables

Table 1: Coverage probability of the asymptotic 95% confidence intervals for Hamilton’s model

Panel A: 95% confidence intervals constructed from (18)

| | $p_{11}$ | $p_{21}$ | $\beta_{1}$ | $\beta_{2}$ | $\beta_{3}$ | $\beta_{4}$ | $\mu_{1}$ | $\mu_{2}$ | $\sigma$ |
|---|---|---|---|---|---|---|---|---|---|
| $n = 200$ | 0.916 | 0.911 | 0.938 | 0.926 | 0.944 | 0.925 | 0.916 | 0.896 | 0.875 |
| $n = 400$ | 0.938 | 0.933 | 0.930 | 0.944 | 0.943 | 0.937 | 0.946 | 0.929 | 0.922 |
| $n = 800$ | 0.942 | 0.942 | 0.945 | 0.941 | 0.950 | 0.956 | 0.939 | 0.941 | 0.930 |

Panel B: 95% confidence intervals constructed from the OPG estimator

| | $p_{11}$ | $p_{21}$ | $\beta_{1}$ | $\beta_{2}$ | $\beta_{3}$ | $\beta_{4}$ | $\mu_{1}$ | $\mu_{2}$ | $\sigma$ |
|---|---|---|---|---|---|---|---|---|---|
| $n = 200$ | 0.915 | 0.920 | 0.938 | 0.927 | 0.941 | 0.934 | 0.922 | 0.901 | 0.884 |
| $n = 400$ | 0.932 | 0.932 | 0.938 | 0.949 | 0.942 | 0.939 | 0.945 | 0.929 | 0.923 |
| $n = 800$ | 0.943 | 0.945 | 0.945 | 0.939 | 0.949 | 0.956 | 0.936 | 0.937 | 0.929 |

Table 2: Coverage probability of the asymptotic 95% confidence intervals for the MS-CD model

Panel A: 95% confidence intervals from (18)

| | $p_{11}$ | $p_{22}$ | $\mu_{1}$ | $\mu_{2}$ | $\beta$ | $\gamma$ |
|---|---|---|---|---|---|---|
| $n = 200$ | 0.713 | 0.759 | 0.757 | 0.876 | 0.928 | 0.910 |
| $n = 400$ | 0.827 | 0.844 | 0.861 | 0.902 | 0.956 | 0.919 |
| $n = 800$ | 0.892 | 0.909 | 0.932 | 0.949 | 0.987 | 0.952 |

Panel B: 95% confidence intervals from (17)

| | $p_{11}$ | $p_{22}$ | $\mu_{1}$ | $\mu_{2}$ | $\beta$ | $\gamma$ |
|---|---|---|---|---|---|---|
| $n = 200$ | 0.713 | 0.759 | 0.757 | 0.876 | 0.928 | 0.910 |
| $n = 400$ | 0.835 | 0.846 | 0.862 | 0.903 | 0.957 | 0.919 |
| $n = 800$ | 0.894 | 0.910 | 0.932 | 0.950 | 0.987 | 0.950 |

Table 3: Estimates of the MS-CD model for the FIAT stock duration

| | $M = 2$ Estimate | $M = 2$ S.D. | $M = 3$ Estimate | $M = 3$ S.D. |
|---|---|---|---|---|
| $\mu_{1}$ | 0.483 | 0.013 | 0.359 | 0.009 |
| $\mu_{2}$ | 1.138 | 0.021 | 0.717 | 0.043 |
| $\mu_{3}$ | | | 1.290 | 0.086 |
| $\beta$ | 0.053 | 0.012 | 0.032 | 0.017 |
| $\gamma$ | 0.987 | 0.009 | 1.005 | 0.011 |
| $p_{11}$ | 0.991 | 0.002 | 0.991 | 0.009 |
| $p_{21}$ | | | 0.003 | 0.004 |
| $p_{22}$ | 0.996 | 0.001 | 0.984 | 0.008 |
| $p_{33}$ | | | 0.991 | 0.008 |
| Log-likelihood | -7037.90 | | -6972.75 | |

Equations

$$Y_{k}=f_{\theta}(Y_{k-1},\ldots,Y_{k-s},X_{k};\varepsilon_{k}),$$

$$X_{k}=(\tilde{X}_{k},\tilde{X}_{k-1},\ldots,\tilde{X}_{k-p+1}),$$

$$Y_{k}=\mu_{\tilde{X}_{k}}+u_{k}\quad\text{with}\quad u_{k}=\sum_{\ell=1}^{p-1}\gamma_{\ell}u_{k-\ell}+\sigma\varepsilon_{k}\quad\text{for }p\geq 2,$$

$$Y_{k}$$

$$\Delta Y_{k}=\mu_{\tilde{X}_{k}}+\lambda\sum_{j=1}^{p-1}\tilde{X}_{k-j}+\varepsilon_{k}\quad\text{with}\quad\varepsilon_{k}\sim\text{i.i.d. }N(0,1),$$

$$Y_{k}=(\mu_{X_{k}}+\beta Y_{k-1})\varepsilon_{k},$$

$$p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x_{0})=\int\prod_{k=1}^{n}p_{\theta}(Y_{k},x_{k}|\overline{\bf Y}_{k-1},x_{k-1},W_{k})\,\mu^{\otimes n}(d{\bf x}_{1}^{n}),$$

$$p_{\theta}({\bf Y}_{1}^{k}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x_{0})$$

$$p_{\theta}({\bf Y}_{1}^{k}|\overline{\bf Y}_{0},{\bf W}_{0}^{n})$$

$$l_{n}(\theta,x_{0})$$

$$l_{n}(\theta)$$

$$l_{n}(\theta,\xi):=\log\left(\int p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x_{0})\,\xi(dx_{0})\right).$$

$$\left\|\int_{\mathcal{X}}\mathbb{P}_{\theta}(X_{k}\in\cdot\,|X_{-m}=x,\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})\,\mu_{1}(dx)-\int_{\mathcal{X}}\mathbb{P}_{\theta}(X_{k}\in\cdot\,|X_{-m}=x,\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})\,\mu_{2}(dx)\right\|_{TV}\leq\prod_{i=1}^{\lfloor(k+m)/p\rfloor}\left(1-\omega(\overline{\bf y}_{-m+pi-p}^{-m+pi-1},{\bf w}_{-m+pi-p}^{-m+pi-1})\right),$$

$$\omega(\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1}):=\frac{\sigma_{-}}{\sigma_{+}}\left(\frac{\inf_{\theta}\inf_{{\bf x}_{k-p+1}^{k-1}}\prod_{i=k-p+1}^{k-1}g_{\theta}(y_{i}|\overline{\bf y}_{i-1},x_{i},w_{i})}{\sup_{\theta}\sup_{{\bf x}_{k-p+1}^{k-1}}\prod_{i=k-p+1}^{k-1}g_{\theta}(y_{i}|\overline{\bf y}_{i-1},x_{i},w_{i})}\right)^{2}.$$

$$\hat{\theta}_{x_{0}}:=\arg\max_{\theta\in\Theta}l_{n}(\theta,x_{0}),$$

$$\mathbb{P}_{\theta^{*}}\left(b_{+}(\overline{\bf Y}_{0}^{1},W_{1})/b_{-}(\overline{\bf Y}_{0}^{1},W_{1})\geq C_{1}e^{\alpha r}\right)\leq C_{2}r^{-\beta}.$$

$$n^{-1}\sup_{x_{0}\in\mathcal{X}}\sup_{\theta\in\Theta}|l_{n}(\theta,x_{0})-l_{n}(\theta)|\to 0\quad\mathbb{P}_{\theta^{*}}\text{-a.s.}$$

$$\Delta_{k,m,x}(\theta)$$

$$\Delta_{k,m}(\theta)=\log\int p_{\theta}(Y_{k}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k},X_{-m}=x_{-m})\,\mathbb{P}_{\theta}(dx_{-m}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k}),$$

$$0=\nabla_{\theta}l_{n}(\hat{\theta}_{x_{0}},x_{0})=\nabla_{\theta}l_{n}(\theta^{*},x_{0})+\nabla_{\theta}^{2}l_{n}(\overline{\theta},x_{0})(\hat{\theta}_{x_{0}}-\theta^{*}),$$

$$\begin{aligned}
\nabla_{\theta}\log p_{\theta}({\bf Y}_{-m+1}^{k}|\overline{\bf Y}_{-m},{\bf W}_{-m}^{k},X_{-m})
&=\mathbb{E}_{\theta}\left[\nabla_{\theta}\log p_{\theta}({\bf Y}_{-m+1}^{k},{\bf X}_{-m+1}^{k}|\overline{\bf Y}_{-m},{\bf W}_{-m}^{k},X_{-m})\,\Big|\,\overline{\bf Y}_{-m}^{k},{\bf W}_{-m}^{k},X_{-m}\right]\\
&=\mathbb{E}_{\theta}\left[\sum_{t=-m+1}^{k}\nabla_{\theta}\log p_{\theta}(Y_{t},X_{t}|\overline{\bf Y}_{t-1},X_{t-1},W_{t})\,\Big|\,\overline{\bf Y}_{-m}^{k},{\bf W}_{-m}^{k},X_{-m}\right],
\end{aligned}$$

$$\phi^{j}(\theta,\overline{\bf Z}_{k-1}^{k},W_{k})=\nabla_{\theta}^{j}\log q_{\theta}(X_{k-1},X_{k})+\nabla_{\theta}^{j}\log g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},X_{k},W_{k}).$$

$$\Psi_{k,m,x}^{j}(\theta)$$

$$\nabla_{\theta}l_{n}(\theta,x_{0})=\sum_{k=1}^{n}\nabla_{\theta}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k},X_{0}=x_{0})=\sum_{k=1}^{n}\Psi_{k,0,x_{0}}^{1}(\theta).$$

$$\Psi_{k,m}^{j}(\theta)$$

$$\nabla_{\theta}^{2}\log p_{\theta}({\bf Y}_{-m+1}^{k}|\overline{\bf Y}_{-m},{\bf W}_{-m}^{k},X_{-m})$$

Taxonomy

Topics: Monetary Policy and Economic Impact · Financial Risk and Volatility Modeling · Stochastic processes and financial applications

Full text

**Asymptotic Properties of the Maximum Likelihood Estimator in Regime Switching Econometric Models**

This research was supported by the Natural Science and Engineering Research Council of Canada and JSPS KAKENHI Grant Number JP17K03653.

Hiroyuki Kasahara

Vancouver School of Economics

University of British Columbia

Katsumi Shimotsu

Faculty of Economics

University of Tokyo

Keywords: asymptotic distribution; autoregressive conditional heteroscedasticity; maximum likelihood estimator; Markov regime switching

JEL classification numbers: C12, C13, C22

1 Introduction

Since the seminal contribution of Hamilton (1989), Markov regime switching models have become a popular framework for applied empirical work because they can capture the important features of time series such as structural changes, nonlinearity, high persistence, fat tails, leptokurtosis, and asymmetric dependence (e.g., Evans and Wachtel, 1993; Hamilton and Susmel, 1994; Gray, 1996; Sims and Zha, 2006; Inoue and Okimoto, 2008; Ang and Bekaert, 2002; Okimoto, 2008; Dai et al., 2007). Surveys of applications of Markov regime switching models in economics and finance are provided by, for example, Hamilton (2008, 2016) and Ang and Timmermann (2012).

Consider the Markov regime switching model defined by a discrete-time stochastic process $\{Y_{k},X_{k}\}$ written as

$$Y_{k}=f_{\theta}(Y_{k-1},\ldots,Y_{k-s},X_{k};\varepsilon_{k}),\tag{1}$$

where $\{\varepsilon_{k}\}$ is an independent and identically distributed sequence of random variables, $\{Y_{k}\}$ is an inhomogeneous $s$-order Markov chain on a state space $\mathcal{Y}$ conditional on $X_{k}$ such that the conditional distribution of $Y_{k}$ depends only on $X_{k}$ and the lagged $Y$’s, $X_{k}$ is a first-order Markov process on a state space $\mathcal{X}$, and $f_{\theta}$ is a family of functions indexed by a finite-dimensional parameter $\theta\in\Theta$. In (1), the Markov chain $\{X_{k}\}$ is not observable.

Surprisingly, the asymptotic distribution of the maximum likelihood estimator (MLE) of the Markov regime switching model (1) has not been fully established in the existing literature. Bickel et al. (1998) and Jensen and Petersen (1999) derive the asymptotic normality of the MLE of hidden Markov models in which the conditional distribution of $Y_{k}$ depends on $X_{k}$ but not on the lagged $Y$ ’s. For hidden Markov models and Markov regime switching models with a finite state space, the consistency of the MLE has been proven by Leroux (1992), Francq and Roussignol (1998), and Krishnamurthy and Rydén (1998).

In an influential paper, Douc et al. (2004) [DMR hereafter] establish the consistency and asymptotic normality of the MLE in autoregressive Markov regime switching models (1) with a nonfinite hidden state space $\mathcal{X}$ under two assumptions. First, DMR assume that the conditional distribution of $Y_{k}$ does not depend on the lagged $X_{k}$ ’s. Specifically, on page 2259, DMR assume that

for each $n\geq 1$ and given $\{Y_{k}\}_{k=n-s}^{n-1}$ and $X_{n}$ , $Y_{n}$ is conditionally independent of $\{Y_{k}\}_{k=-s+1}^{n-s-1}$ and $\{X_{k}\}_{k=0}^{n-1}$ .

Second, DMR assume in their Assumption A1(a) that the transition density of $X_{k}$ is bounded away from 0.

These two assumptions together rule out regime switching models in which some elements of the transition probability matrix take the value of zero. However, empirical researchers often assume that some elements of the transition probability matrix are identically equal to zero when they estimate regime switching models with more than two regimes. For example, Kim et al. (2005) estimate a three-regime model of U.S. GDP growth in which some elements of the transition probability matrix are restricted to be zero because “the three regimes corresponding to expansion, recession and recovery always occur in that order” (see also Boldin (1996)). Similarly, Dahlquist and Gray (2000) estimate a three-regime model of short-term interest rates for France and Italy while restricting some elements of the transition probability matrix to be zero. David and Veronesi (2013) estimate a six-regime model of inflation, earnings growth, consumption growth, S&P 500 P/E ratios, the three-month Treasury bill rate, and one- and five-year Treasury bond yields in which the transition probability matrix is parameterized with five parameters and some elements are restricted to zero. Assumption A1(a) of DMR does not hold in these papers.

The two assumptions imposed by DMR also rule out models in which the conditional density of $Y_{k}$ depends on both the current and the lagged regimes. Suppose that we specify $X_{k}$ in (1) as

$$X_{k}=(\tilde{X}_{k},\tilde{X}_{k-1},\ldots,\tilde{X}_{k-p+1}),\tag{2}$$

where $p\geq 2$ , and $\widetilde{X}_{k}$ follows a first-order Markov process and is called the regime. Then, the transition density of $X_{k}$ inevitably has zeros. For example, when $p=2$ and $X_{k}=(\tilde{X}_{k},\tilde{X}_{k-1})$ , we have $\Pr\left(X_{k+1}=(i^{\prime},j^{\prime})|X_{k}=(i,j)\right)=0$ when $j^{\prime}\neq i$ . Consequently, the asymptotic distribution of the MLE has not been proven for some popular Markov regime switching models including the seminal model of Hamilton (1989) and switching ARCH (SWARCH) model of Hamilton and Susmel (1994).
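The zeros described above can be made concrete with a small sketch. It assumes a two-regime chain with illustrative transition probabilities (the matrix `P` and the helper names `stacked_transition` and `matmul` are hypothetical, not taken from the paper) and builds the transition matrix of the stacked state $X_{k}=(\tilde{X}_{k},\tilde{X}_{k-1})$ for $p=2$. The one-step matrix contains structural zeros, whereas the two-step matrix is strictly positive.

```python
from itertools import product


def stacked_transition(P):
    """Transition matrix of X_k = (regime_k, regime_{k-1}) built from the
    regime transition matrix P, where P[i][j] = Pr(regime_k = j | regime_{k-1} = i)."""
    M = len(P)
    states = list(product(range(M), repeat=2))  # each state is (current, lagged)
    Q = [[0.0] * len(states) for _ in states]
    for a, (i, j) in enumerate(states):
        for b, (i_next, j_next) in enumerate(states):
            # The new lagged regime must equal the old current regime;
            # every other transition has probability zero.
            if j_next == i:
                Q[a][b] = P[i][i_next]
    return states, Q


def matmul(A, B):
    """Plain matrix product for lists of lists."""
    return [[sum(A[r][t] * B[t][c] for t in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]


# Illustrative transition probabilities (an assumption, not from the paper).
P = [[0.9, 0.1], [0.2, 0.8]]
states, Q = stacked_transition(P)
Q2 = matmul(Q, Q)

num_zeros_one_step = sum(q == 0.0 for row in Q for q in row)
min_two_step = min(q for row in Q2 for q in row)
print(num_zeros_one_step, min_two_step > 0.0)
```

In this sketch every entry of the regime matrix `P` is positive, so the zeros in `Q` come only from stacking the current and lagged regimes.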

Example 1 (Hamilton (1989)).

Consider the following model:

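$$Y_{k}=\mu_{\tilde{X}_{k}}+u_{k}\quad\text{with}\quad u_{k}=\sum_{\ell=1}^{p-1}\gamma_{\ell}u_{k-\ell}+\sigma\varepsilon_{k}\quad\text{for }p\geq 2,\tag{3}$$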

where $\varepsilon_{k}\sim$ i.i.d. $N(0,1)$ and $\tilde{X}_{k}$ follows a Markov chain on $\tilde{\mathcal{X}}=\{1,2,\ldots,M\}$ with $\Pr(\tilde{X}_{k}=j|\tilde{X}_{k-1}=i)=p_{ij}$ , where $M$ represents the number of regimes. Hamilton (1989) estimates model (3) with $M=2$ and $p=5$ by using data on U.S. real GNP growth.

McConnell and Perez-Quiros (2000) and Camacho and Perez-Quiros (2007) estimate an augmented version of model (3) that allows the standard deviation parameter $\sigma$ in (3) to be regime-dependent.

Example 2 (SWARCH model of Hamilton and Susmel (1994)).

Consider the following model:

[TABLE]

where $\varepsilon_{k}\sim$ i.i.d. $N(0,1)$ or Student $t$ with $v$ degrees of freedom, and $\tilde{X}_{k}$ follows a Markov chain on $\tilde{\mathcal{X}}=\{1,2,\ldots,M\}$ with $\Pr(\tilde{X}_{k}=j|\tilde{X}_{k-1}=i)=p_{ij}$.

Example 3 (Bounce-back effect model of Kim et al. (2005)).

Consider the following model:

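$$\Delta Y_{k}=\mu_{\tilde{X}_{k}}+\lambda\sum_{j=1}^{p-1}\tilde{X}_{k-j}+\varepsilon_{k}\quad\text{with}\quad\varepsilon_{k}\sim\text{i.i.d. }N(0,1),$$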

where $\tilde{X}_{k}$ follows a Markov chain on $\tilde{\mathcal{X}}=\{1,2,\ldots,M\}$ with $\Pr(\tilde{X}_{k}=j|\tilde{X}_{k-1}=i)=p_{ij}$ . Kim et al. (2005) use this model with $p=6$ to capture the post-recession “bounce-back” effect in U.S. quarterly GDP.

In Examples 1–3, the transition probability of $X_{k}=(\tilde{X}_{k},\ldots,\tilde{X}_{k-p+1})^{\prime}$ has zeros when $p\geq 2$. Therefore, Assumption A1(a) of DMR is violated. As discussed on pages 2257–2258 of DMR, Assumption A1(a) is crucial for their Corollary 1 (page 2262), which establishes the deterministic geometrically decaying bound on the mixing rate of the conditional chain, $X|Y$. As DMR recognize on page 2258, the deterministic nature of this bound is vital to their proof of the asymptotic normality of the MLE.

This paper shows the consistency and asymptotic normality of the MLE of the Markov regime switching model in which some elements of the transition probability matrix of $X_{k}$ are zero. To the best of our knowledge, there exists no rigorous proof in the literature of the asymptotic normality of the MLE of these regime switching models, even though empirical researchers often assume that some elements of the transition probability matrix are zero and the models of Hamilton (1989) and Hamilton and Susmel (1994) are popular in applied work. This gap in the literature is important because empirical researchers regularly make inferences based on the presumed asymptotic normality (e.g., Goodwin, 1993; Garcia and Perron, 1996; Hamilton and Lin, 1996; Fong, 1997; Ramchand and Susmel, 1998; Maheu and McCurdy, 2000; McConnell and Perez-Quiros, 2000; Edwards and Susmel, 2001; Camacho and Perez-Quiros, 2007). This paper therefore provides the theoretical basis for the statistical inferences associated with these models.

To derive the asymptotic normality of the MLE, we first establish a bound on the mixing rate of the conditional chain, $X|Y$, in Lemma 1. Our bound is written as a product of random variables, where all but finitely many of them are strictly less than 1. Consequently, the mixing rate of the conditional chain is geometrically decaying almost surely. We then use this mixing rate to show that the sequences of conditional scores and conditional Hessians given the $m$ past periods converge to the conditional score and conditional Hessian given the “infinite past” as $m\to\infty$. Given these results, we show the asymptotic normality of the MLE under standard regularity assumptions by applying a martingale central limit theorem to the score function (Proposition 2) and by proving a uniform law of large numbers for the observed Fisher information (Proposition 3). These results extend those in DMR to an empirically important class of models in which the transition density takes the value of zero. Another feature of the present study is that we introduce an additional weakly exogenous regressor, $W_{k}$.

We also relax the assumption in DMR on the regime-specific density. DMR assume that the regime-specific density is uniformly bounded with respect to $X_{k}$ and $\theta$ , whereas we only assume the existence of the first moment of the supremum of the logarithm of the regime-specific density with respect to $X_{k}$ and $\theta$ . Unbounded densities are used in other analyses (e.g., Zinde-Walsh, 2008) and empirical studies.

Example 4 (Markov regime-switching conditional duration (MS-CD) model).

Consider the following model:

$$Y_{k}=(\mu_{X_{k}}+\beta Y_{k-1})\varepsilon_{k},\tag{4}$$

where $\varepsilon_{k}$ follows the standardized Weibull distribution with $E(\varepsilon_{k})=1$ , $\mu_{X_{k}}$ is a regime-dependent parameter, and $X_{k}\in\{1,2,\ldots,M\}$ is a regime in period $k$ . The density of $Y_{k}$ conditional on $Y_{k-1}$ and $X_{k}$ is given by $g_{\theta}(y_{k}|Y_{k-1},X_{k})=\frac{\gamma}{\lambda_{k}(Y_{k-1},X_{k})}\left(\frac{y_{k}}{\lambda_{k}(Y_{k-1},X_{k})}\right)^{\gamma-1}\exp\left\{-\left(\frac{y_{k}}{\lambda_{k}(Y_{k-1},X_{k})}\right)^{\gamma}\right\}$ , where $\lambda_{k}(Y_{k-1},X_{k})=\frac{\mu_{X_{k}}+\beta Y_{k-1}}{\Gamma(1+1/\gamma)}$ . In this model, the regime-specific density is unbounded when $\gamma<1$ .
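As a rough numerical illustration of the unboundedness noted in Example 4, the following sketch evaluates the Weibull density $g_{\theta}$ given above near $y_{k}=0$. The function name `mscd_density` and all parameter values are assumptions chosen for illustration; they are not estimates from the paper.

```python
import math


def mscd_density(y, y_lag, mu, beta, gamma):
    """Regime-specific density g(y | y_lag, regime) of the MS-CD model,
    where mu is the intercept of the current regime."""
    lam = (mu + beta * y_lag) / math.gamma(1.0 + 1.0 / gamma)
    z = y / lam
    return (gamma / lam) * z ** (gamma - 1.0) * math.exp(-(z ** gamma))


# Illustrative values (assumptions, not taken from the paper).
mu, beta, y_lag = 0.5, 0.05, 1.0
grid = (1e-2, 1e-4, 1e-8)

# With gamma < 1 the density keeps growing as y approaches zero ...
near_zero = [mscd_density(y, y_lag, mu, beta, gamma=0.9) for y in grid]
# ... whereas with gamma > 1 it shrinks toward zero.
bounded = [mscd_density(y, y_lag, mu, beta, gamma=1.2) for y in grid]
print(near_zero, bounded)
```

The factor $(y_{k}/\lambda_{k})^{\gamma-1}$ drives the behavior: its exponent is negative when $\gamma<1$, so the density diverges at the origin.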

Our simulations based on the model (4) show that the asymptotic distribution provides a good approximation of the finite sample behavior of the MLE even when the regime-specific density is unbounded.

In regime switching models, testing for the number of regimes (number of elements in $\mathcal{X}$ ) has been an unsolved problem because the standard asymptotic analysis of the likelihood ratio test statistic (LRTS) breaks down. In testing the null hypothesis of no regime switching, the asymptotic behavior of the LRTS has been investigated by Hansen (1992) and Garcia (1998); Carrasco et al. (2014) propose an information matrix-type test and Cho and White (2007) derive the asymptotic distribution of the quasi-LRTS. Recently, Qu and Zhuo (2017) derive the asymptotic distribution of the LRTS of testing the null hypothesis of no regime switching under some restrictions on the transition probabilities of regimes and Rabah (2012) compares the finite sample performance of the bootstrapped LRTS with the test of Carrasco et al. (2014). Kasahara and Shimotsu (2018) derive the asymptotic distribution of the LRTS for testing the null hypothesis of $M$ regimes against the alternative hypothesis of $M+1$ regimes for any $M\geq 1$ and show the asymptotic validity of the parametric bootstrap.

The remainder of this paper is organized as follows. Section 2 introduces the notation, model, and assumptions. Section 3 derives the bound on the mixing rate of the conditional chain, $X|Y$ . Section 4 derives the consistency of the MLE, and the asymptotic normality of the MLE is shown in Section 5. Section 6 reports the simulation results. Appendix A collects the proofs and Appendix B collects the auxiliary results.

2 Model and assumptions

Our notation largely follows the notation in DMR. Let $:=$ denote “equals by definition.” For a $k\times 1$ vector $x=(x_{1},\ldots,x_{k})^{\prime}$ and a matrix $B$ , define $|x|:=\sqrt{x^{\prime}x}$ and $|B|:=\sqrt{\lambda_{\max}(B^{\prime}B)}$ , where $\lambda_{\max}(B^{\prime}B)$ denotes the largest eigenvalue of $B^{\prime}B$ . For a $k\times 1$ vector $a=(a_{1},\ldots,a_{k})^{\prime}$ and a function $f(a)$ , let $\nabla_{a}^{2}f(a):=\nabla_{aa^{\prime}}f(a)$ . For two probability measures $\mu_{1}$ and $\mu_{2}$ , the total variation distance between $\mu_{1}$ and $\mu_{2}$ is defined as $\|\mu_{1}-\mu_{2}\|_{TV}:=\sup_{A}|\mu_{1}(A)-\mu_{2}(A)|$ . $\|\cdot\|_{TV}$ satisfies $\sup_{f(x):0\leq f(x)\leq 1}|\int f(x)\mu_{1}(dx)-\int f(x)\mu_{2}(dx)|=\|\mu_{1}-\mu_{2}\|_{TV}$ and $\sup_{f(x):\max_{x}|f(x)|\leq 1}|\int f(x)\mu_{1}(dx)-\int f(x)\mu_{2}(dx)|=2\|\mu_{1}-\mu_{2}\|_{TV}$ for any two probability measures $\mu_{1}$ and $\mu_{2}$ (e.g., Levin et al. (2009, Proposition 4.5)). Let $\mathbb{I}\{A\}$ denote an indicator function that takes the value of one when $A$ is true and zero otherwise. For a metric space $\mathcal{A}$ , let $\mathcal{A}^{k}$ denote the $k$ -fold product space of $\mathcal{A}$ . $\mathcal{C}$ denotes a generic finite positive constant whose value may change from one expression to another. Let $a\vee b:=\max\{a,b\}$ and $a\wedge b:=\min\{a,b\}$ . Let $\lfloor x\rfloor$ denote the largest integer less than or equal to $x$ , and define $(x)_{+}:=\max\{x,0\}$ . For any $\{x_{i}\}$ , we define $\sum_{i=a}^{b}x_{i}:=0$ and $\prod_{i=a}^{b}x_{i}:=1$ when $b<a$ . “i.o.” stands for “infinitely often.” All the limits below are taken as $n\to\infty$ unless stated otherwise.
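For a finite state space, the total variation identities above can be checked directly. The sketch below is an illustration under that assumption (the helper names and the two probability vectors are hypothetical): it computes $\|\mu_{1}-\mu_{2}\|_{TV}$ as half the $L_{1}$ distance and compares it with a brute-force supremum over all subsets $A$, which is the same as the supremum over 0/1-valued functions $f$.

```python
from itertools import product


def tv_distance(mu1, mu2):
    """Total variation distance sup_A |mu1(A) - mu2(A)| on a finite space,
    computed as half the L1 distance between the probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu1, mu2))


def sup_over_indicators(mu1, mu2):
    """Brute-force supremum of |mu1(A) - mu2(A)| over all subsets A."""
    best = 0.0
    for f in product((0, 1), repeat=len(mu1)):
        gap = abs(sum(fi * (a - b) for fi, a, b in zip(f, mu1, mu2)))
        best = max(best, gap)
    return best


def sup_over_signs(mu1, mu2):
    """Supremum of |sum f (mu1 - mu2)| over functions with |f| <= 1,
    attained at f = sign(mu1 - mu2)."""
    return sum(abs(a - b) for a, b in zip(mu1, mu2))


mu1 = [0.5, 0.3, 0.2]
mu2 = [0.2, 0.3, 0.5]
tv = tv_distance(mu1, mu2)
print(tv, sup_over_indicators(mu1, mu2), sup_over_signs(mu1, mu2))
```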

We consider the Markov regime switching process defined by a discrete-time stochastic process $\{(X_{k},Y_{k},W_{k})\}$ , where $(X_{k},Y_{k},W_{k})$ takes the values in a set $\mathcal{X}\times\mathcal{Y}\times\mathcal{W}$ with the associated Borel $\sigma$ -field $\mathcal{B}(\mathcal{X}\times\mathcal{Y}\times\mathcal{W})$ . We use $p_{\theta}(\cdot)$ to denote densities with respect to the probability measure on $\mathcal{B}(\mathcal{X}\times\mathcal{Y}\times\mathcal{W})^{\otimes\mathbb{Z}}$ . For a stochastic process $\{U_{k}\}$ and $a<b$ , define ${\bf U}_{a}^{b}:=(U_{a},U_{a+1},\ldots,U_{b})$ . Denote $\overline{\bf Y}_{k-1}:=(Y_{k-1},\ldots,Y_{k-s})$ for a fixed integer $s$ and $\overline{\bf Y}^{b}_{a}:=(\overline{\bf Y}_{a},\overline{\bf Y}_{a+1},\ldots,\overline{\bf Y}_{b})$ . Define $Z_{k}:=(X_{k},\overline{\bf Y}_{k})$ . Let $Q_{\theta}(x,A):=\mathbb{P}_{\theta}(X_{k}\in A|X_{k-1}=x)$ denote the transition kernel of $\{X_{k}\}_{k=0}^{\infty}$ . Let $Q_{\theta}^{r}(x,A):=\mathbb{P}_{\theta}(X_{k}\in A|X_{k-r}=x)$ denote the $r$ -step transition kernel of $\{X_{k}\}_{k=0}^{\infty}$ .

We now introduce our assumptions, which mainly follow the assumptions in DMR.

Assumption 1.

(a) The parameter $\theta$ belongs to $\Theta$ , a compact subset of $\mathbb{R}^{q}$ , and the true parameter value $\theta^{*}$ lies in the interior of $\Theta$ . (b) $\{X_{k}\}_{k=0}^{\infty}$ is a Markov chain that lies in a compact set $\mathcal{X}\subset\mathbb{R}^{d_{x}}$ . (c) For all $\theta\in\Theta$ , $Q_{\theta}(x,\cdot)$ and $Q_{\theta}^{r}(x,\cdot)$ have densities $q_{\theta}(x,\cdot)$ and $q_{\theta}^{r}(x,\cdot)$ , respectively, with respect to a finite dominating measure $\mu$ on $\mathcal{B}(\mathcal{X})$ such that $\mu(\mathcal{X})=1$ , and $\sigma_{+}^{0}:=\sup_{\theta\in\Theta}\sup_{x,x^{\prime}\in\mathcal{X}}q_{\theta}(x,x^{\prime})<\infty$ . (d) There exists a finite $p\geq 1$ such that $0<\sigma_{-}:=\inf_{\theta\in\Theta}\inf_{x_{k-p},x_{k}\in\mathcal{X}}q_{\theta}^{p}(x_{k-p},x_{k})$ and $\sigma_{+}:=\sup_{\theta\in\Theta}\sup_{x_{k-p},x_{k}\in\mathcal{X}}q_{\theta}^{p}(x_{k-p},x_{k})<\infty$ . (e) $\{(Y_{k},W_{k})\}_{k=-s+1}^{\infty}$ takes the values in a set $\mathcal{Y}\times\mathcal{W}\subset\mathbb{R}^{d_{y}}\times\mathbb{R}^{d_{w}}$ .

Assumption 2.

(a) For each $k\geq 1$ , $X_{k}$ is conditionally independent of $({\bf X}_{0}^{k-2},\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{\infty})$ given $X_{k-1}$ . (b) For each $k\geq 1$ , $Y_{k}$ is conditionally independent of $({\bf Y}^{k-s-1}_{-s+1},{\bf X}_{0}^{k-1},{\bf W}_{0}^{k-1},{\bf W}_{k+1}^{\infty})$ given $(\overline{\bf Y}_{k-1},X_{k},W_{k})$ , and the model of the conditional distribution of $Y_{k}$ has a density $g_{\theta}(y_{k}|\overline{\bf Y}_{k-1},X_{k},W_{k})$ with respect to a $\sigma$ -finite measure $\nu$ on $\mathcal{B}(\mathcal{Y})$ . (c) ${\bf W}_{1}^{\infty}$ is conditionally independent of $(\overline{\bf Y}_{0},X_{0})$ given $W_{0}$ . (d) $\{(Z_{k},W_{k})\}_{k=0}^{\infty}$ is a strictly stationary ergodic process.

Assumption 3.

For all $y^{\prime}\in\mathcal{Y}$ , $\overline{\bf y}\in\mathcal{Y}^{s}$ , and $w\in\mathcal{W}$ , $0<\inf_{\theta\in\Theta}\inf_{x\in\mathcal{X}}g_{\theta}(y^{\prime}|\overline{\bf y},x,w)$ and $\sup_{\theta\in\Theta}\sup_{x\in\mathcal{X}}g_{\theta}(y^{\prime}|\overline{\bf y},x,w)<\infty$ .

Assumption 1(c) is also assumed on page 2258 of DMR. This assumption excludes the case where $\mathcal{X}=\mathbb{R}$ and $\mu$ is the Lebesgue measure but allows for continuously distributed $X_{k}$ with finite support. If multiple $p$’s satisfy Assumption 1(d), we define $p$ as the minimum of such $p$’s. Assumption 1(d) implies that the state space $\mathcal{X}$ of the Markov chain $\{X_{k}\}$ is $\nu_{p}$-small for some nontrivial measure $\nu_{p}$ on $\mathcal{B}(\mathcal{X})$. Therefore, for all $\theta\in\Theta$, the chain $\{X_{k}\}$ is aperiodic, has a unique invariant distribution, and is uniformly ergodic (Meyn and Tweedie, 2009, Theorem 16.0.2). Assumptions 2(a) and (b) imply that $Z_{k}$ is conditionally independent of $({\bf Z}_{0}^{k-2},{\bf W}_{0}^{k-1},{\bf W}_{k+1}^{\infty})$ given $(Z_{k-1},W_{k})$; hence, $\{Z_{k}\}_{k=0}^{\infty}$ is a Markov chain on $\mathcal{Z}:=\mathcal{X}\times\mathcal{Y}^{s}$ given $\{W_{k}\}_{k=0}^{\infty}$. Under Assumptions 2(a)–(c), the conditional density of ${\bf Z}_{0}^{n}$ given ${\bf W}_{0}^{n}$ is written as $p_{\theta}({\bf Z}_{0}^{n}|{\bf W}_{0}^{n})=p_{\theta}(Z_{0}|W_{0})\prod_{k=1}^{n}p_{\theta}(Z_{k}|Z_{k-1},W_{k})$. Because $\{(Z_{k},W_{k})\}_{k=0}^{\infty}$ is stationary, we extend $\{(Z_{k},W_{k})\}_{k=0}^{\infty}$ to a stationary process $\{(Z_{k},W_{k})\}_{k=-\infty}^{\infty}$ with doubly infinite time. We denote the probability and associated expectation of $\{(Z_{k},W_{k})\}_{k=-\infty}^{\infty}$ under stationarity by $\mathbb{P}_{\theta}$ and $\mathbb{E}_{\theta}$, respectively. (Footnote 1: DMR use $\overline{\mathbb{P}}_{\theta}$ and $\overline{\mathbb{E}}_{\theta}$ to denote the probability and expectation under stationarity because their Section 7 deals with the case when $Z_{0}$ is drawn from an arbitrary distribution. Because we assume $\{(Z_{k},W_{k})\}_{k=-\infty}^{\infty}$ is stationary throughout this paper, we use notations such as $\mathbb{P}_{\theta}$ and $\mathbb{E}_{\theta}$ without an overline for simplicity.) Assumption 3 is stronger than Assumption A1(b) in DMR, which assumes only $0<\inf_{\theta\in\Theta}\int_{x\in\mathcal{X}}g_{\theta}(y^{\prime}|\overline{\bf y},x)\mu(dx)$ and $\sup_{\theta\in\Theta}\int_{x\in\mathcal{X}}g_{\theta}(y^{\prime}|\overline{\bf y},x)\mu(dx)<\infty$. When $\mathcal{X}$ is finite, Assumption 3 becomes identical to Assumption A3 of Francq and Roussignol (1998), who prove the consistency of the MLE when $\mathcal{X}$ is finite. It appears that assuming a lower bound on $g_{\theta}$ similar to Assumption 3 is necessary to derive the asymptotics of the MLE when $\inf_{\theta}\inf_{x,x^{\prime}}q_{\theta}(x,x^{\prime})=0$. When $p=1$, we could weaken Assumption 3 to Assumption A1(b) in DMR, but we retain Assumption 3 to simplify the exposition and proof.
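To see what the minimal $p$ in Assumption 1(d) looks like in a simple case, consider a finite-state chain at a single fixed parameter value. The sketch below is only an illustration: the function `minimal_p`, the cap `max_p`, and both transition matrices are hypothetical, and the infimum over $\theta\in\Theta$ required by the assumption is not examined. It searches for the smallest $p$ such that every entry of the $p$-step transition matrix is positive.

```python
def matmul(A, B):
    """Plain matrix product for lists of lists."""
    return [[sum(A[r][t] * B[t][c] for t in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]


def minimal_p(P, max_p=20):
    """Smallest p such that every entry of the p-step matrix P^p is positive,
    or None if no such p <= max_p exists."""
    power = P
    for p in range(1, max_p + 1):
        if all(entry > 0.0 for row in power for entry in row):
            return p
        power = matmul(power, P)
    return None


# Three regimes that can only be left in a fixed order (0 -> 1 -> 2 -> 0).
# The numbers are illustrative assumptions.
P_cycle = [[0.9, 0.1, 0.0],
           [0.0, 0.8, 0.2],
           [0.3, 0.0, 0.7]]
# A two-regime matrix without zeros.
P_dense = [[0.9, 0.1],
           [0.2, 0.8]]

print(minimal_p(P_cycle), minimal_p(P_dense))
```

For the cyclic matrix, one step is not enough because three entries are zero, but every regime can be reached from every other regime within two steps.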

DMR assume $p=1$ in Assumption 1(d), meaning that the transition density $q_{\theta}(x,x^{\prime})$ of the state variable $X_{k}$ is uniformly bounded from below. DMR show that this lower bound on $q_{\theta}(x,x^{\prime})$ translates into a deterministic lower bound on the conditional transition density of $X_{k}$ given the observations of $\{(\overline{\bf Y}_{k},W_{k})\}_{k=0}^{n}$ . Owing to this deterministic lower bound, the chain $\{X_{k}\}$ given $\{(\overline{\bf Y}_{k},W_{k})\}_{k=0}^{n}$ is geometrically mixing, and, consequently, the derivatives of the log-densities are also geometrically mixing and follow the law of large numbers and central limit theorem.

When $p\geq 2$ , this lower bound is no longer deterministic and depends on the $Y_{k}$ ’s. For example, suppose that $\mathcal{X}=\{-1,0,1\}$ , which correspond to “recession,” “normal,” and “expansion” periods, respectively, $\mathbb{P}(X_{k}=1|X_{k-1}=-1)=0$ , and $Y_{k}|(X_{k},Y_{k-1})\sim N(0.6Y_{k-1}+X_{k},1)$ . Then, observing a negative value of $Y_{k-1}$ implies that the likely value of $X_{k-1}$ is $-1$ , which in turn implies that the event $\{X_{k}=1\}$ is unlikely. As $Y_{k-1}$ approaches negative infinity, $\mathbb{P}(X_{k}=1|Y_{k},Y_{k-1})$ approaches zero and no lower bound on the transition density of $X_{k-1}$ exists given $(Y_{k},Y_{k-1})$ .

We overcome the zero lower bound on $q_{\theta}(x,x^{\prime})$ by noting that, in many econometric models, only extreme values of $Y_{k-1}$ provide a strong signal on the value of $X_{k-1}$ . Because $Y_{k-1}$ takes such extreme values with a small probability, the transition probability of the chain $\{X_{k}\}$ given $\{(\overline{\bf Y}_{k},W_{k})\}_{k=0}^{n}$ is bounded from below by a stochastic lower bound whose value is close to zero only with a small probability. As a result, the chain $\{X_{k}\}$ given $\{(\overline{\bf Y}_{k},W_{k})\}_{k=0}^{n}$ is geometrically mixing with a probability close to one.

Following DMR, we analyze the conditional log-likelihood function given $\overline{\bf Y}_{0}$ , ${\bf W}_{0}^{n}$ , and $X_{0}=x_{0}$ rather than the stationary log-likelihood function given $\overline{\bf Y}_{0}$ and ${\bf W}_{0}^{n}$ because, as explained in DMR (pages 2263–2264), the conditional initial density $p_{\theta}(X_{0}|\overline{\bf Y}_{0}^{k-1})$ cannot be easily computed in practice. The conditional density function of ${\bf Y}_{1}^{n}$ is

[TABLE]

where $p_{\theta}(y_{k},x_{k}|\overline{\bf{y}}_{k-1},x_{k-1},w_{k})=q_{\theta}(x_{k-1},x_{k})g_{\theta}(y_{k}|\overline{\bf{y}}_{k-1},x_{k},w_{k})$ . Assumptions 2(a)–(c) imply that for $k\geq 1$ , $W_{k}$ is conditionally independent of ${\bf Z}_{0}^{k-1}$ given ${\bf W}_{0}^{k-1}$ because $p(W_{k}|{\bf Z}_{0}^{k-1},{\bf W}_{0}^{k-1})=p({\bf W}_{0}^{k},{\bf Z}_{0}^{k-1})/p({\bf W}_{0}^{k-1},{\bf Z}_{0}^{k-1})$ and for $j=k,k-1$ , $p({\bf W}_{0}^{j},{\bf Z}_{0}^{k-1})=p(Z_{0},{\bf W}_{0}^{j})\prod_{t=1}^{k-1}p(Z_{t}|Z_{t-1},W_{t})=p({\bf W}_{1}^{j}|W_{0})p(Z_{0}|W_{0})\prod_{t=1}^{k-1}p(Z_{t}|Z_{t-1},W_{t})$ . Therefore, for $1\leq k\leq n$ , we have

[TABLE]

In view of (6) and (7), we can write the conditional and stationary log-likelihood functions as

[TABLE]

Many applications use the log-likelihood function in which the conditional density $p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x_{0})$ is integrated with respect to $x_{0}$ over a probability measure $\xi$ on $\mathcal{B}(\mathcal{X})$ , where $\xi$ can be fixed or treated as an additional parameter. We also analyze the resulting objective function:

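$$l_{n}(\theta,\xi):=\log\int p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x_{0})\,\xi(dx_{0}).\qquad(9)$$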

3 Uniform forgetting of the conditional hidden Markov chain

In this section, we establish a mixing rate of the conditional hidden Markov chain, which is the chain $\{X_{k}\}_{k=-m}^{n}$ given $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n})$ . The bounds on this mixing rate are instrumental in deriving the asymptotic properties of the MLE. The following lemma bounds the distance between the distributions of $X_{k}$ given $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n})$ when starting from two different initial distributions $\mu_{1}(\cdot)$ and $\mu_{2}(\cdot)$ of $X_{-m}$ . In other words, this lemma provides the rate at which the conditional hidden Markov chain forgets its past. This lemma generalizes Corollary 1 of DMR, which shows that the conditional hidden Markov chain forgets its past at a deterministic exponential rate when $p=1$ . As DMR note on page 2258, their deterministic rate holds only when $p=1$ .

Lemma 1.

Assume Assumptions 1–3. Let $m,n\in\mathbb{Z}$ with $-m\leq n$ and $\theta\in\Theta$ . Then, for all $-m\leq k\leq n$ , all probability measures $\mu_{1}$ and $\mu_{2}$ on $\mathcal{B}(\mathcal{X})$ , and all $(\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})$ ,

[TABLE]

where $\omega(\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1}):=\sigma_{-}/\sigma_{+}$ when $p=1$ , and, when $p\geq 2$ ,

[TABLE]

(Footnote 2: Strictly speaking, $w_{k-p}$ in $\omega(\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1})$ is superfluous because $\omega(\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1})$ does not depend on $w_{k-p}$ . We retain $w_{k-p}$ for notational simplicity.)

The convergence rate of the conditional hidden Markov chain depends on the minorization coefficient $\omega(\overline{\bf Y}_{k-p}^{k-1},{\bf W}_{k-p}^{k-1})$ . If this coefficient is bounded away from 0, the chain forgets its past exponentially fast. When $p\geq 2$ , this coefficient is not necessarily bounded away from 0 because $\inf_{\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1}}\omega(\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1})$ may equal 0. However, $\omega(\overline{\bf Y}_{k-p}^{k-1},{\bf W}_{k-p}^{k-1})$ becomes close to zero only when ${\bf Y}_{k-p+1}^{k-1}$ takes an unlikely value because the denominator of $\omega(\overline{\bf Y}_{k-p}^{k-1},{\bf W}_{k-p}^{k-1})$ is finite and its numerator is a product of conditional densities $g_{\theta}(y|\overline{\bf y},x,w)$ . As a result, $\omega(\overline{\bf Y}_{k-p}^{k-1},{\bf W}_{k-p}^{k-1})$ is bounded away from 0 with a probability close to 1. In the following sections, we use this fact to establish the consistency and asymptotic normality of the MLE.
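The forgetting property can be illustrated numerically. The sketch below is an assumption-laden illustration rather than the object studied in Lemma 1: it tracks the forward filter of a finite-state model instead of the chain conditioned on the full sample, and the transition matrix (which contains two zeros), the regime means, the unit variance, and the seed are all hypothetical. Two filters are started from opposite point masses, and the total variation distance between them (here normalized as half the $\ell_{1}$ distance) is recorded at each step.

```python
import math
import random

# Assumed three-regime chain with two zero transition probabilities.
Q = [[0.7, 0.3, 0.0],
     [0.2, 0.6, 0.2],
     [0.0, 0.3, 0.7]]
MEANS = [-1.0, 0.0, 1.0]  # assumed regime means; observations are N(mean, 1)


def simulate(n, seed=0):
    """Draw n observations from the assumed regime switching model."""
    rng = random.Random(seed)
    x, ys = 1, []
    for _ in range(n):
        x = rng.choices(range(3), weights=Q[x])[0]
        ys.append(MEANS[x] + rng.gauss(0.0, 1.0))
    return ys


def filter_step(pi, y):
    """One prediction-update step of the forward filter."""
    pred = [sum(pi[i] * Q[i][j] for i in range(3)) for j in range(3)]
    post = [pred[j] * math.exp(-0.5 * (y - MEANS[j]) ** 2) for j in range(3)]
    total = sum(post)
    return [p / total for p in post]


def tv_path(ys, pi_a, pi_b):
    """Total variation distance between two filters started from pi_a and pi_b."""
    path = [0.5 * sum(abs(a - b) for a, b in zip(pi_a, pi_b))]
    for y in ys:
        pi_a, pi_b = filter_step(pi_a, y), filter_step(pi_b, y)
        path.append(0.5 * sum(abs(a - b) for a, b in zip(pi_a, pi_b)))
    return path


ys = simulate(300)
path = tv_path(ys, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
print(path[0], path[10], path[50], path[-1])
```

The per-step contraction depends on the realized observations, which mirrors the role of the data-dependent coefficient $\omega$ discussed above.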

4 Consistency of the MLE

Define the conditional MLE of $\theta^{*}$ given $\overline{\bf Y}_{0}$ , ${\bf W}_{0}^{n}$ , and $X_{0}=x_{0}$ as

$$\hat{\theta}_{x_{0}}:=\operatorname*{\arg\!\max}_{\theta\in\Theta}l_{n}(\theta,x_{0}),$$

with $l_{n}(\theta,x_{0})$ defined in (8). In this section, we prove the consistency of the conditional MLE. We introduce additional assumptions required for proving consistency.

Assumption 4.

(a) $\mathbb{E}_{\theta^{*}}|\log b_{+}(\overline{\bf Y}_{0}^{1},W_{1})|<\infty$ , where $b_{+}(\overline{\bf Y}_{k-1}^{k},W_{k}):=\sup_{\theta\in\Theta}\sup_{x_{k}\in\mathcal{X}}g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})$ . (b) $\mathbb{E}_{\theta^{*}}|\log b_{-}(\overline{\bf Y}_{0}^{1},W_{1})|<\infty$ , where $b_{-}(\overline{\bf Y}_{k-1}^{k},W_{k}):=\inf_{\theta\in\Theta}\inf_{x_{k}\in\mathcal{X}}g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})$ .

Assumption 5.

There exist constants $\alpha>0$ , $C_{1},C_{2}\in(0,\infty)$ , and $\beta>1$ such that, for any $r>0$ ,

[TABLE]

Assumption 4(a) relaxes Assumption (A3) of DMR, who assume that $\sup_{\theta\in\Theta}\sup_{y_{1},\overline{\bf y}_{0},x,w}g_{\theta}(y_{1}|\overline{\bf y}_{0},x,w)<\infty$ and hence that the density is uniformly bounded. Assumption 4(b) is stronger than Assumption (A3) of DMR, who assume $\mathbb{E}_{\theta^{*}}|\log(\inf_{\theta\in\Theta}\int g_{\theta}(Y_{1}|\overline{\bf Y}_{0},x)\mu(dx))|<\infty$ . Assumption 4 implies that $\mathbb{E}_{\theta^{*}}\sup_{\theta\in\Theta}\sup_{x\in\mathcal{X}}|\log(g_{\theta}(Y_{1}|\overline{\bf Y}_{0},x,W_{1}))|<\infty$ , which is similar to the moment condition used in standard maximum likelihood estimation, except that the supremum is taken over $x$ in addition to $\theta$ . Assumption 5 restricts the probability that $\sup_{\theta\in\Theta}\sup_{x_{k}\in\mathcal{X}}g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})/\inf_{\theta\in\Theta}\inf_{x_{k}\in\mathcal{X}}g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})$ takes an extremely large value. Assumption 5 is not restrictive because the right hand side of the inequality inside $\mathbb{P}_{\theta^{*}}(\cdot)$ is exponential in $r$ , whereas the bound $C_{2}r^{-\beta}$ is polynomial in $r$ . An easily verifiable sufficient condition for Assumption 5 is $\mathbb{E}_{\theta^{*}}|\log(b_{+}(\overline{\bf Y}_{0}^{1},W_{1})/b_{-}(\overline{\bf Y}_{0}^{1},W_{1}))|^{1+\delta}<\infty$ for some $\delta>0$ . This is because $\mathbb{P}_{\theta^{*}}(b_{+}(\overline{\bf Y}_{0}^{1},W_{1})/b_{-}(\overline{\bf Y}_{0}^{1},W_{1})\geq e^{\alpha r})=\mathbb{P}_{\theta^{*}}(\log(b_{+}(\overline{\bf Y}_{0}^{1},W_{1})/b_{-}(\overline{\bf Y}_{0}^{1},W_{1}))\geq\alpha r)\leq(\mathbb{E}_{\theta^{*}}|\log(b_{+}(\overline{\bf Y}_{0}^{1},W_{1})/b_{-}(\overline{\bf Y}_{0}^{1},W_{1}))|^{1+\delta})/(\alpha r)^{1+\delta}\leq C_{2}r^{-(1+\delta)}$ , where the first inequality follows from Markov’s inequality. Examples 1–4 satisfy Assumptions 4 and 5.
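The Markov-inequality step behind the sufficient condition can be checked on simulated draws. This is a minimal sketch under stated assumptions: the list `L` is a hypothetical stand-in for $\log(b_{+}/b_{-})$ (a quadratic function of a standard normal draw), and the values of `alpha`, `delta`, and the grid of `r` are illustrative choices, not quantities from the paper.

```python
import random


def tail_and_markov_bound(samples, alpha, r, delta):
    """Empirical P(L >= alpha*r) and the bound E|L|^(1+delta) / (alpha*r)^(1+delta)."""
    n = len(samples)
    t = alpha * r
    tail = sum(1 for v in samples if v >= t) / n
    moment = sum(abs(v) ** (1.0 + delta) for v in samples) / n
    return tail, moment / t ** (1.0 + delta)


rng = random.Random(0)
# Hypothetical stand-in for log(b_+/b_-): quadratic in a standard normal draw.
L = [0.5 * (abs(rng.gauss(0.0, 1.0)) + 1.0) ** 2 for _ in range(20000)]

results = {r: tail_and_markov_bound(L, alpha=1.0, r=r, delta=0.5)
           for r in (1, 2, 4, 8)}
for r, (tail, bound) in results.items():
    print(r, tail, bound)
```

Because the bound uses the sample moment, the inequality holds exactly for the empirical distribution, and the bound decays at the polynomial rate $r^{-(1+\delta)}$ required by Assumption 5 with $\beta=1+\delta$.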

In the following lemma, we show that the difference between the conditional log-likelihood function $l_{n}(\theta,x_{0})$ and the stationary log-likelihood function $l_{n}(\theta)$ is $o(n)$ $\mathbb{P}_{\theta^{*}}$ -a.s.

Lemma 2.

Assume Assumptions 1–5. Then,

[TABLE]

When $p=1$ , Lemma 2 of DMR shows that $\sup_{\theta\in\Theta}|l_{n}(\theta,x_{0})-l_{n}(\theta)|$ is bounded by a deterministic constant. When $p\geq 2$ , Lemma 2 of DMR is no longer applicable because $|l_{n}(\theta,x_{0})-l_{n}(\theta)|$ depends on the products of $1-\omega(\overline{\bf Y}_{pi-p}^{pi-1},{\bf W}_{pi-p}^{pi-1})$ ’s for $i=1,\ldots,\lfloor n/p\rfloor$ . A key observation is that $\{\omega(\overline{\bf Y}_{pi-p}^{pi-1},{\bf W}_{pi-p}^{pi-1})\}_{i\geq 1}$ is stationary and ergodic and that $\epsilon:=\mathbb{P}_{\theta^{*}}(\omega(\overline{\bf Y}_{pi-p}^{pi-1},{\bf W}_{pi-p}^{pi-1})\leq\delta)$ is small when $\delta>0$ is sufficiently small. Because the strong law of large numbers implies that $(\lfloor n/p\rfloor)^{-1}\sum_{i=1}^{\lfloor n/p\rfloor}\mathbb{I}\{\omega(\overline{\bf Y}_{pi-p}^{pi-1},{\bf W}_{pi-p}^{pi-1})>\delta\}$ converges to $1-\epsilon$ $\mathbb{P}_{\theta^{*}}$ -a.s., $1-\omega(\overline{\bf Y}_{pi-p}^{pi-1},{\bf W}_{pi-p}^{pi-1})\leq 1-\delta$ holds for a large fraction of the $\omega(\overline{\bf Y}_{pi-p}^{pi-1},{\bf W}_{pi-p}^{pi-1})$ ’s. Consequently, we can establish a $\mathbb{P}_{\theta^{*}}$ -a.s. bound on $n^{-1}|l_{n}(\theta,x_{0})-l_{n}(\theta)|$ .
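The counting argument in this paragraph reduces to an elementary inequality: every factor $1-\omega_{i}$ is at most 1, and each factor with $\omega_{i}>\delta$ is at most $1-\delta$, so the product is bounded by $(1-\delta)$ raised to the number of such factors. The sketch below verifies this on hypothetical coefficients; the Beta(2, 5) draws, the threshold `delta`, and the seed are assumptions for illustration and are unrelated to the actual distribution of $\omega$ in any model.

```python
import math
import random


def log_product_and_bound(omegas, delta):
    """Return log prod(1 - omega_i), the bound #{omega_i > delta} * log(1 - delta),
    and the fraction of coefficients exceeding delta."""
    log_prod = sum(math.log1p(-w) for w in omegas)
    n_good = sum(1 for w in omegas if w > delta)
    return log_prod, n_good * math.log1p(-delta), n_good / len(omegas)


rng = random.Random(1)
# Hypothetical minorization coefficients in (0, 1); most are not close to zero.
omegas = [rng.betavariate(2.0, 5.0) for _ in range(500)]
log_prod, log_bound, frac_good = log_product_and_bound(omegas, delta=0.05)
print(log_prod, log_bound, frac_good)
```

When the fraction of coefficients above `delta` stabilizes near $1-\epsilon$, the bound decays geometrically in the number of blocks, which is the mechanism the paragraph describes.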

We proceed to show that, for all $\theta\in\Theta$ , $p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-m},{\bf W}_{-m}^{k})$ converges to $p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-\infty},{\bf W}_{-\infty}^{k})$ $\mathbb{P}_{\theta^{*}}$ -a.s. as $m\to\infty$ and that we can approximate $n^{-1}l_{n}(\theta)$ by $n^{-1}\sum_{k=1}^{n}\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-\infty},{\bf W}_{-\infty}^{k})$ , which is the sample average of the stationary ergodic random variables. For $x\in\mathcal{X}$ and $m\geq 0$ , define

[TABLE]

so that $l_{n}(\theta)=\sum_{k=1}^{n}\Delta_{k,0}(\theta)$ . The following lemma corresponds to Lemma 3 of DMR. This lemma shows that, for any $k\geq 0$ , the sequences $\{\Delta_{k,m}(\theta)\}_{m\geq 0}$ and $\{\Delta_{k,m,x}(\theta)\}_{m\geq 0}$ are Cauchy uniformly in $\theta\in\Theta$ .

Lemma 3.

Assume Assumptions 1–5. Then, there exist a constant $\rho\in(0,1)$ and random sequences $\{A_{k,m}\}_{k\geq 1,m\geq 0}$ and $\{B_{k}\}_{k\geq 1}$ such that, for all $1\leq k\leq n$ and $m^{\prime}\geq m\geq 0$ ,

[TABLE]

where $\mathbb{P}_{\theta^{*}}\left(A_{k,m}\geq M\ \text{i.o.}\right)=0$ for a constant $M<\infty$ and $B_{k}\in L^{1}(\mathbb{P}_{\theta^{*}})$ .

Lemma 3(a) implies that $\{\Delta_{k,m,x}(\theta)\}_{m\geq 0}$ is a uniform Cauchy sequence in $\theta\in\Theta$ with probability one and that $\lim_{m\to\infty}\Delta_{k,m,x}(\theta)$ does not depend on $x$ . Let $\Delta_{k,\infty}(\theta)$ denote this limit. Because $\{\Delta_{k,m,x}(\theta)\}_{m\geq 0}$ is uniformly bounded in $L^{1}(\mathbb{P}_{\theta^{*}})$ from Lemma 3(c), $\{\Delta_{k,m,x}(\theta)\}_{m\geq 0}$ converges to $\Delta_{k,\infty}(\theta)$ in $L^{1}(\mathbb{P}_{\theta^{*}})$ and $\Delta_{k,\infty}(\theta)\in L^{1}(\mathbb{P}_{\theta^{*}})$ by the dominated convergence theorem. Define $l(\theta):=\mathbb{E}_{\theta^{*}}[\Delta_{0,\infty}(\theta)]$ . Lemma 3 also implies that $n^{-1}l_{n}(\theta)-n^{-1}\sum_{k=1}^{n}\Delta_{k,\infty}(\theta)$ converges to zero, and $n^{-1}\sum_{k=1}^{n}\Delta_{k,\infty}(\theta)$ converges to $l(\theta)$ by the ergodic theorem. Therefore, the consistency of $\hat{\theta}_{x_{0}}$ is proven if this convergence of $n^{-1}l_{n}(\theta)-l(\theta)$ is strengthened to uniform convergence in $\theta\in\Theta$ and the additional regularity conditions are confirmed.

We introduce additional assumptions on the continuity of $q_{\theta}$ and $g_{\theta}$ and identification of $\theta^{*}$ .

Assumption 6.

(a) For all $(\overline{\bf y},y^{\prime},w)\in\mathcal{Y}^{s}\times\mathcal{Y}\times\mathcal{W}$ and uniformly in $x,x^{\prime}\in\mathcal{X}$ , $q_{\theta}(x,x^{\prime})$ and $g_{\theta}(y^{\prime}|\overline{\bf y},x,w)$ are continuous in $\theta$ . (b) $\mathbb{P}_{\theta^{*}}[p_{\theta^{*}}(Y_{1}|\overline{\bf{Y}}_{-m}^{0},{\bf W}_{-m}^{1})\neq p_{\theta}(Y_{1}|\overline{\bf{Y}}_{-m}^{0},{\bf W}_{-m}^{1})]>0$ for all $m\geq 0$ and all $\theta\in\Theta$ such that $\theta\neq\theta^{*}$ .

Assumption 6(b) is a high-level assumption because it is imposed on $p_{\theta}(Y_{1}|\overline{\bf{Y}}_{-m}^{0},{\bf W}_{-m}^{1})$ . When the covariate $W_{k}$ is absent, DMR prove consistency under a lower-level assumption (their (A5′)), which is stated in terms of $p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf{Y}}_{0})$ . We use Assumption 6(b) for brevity.

The following proposition shows the strong consistency of the (conditional) MLE.

Proposition 1.

Assume Assumptions 1–6. Then, $\sup_{x_{0}\in\mathcal{X}}|\hat{\theta}_{x_{0}}-\theta^{*}|\to 0$ $\mathbb{P}_{\theta^{*}}$ -a.s.

Francq and Roussignol (1998, Theorem 3) prove the consistency of the MLE when the state space of $X_{k}$ is finite. Proposition 1 generalizes Theorem 3 of Francq and Roussignol (1998) in the following three respects. First, we allow $X_{k}$ to be continuously distributed. Second, we analyze the log-likelihood function conditional on $X_{0}=x_{0}$ , whereas Francq and Roussignol (1998) set the initial distribution of $X_{1}$ to any probability vector with strictly positive elements. In other words, we allow for zeros in the postulated initial distribution of $\{X_{k}\}$ . Third, we allow for an exogenous covariate $\{W_{k}\}_{k=0}^{n}$ . Leroux (1992), Le Gland and Mevel (2000), and Douc and Matias (2001) analyze the asymptotic properties of the MLE of hidden Markov models, which are a special case of the model considered here in that $g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},X_{k},W_{k})$ does not depend on $\overline{\bf Y}_{k-1}$ .

Define the MLE with a probability measure $\xi$ on $\mathcal{B}(\mathcal{X})$ for $x_{0}$ as $\hat{\theta}_{\xi}:=\operatorname*{\arg\!\max}_{\theta\in\Theta}l_{n}(\theta,\xi)$ with $l_{n}(\theta,\xi)$ defined in (9). Proposition 1 implies the following corollary.

Corollary 1.

Assume Assumptions 1–6. Then, for any $\xi$ , $\hat{\theta}_{\xi}\to\theta^{*}$ $\mathbb{P}_{\theta^{*}}$ -a.s.

5 Asymptotic distribution of the MLE

In this section, we derive the asymptotic distribution of the MLE and consistency of the asymptotic covariance matrix estimate. Because $\hat{\theta}_{x_{0}}$ is consistent, expanding the first-order condition $\nabla_{\theta}l_{n}(\hat{\theta}_{x_{0}},x_{0})=0$ around $\theta^{*}$ gives

[TABLE]

where $\overline{\theta}\in[\theta^{*},\hat{\theta}_{x_{0}}]$ and $\overline{\theta}$ may take different values across different rows of $\nabla_{\theta}^{2}l_{n}(\overline{\theta},x_{0})$ . In the following, we approximate $\nabla_{\theta}^{j}l_{n}(\theta,x_{0})=\sum_{k=1}^{n}\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k},X_{0}=x_{0})$ for $j=1,2$ by $\sum_{k=1}^{n}\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-\infty}^{k-1},{\bf W}_{-\infty}^{k})$ , which is a sum of a stationary process. We then apply the central limit theorem and law of large numbers to $n^{-j/2}\sum_{k=1}^{n}\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-\infty}^{k-1},{\bf W}_{-\infty}^{k})$ . A similar expansion gives the asymptotic distribution of $n^{1/2}(\hat{\theta}_{\xi}-\theta^{*})$ .

We introduce additional assumptions. Define $\mathcal{X}_{\theta}^{+}:=\{(x,x^{\prime})\in\mathcal{X}^{2}:q_{\theta}(x,x^{\prime})>0\}$ .

Assumption 7.

There exists a constant $\delta>0$ such that the following conditions hold on $G:=\{\theta\in\Theta:|\theta-\theta^{*}|<\delta\}$ : (a) For all $(\overline{\bf{y}},y^{\prime},w,x,x^{\prime})\in\mathcal{Y}^{s}\times\mathcal{Y}\times\mathcal{W}\times\mathcal{X}\times\mathcal{X}$ , the functions $g_{\theta}(y^{\prime}|\overline{\bf y},x,w)$ and $q_{\theta}(x,x^{\prime})$ are twice continuously differentiable in $\theta\in G$ . (b) $\sup_{\theta\in G}\sup_{x,x^{\prime}\in\mathcal{X}_{\theta}^{+}}|\nabla_{\theta}\log q_{\theta}(x,x^{\prime})|<\infty$ and $\sup_{\theta\in G}\sup_{x,x^{\prime}\in\mathcal{X}_{\theta}^{+}}|\nabla_{\theta}^{2}\log q_{\theta}(x,x^{\prime})|<\infty$ . (c) $\mathbb{E}_{\theta^{*}}[\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\nabla_{\theta}\log g_{\theta}(Y_{1}|\overline{\bf{Y}}_{0},x,W_{1})|^{2}]<\infty$ and $\mathbb{E}_{\theta^{*}}[\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\nabla_{\theta}^{2}\log g_{\theta}(Y_{1}|\overline{\bf{Y}}_{0},x,W_{1})|]<\infty$ . (d) For almost all $(\overline{\bf{y}},y^{\prime},w)\in\mathcal{Y}^{s}\times\mathcal{Y}\times\mathcal{W}$ , there exists a function $f_{\overline{\bf{y}},y^{\prime},w}:\mathcal{X}\rightarrow\mathbb{R}^{+}$ in $L^{1}(\mu)$ such that $\sup_{\theta\in G}g_{\theta}(y^{\prime}|\overline{\bf y},x,w)\leq f_{\overline{\bf y},y^{\prime},w}(x)$ . (e) For almost all $(x,\overline{\bf{y}},w)\in\mathcal{X}\times\mathcal{Y}^{s}\times\mathcal{W}$ and $j=1,2$ , there exist functions $f^{j}_{x,\overline{\bf y},w}:\mathcal{Y}\rightarrow\mathbb{R}^{+}$ in $L^{1}(\nu)$ such that $|\nabla_{\theta}^{j}g_{\theta}(y^{\prime}|\overline{\bf y},x,w)|\leq f^{j}_{x,\overline{\bf y},w}(y^{\prime})$ for all $\theta\in G$ .

Assumption 8.

$\mathbb{E}_{\theta^{*}}[\sup_{m\geq 0}\sup_{\theta\in G}|\nabla_{\theta}\log p_{\theta}(Y_{1}|\overline{\bf{Y}}^{0}_{-m},{\bf W}^{1}_{-m})|^{2}]<\infty$ , $\mathbb{E}_{\theta^{*}}[\sup_{m\geq 0}\sup_{\theta\in G}|\nabla_{\theta}^{2}\log p_{\theta}(Y_{1}|\overline{\bf{Y}}^{0}_{-m},{\bf W}^{1}_{-m})|]<\infty$ , $\mathbb{E}_{\theta^{*}}[\sup_{m\geq 0}\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\nabla_{\theta}\log p_{\theta}(Y_{1}|\overline{\bf{Y}}^{0}_{-m},{\bf W}^{1}_{-m},X_{-m}=x)|^{2}]<\infty$ , and $\mathbb{E}_{\theta^{*}}[\sup_{m\geq 0}\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\nabla_{\theta}^{2}\log p_{\theta}(Y_{1}|\overline{\bf{Y}}^{0}_{-m},{\bf W}^{1}_{-m},X_{-m}=x)|]<\infty$ .

Assumption 7 is the same as Assumptions (A6)–(A8) of DMR except for accommodating the case $\inf_{(x,x^{\prime})\in\mathcal{X}^{2}}q_{\theta}(x,x^{\prime})=0$ and the covariate $W$ . In Assumption 7(b), the supremum is taken over $\mathcal{X}_{\theta}^{+}$ because $\nabla_{\theta}\log q_{\theta}(x,x^{\prime})$ and $\nabla_{\theta}^{2}\log q_{\theta}(x,x^{\prime})$ are not well-defined when $q_{\theta}(x,x^{\prime})=0$ . Examples 1–4 satisfy Assumption 7. Assumption 8 is a high-level assumption that bounds the moments of $\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k})$ and $\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k},X_{-m}=x)$ uniformly in $m$ . When $p=1$ , DMR could derive Assumption 8 by using the $L^{3-j}(\mathbb{P}_{\theta^{*}})$ convergence of $\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k})$ and $\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k},X_{-m}=x)$ to $\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}_{-\infty}^{k-1},{\bf W}_{-\infty}^{k})$ as $m\to\infty$ . When $p\geq 2$ , we must impose Assumption 8 because our Lemma 4 only shows that these sequences converge to $\nabla_{\theta}^{j}\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-\infty},{\bf W}_{-\infty}^{k})$ in probability.

5.1 Asymptotic distribution of the score function

This section derives the asymptotic distribution of $n^{-1/2}\nabla_{\theta}l_{n}(\theta^{*},x_{0})$ and $n^{-1/2}\nabla_{\theta}l_{n}(\theta^{*},\xi)$ . We introduce a result known as the Louis missing information principle (Louis, 1982), which expresses the derivatives of the log-likelihood function of a latent variable model in terms of the conditional expectation of the derivatives of the complete data log-likelihood function. Let $(X,Y,W)$ be random variables with $p_{\theta}(y,x|w)$ denoting the joint density of $(Y,X)$ given $W$ , and let $p_{\theta}(y|w)$ be the marginal density of $Y$ given $W$ . Then, a straightforward differentiation that is valid under Assumption 7 gives $\nabla_{\theta}\log p_{\theta}(Y|W)=\mathbb{E}_{\theta}\left[\nabla_{\theta}\log p_{\theta}(Y,X|W)\middle|Y,W\right]$ . In terms of the variables in our model, we have, for any $k\geq 1$ and $m\geq 0$ ,

[TABLE]

where the last equality follows from Assumption 2.
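The first-order identity $\nabla_{\theta}\log p_{\theta}(Y|W)=\mathbb{E}_{\theta}[\nabla_{\theta}\log p_{\theta}(Y,X|W)|Y,W]$ can be verified numerically in a toy latent-variable model. The sketch below is an illustration only: it uses a two-component Gaussian mixture without covariates, with assumed mixing weights and unit variances, and differentiates with respect to the first component mean. It compares the posterior-averaged complete-data score with a central finite difference of the observed-data log-density.

```python
import math

WEIGHTS = (0.3, 0.7)  # assumed mixing weights of the latent variable X


def log_joint(y, x, mu):
    """log p_theta(y, x) for a two-component unit-variance Gaussian mixture."""
    return (math.log(WEIGHTS[x]) - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * (y - mu[x]) ** 2)


def log_marginal(y, mu):
    """log p_theta(y), obtained by summing the joint density over x."""
    return math.log(sum(math.exp(log_joint(y, x, mu)) for x in (0, 1)))


def score_via_posterior(y, mu):
    """E[d/d mu_0 log p(y, X) | y]: complete-data score averaged over the posterior."""
    post0 = math.exp(log_joint(y, 0, mu) - log_marginal(y, mu))
    # d/d mu_0 log p(y, x) equals (y - mu_0) when x = 0 and 0 when x = 1.
    return post0 * (y - mu[0])


def score_via_finite_difference(y, mu, h=1e-6):
    """Central difference of the observed-data log-density in mu_0."""
    up = log_marginal(y, (mu[0] + h, mu[1]))
    down = log_marginal(y, (mu[0] - h, mu[1]))
    return (up - down) / (2.0 * h)


mu = (-1.0, 1.5)
for y in (-1.0, 0.3, 2.0):
    print(y, score_via_posterior(y, mu), score_via_finite_difference(y, mu))
```

The two columns agree up to finite-difference error, which is the content of the identity in this simple setting.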

Define $\overline{\bf Z}_{k-1}^{k}:=(Y_{k},X_{k},\overline{\bf Y}_{k-1},X_{k-1})$ . For $j=1,2$ , denote the derivatives of the complete data log-density of $(Y_{k},X_{k})$ given $(\overline{\bf Y}_{k-1},X_{k-1},W_{k})$ by

[TABLE]

We use the shorthand notation $\phi^{j}_{\theta k}:=\phi^{j}(\theta,\overline{\bf Z}_{k-1}^{k},W_{k})$ . We also suppress the superscript $1$ from $\phi^{1}_{\theta k}$ , so that $\phi_{\theta k}=\phi^{1}_{\theta k}$ . Let $|\phi^{j}_{k}|_{\infty}:=\sup_{\theta\in G}\sup_{x,x^{\prime}\in\mathcal{X}_{\theta}^{+}}|\nabla_{\theta}^{j}\log q_{\theta}(x,x^{\prime})|+\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\nabla_{\theta}^{j}\log g_{\theta}(Y_{k}|\overline{\bf{Y}}_{k-1},x,W_{k})|$ . Define, for $x\in\mathcal{X}$ , $k\geq 1$ , $m\geq 0$ , and $j=1,2$ ,

[TABLE]

(Footnote 3: DMR (page 2272) use the symbol $\Delta_{k,m,x}(\theta)$ to denote our $\Psi_{k,m,x}^{1}(\theta)$ , but we use $\Psi_{k,m,x}(\theta)$ to avoid confusion with $\Delta_{k,m,x}(\theta)$ used in Lemma 3.)

It follows from (12) and (13) that $\Psi_{k,m,x}^{1}(\theta)=\nabla_{\theta}\log p_{\theta}({\bf Y}^{k}_{-m+1}|\overline{\bf Y}_{-m},{\bf W}^{k}_{-m},X_{-m}=x)-\nabla_{\theta}\log p_{\theta}({\bf Y}^{k-1}_{-m+1}|\overline{\bf Y}_{-m},{\bf W}^{k-1}_{-m},X_{-m}=x)=\nabla_{\theta}\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-m},{\bf W}^{k}_{-m},X_{-m}=x)$ . Therefore, we can express $\nabla_{\theta}l_{n}(\theta,x_{0})$ as

[TABLE]

Lemma 4 below shows that $\{\Psi_{k,m,x}^{j}(\theta)\}_{m\geq 0}$ is a Cauchy sequence that converges to a limit at an exponential rate in probability. Note that $\Psi_{k,m,x}^{j}(\theta)$ is a function of $\mathbb{E}_{\theta}[\phi^{j}_{\theta t}|\cdot]$ for $t=-m+1,\ldots,k$ . When $t$ is large, the difference between $\mathbb{E}_{\theta}[\phi^{j}_{\theta t}|\overline{\bf Y}^{k}_{-m},{\bf W}^{k}_{-m},X_{-m}=x]$ and $\mathbb{E}_{\theta}[\phi^{j}_{\theta t}|\overline{\bf Y}^{k}_{-m^{\prime}},{\bf W}^{k}_{-m^{\prime}},X_{-m^{\prime}}=x^{\prime}]$ with $m^{\prime}>m$ is small because the chain $\{X_{t}\}_{t=-m^{\prime}}^{k}$ conditional on $(\overline{\bf Y}_{-m^{\prime}}^{k},{\bf W}_{-m^{\prime}}^{k})$ forgets its past (i.e., $\overline{\bf Y}_{-m^{\prime}}^{m}$ , ${\bf W}_{-m^{\prime}}^{m}$ , and $X_{-m}$ ) at an exponential rate by virtue of Lemma 1. When $t$ is small, the term $\mathbb{E}_{\theta}[\phi^{j}_{\theta t}|\overline{\bf Y}^{k}_{-m},{\bf W}^{k}_{-m},X_{-m}=x]-\mathbb{E}_{\theta}[\phi^{j}_{\theta t}|\overline{\bf Y}^{k-1}_{-m},{\bf W}^{k-1}_{-m},X_{-m}=x]$ in $\Psi_{k,m,x}^{j}(\theta)$ is small because Lemma 10 in the appendix shows that the time-reversed process $\{X_{k-t}\}_{0\leq t\leq k+m}$ conditional on $(\overline{\bf Y}^{k}_{-m},{\bf W}^{k}_{-m})$ forgets its initial condition (i.e., $Y_{k}$ and $W_{k}$ ) at an exponential rate.

Define, for $k\geq 0$ , $m\geq 0$ , and $j=1,2$ ,

[TABLE]

Note that $\Psi_{k,m}^{1}(\theta)=\nabla_{\theta}\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-m},{\bf W}^{k}_{-m})$ . From Lemma 1 and Lemma 10, we obtain the following bound on $\Psi_{k,m,x}^{j}(\theta)-\Psi_{k,m}^{j}(\theta)$ and $\Psi_{k,m,x}^{j}(\theta)-\Psi_{k,m^{\prime},x^{\prime}}^{j}(\theta)$ .

Lemma 4.

Assume Assumptions 1–8. Then, for $j=1,2$ , there exist a constant $\rho\in(0,1)$ , random sequences $\{A_{k,m}\}_{k\geq 1,m\geq 0}$ and $\{B_{m}\}_{m\geq 0}$ , and a random variable $K_{j}\in L^{3-j}(\mathbb{P}_{\theta^{*}})$ such that, for all $1\leq k\leq n$ and $m^{\prime}\geq m\geq 0$ ,

[TABLE]

where $\mathbb{P}_{\theta^{*}}\left(A_{k,m}\geq 1\ \text{i.o.}\right)=0$ , $B_{m}<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s., and the distribution function of $B_{m}$ does not depend on $m$ .

Because $B_{m}\rho^{\lfloor(k+m)/4(p+1)\rfloor/2}\to_{p}0$ as $m\to\infty$ , Lemma 4 implies that $\{\Psi_{k,m,x}^{1}(\theta)\}_{m\geq 0}$ converges to $\Psi_{k,\infty}^{1}(\theta)=\nabla_{\theta}\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-\infty},{\bf W}^{k}_{-\infty})$ in probability uniformly in $\theta\in G$ and $x\in\mathcal{X}$ . Define the filtration $\mathcal{F}$ by $\mathcal{F}_{k}:=\sigma((\overline{\bf{Y}}_{i},W_{i+1}):-\infty<i\leq k)$ . It follows from $\mathbb{E}_{\theta^{*}}[\Psi_{k,m}^{1}(\theta^{*})|\overline{\bf Y}^{k-1}_{-m},{\bf W}^{k}_{-m}]=0$ , Assumption 8, and combining Exercise 2.3.7 and Theorem 5.5.9 of Durrett (2010) that $\mathbb{E}_{\theta^{*}}[\Psi_{k,\infty}^{1}(\theta^{*})|\overline{\bf{Y}}_{-\infty}^{k-1},{\bf W}^{k}_{-\infty}]=0$ and $I(\theta^{*}):=\mathbb{E}_{\theta^{*}}[\Psi_{0,\infty}^{1}(\theta^{*})(\Psi_{0,\infty}^{1}(\theta^{*}))^{\prime}]<\infty$ . Therefore, $\{\Psi_{k,\infty}^{1}(\theta^{*})\}_{k=-\infty}^{\infty}$ is an $(\mathcal{F},\mathbb{P}_{\theta^{*}})$ -adapted stationary, ergodic, and square integrable martingale difference sequence, to which a martingale central limit theorem is applicable.

Setting $m=0$ and letting $m^{\prime}\to\infty$ in Lemma 4 shows that $n^{-1/2}\sum_{k=1}^{n}\Psi_{k,0,x_{0}}^{1}(\theta^{*})-n^{-1/2}\sum_{k=1}^{n}\Psi_{k,\infty}^{1}(\theta^{*})$ is bounded by $n^{-1/2}\sum_{k=1}^{n}k^{2}\tilde{\rho}^{k}$ in probability for some $\tilde{\rho}\in(0,1)$ . Consequently, as the following proposition shows, the score function is asymptotically normally distributed.

Proposition 2.

Assume Assumptions 1–8. Then, (a) for any $x_{0}\in\mathcal{X}$ , $n^{-1/2}\nabla_{\theta}l_{n}(\theta^{*},x_{0})\to_{d}N(0,I(\theta^{*}))$ ; (b) for any probability measure $\xi$ on $\mathcal{B}(\mathcal{X})$ for $x_{0}$ , $n^{-1/2}\nabla_{\theta}l_{n}(\theta^{*},\xi)\to_{d}N(0,I(\theta^{*}))$ .

5.2 Convergence of the Hessian

This section derives the probability limit of $n^{-1}\nabla_{\theta}^{2}l_{n}(\theta,x_{0})$ and $n^{-1}\nabla_{\theta}^{2}l_{n}(\theta,\xi)$ when $\theta$ is in a neighborhood of $\theta^{*}$ . The Louis missing information principle for the second derivative is given by $\nabla_{\theta}^{2}\log p_{\theta}(Y|W)=\mathbb{E}_{\theta}\left[\nabla_{\theta}^{2}\log p_{\theta}(Y,X|W)\middle|Y,W\right]+\text{var}_{\theta}\left[\nabla_{\theta}\log p_{\theta}(Y,X|W)\middle|Y,W\right]$ . In terms of the variables in our model, we have, for any $k\geq 1$ and $m\geq 0$ ,

[TABLE]
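The second-order identity, $\nabla_{\theta}^{2}\log p_{\theta}(Y|W)=\mathbb{E}_{\theta}[\nabla_{\theta}^{2}\log p_{\theta}(Y,X|W)|Y,W]+\text{var}_{\theta}[\nabla_{\theta}\log p_{\theta}(Y,X|W)|Y,W]$, can also be checked in a toy model. As a hedged illustration, the sketch below uses a two-component unit-variance Gaussian mixture without covariates (assumed weights, derivative taken with respect to the first mean $\mu_{0}$) and compares the two-term expression with a second-order finite difference.

```python
import math

W = (0.3, 0.7)  # assumed mixing weights of the latent variable X


def _component_weights(y, mu0, mu1):
    """Unnormalized posterior weights of X = 0 and X = 1 given y."""
    a = W[0] * math.exp(-0.5 * (y - mu0) ** 2)
    b = W[1] * math.exp(-0.5 * (y - mu1) ** 2)
    return a, b


def log_marginal(y, mu0, mu1):
    """Observed-data log-density log p_theta(y)."""
    a, b = _component_weights(y, mu0, mu1)
    return math.log((a + b) / math.sqrt(2.0 * math.pi))


def hessian_via_louis(y, mu0, mu1):
    """Expected complete-data Hessian plus conditional variance of the score."""
    a, b = _component_weights(y, mu0, mu1)
    p0 = a / (a + b)
    expected_hessian = -p0                           # E[d^2/dmu0^2 log p(y, X) | y]
    score_variance = (y - mu0) ** 2 * p0 * (1 - p0)  # var[d/dmu0 log p(y, X) | y]
    return expected_hessian + score_variance


def hessian_via_finite_difference(y, mu0, mu1, h=1e-4):
    """Second-order central difference of log p_theta(y) in mu0."""
    f = lambda m: log_marginal(y, m, mu1)
    return (f(mu0 + h) - 2.0 * f(mu0) + f(mu0 - h)) / h ** 2


for y in (-1.0, 0.3, 2.0):
    print(y, hessian_via_louis(y, -1.0, 1.5), hessian_via_finite_difference(y, -1.0, 1.5))
```

The variance term is what distinguishes the observed-data Hessian from the averaged complete-data Hessian; dropping it would break the agreement between the two columns.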

Define

[TABLE]

From (13)–(16), we can write $\nabla_{\theta}^{2}l_{n}(\theta,x_{0})$ in terms of $\{\Psi^{2}_{k,m,x}(\theta)\}$ and $\{\Gamma_{k,m,x}(\theta)\}$ as

[TABLE]

The following lemma provides the bounds on $\Gamma_{k,m,x}(\theta)$ that are analogous to Lemma 4.

Lemma 5.

Assume Assumptions 1–8. Then, there exist a constant $\rho\in(0,1)$ , random sequences $\{C_{k,m}\}_{k\geq 1,m\geq 0}$ and $\{D_{m}\}_{m\geq 0}$ , and a random variable $K\in L^{1}(\mathbb{P}_{\theta^{*}})$ such that, for all $1\leq k\leq n$ and $m^{\prime}\geq m\geq 0$ ,

[TABLE]

where $\mathbb{P}_{\theta^{*}}\left(C_{k,m}\geq 1\text{ i.o.}\right)=0$ , $D_{m}<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s. and the distribution function of $D_{m}$ does not depend on $m$ .

Lemma 5 implies that $\{\Gamma_{k,m,x}(\theta)\}_{m\geq 0}$ converges to $\Gamma_{k,\infty}(\theta)$ in probability uniformly in $x\in\mathcal{X}$ and $\theta\in G$ . The following proposition is a local uniform law of large numbers for the observed Hessian.

Proposition 3.

Assume Assumptions 1–8. Then,

[TABLE]

The following proposition shows the asymptotic normality of the MLE.

Proposition 4.

Assume Assumptions 1–8. Then, (a) for any $x_{0}\in\mathcal{X}$ , $n^{1/2}(\hat{\theta}_{x_{0}}-\theta^{*})\to_{d}N(0,I(\theta^{*})^{-1})$ ; (b) for any probability measure $\xi$ on $\mathcal{B}(\mathcal{X})$ for $x_{0}$ , $n^{1/2}(\hat{\theta}_{\xi}-\theta^{*})\to_{d}N(0,I(\theta^{*})^{-1})$ .
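In practice, Proposition 4 is used by treating the MLE as approximately normal with covariance $I(\theta^{*})^{-1}/n$, which gives the usual Wald intervals. The sketch below shows that computation under stated assumptions: `info_hat` stands for any consistent estimate of $I(\theta^{*})$, the function names are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    d = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(d):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]


def wald_intervals(theta_hat, info_hat, n, z=1.96):
    """Intervals theta_hat_i +/- z * sqrt([I^{-1}]_ii / n) for each component i."""
    cov = invert(info_hat)
    intervals = []
    for i, t in enumerate(theta_hat):
        se = math.sqrt(cov[i][i] / n)
        intervals.append((t - z * se, t + z * se))
    return intervals


# Made-up inputs, for illustration only.
intervals = wald_intervals([1.0, -0.5], [[2.0, 0.0], [0.0, 8.0]], n=100)
print(intervals)
```

The default `z=1.96` corresponds to a nominal 95 percent interval; other levels require the matching standard normal quantile.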

5.3 Convergence of the covariance matrix estimate

When conducting statistical inferences with the MLE, the researcher needs to estimate the asymptotic covariance matrix of the MLE. Proposition 3 already derived the consistency of the observed Hessian. We derive the consistency of the outer-product-of-gradients (OPG) estimates:

[TABLE]

where $\nabla_{\theta}\log p_{\theta\xi}(Y_{k}|\overline{\bf Y}^{k-1}_{0},{\bf W}^{k}_{0}):=\nabla_{\theta}\log\int p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{0},{\bf W}^{k}_{0},x_{0})\xi(dx_{0})$ . In applications, $\nabla_{\theta}\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{0},{\bf W}^{k}_{0},x_{0})$ can be computed by numerically differentiating $\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{0},{\bf W}^{k}_{0},x_{0})$ , which in turn can be computed by using the recursive algorithm of Hamilton (1996).
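The computation described above can be sketched as follows. This is a minimal illustration under explicit assumptions, not the authors' implementation: it uses a standard forward filtering recursion as a stand-in for the cited algorithm, a hypothetical two-regime switching-mean model with unit variance and no covariates, parameter vector `theta = (mu0, mu1, p00, p11)`, and central differences with an assumed step `h`. The OPG matrix is formed as the sample average of outer products of the per-observation numerical scores; the exact expressions (17) and (18) are not reproduced here.

```python
import math
import random


def log_densities(theta, ys, x0):
    """Per-observation log p_theta(y_k | y_1..y_{k-1}, X_0 = x0) by forward filtering."""
    mu0, mu1, p00, p11 = theta
    Q = [[p00, 1.0 - p00], [1.0 - p11, p11]]
    mu = [mu0, mu1]
    pi = [1.0 if i == x0 else 0.0 for i in range(2)]  # point mass at X_0 = x0
    out = []
    for y in ys:
        pred = [pi[0] * Q[0][j] + pi[1] * Q[1][j] for j in range(2)]
        joint = [pred[j] * math.exp(-0.5 * (y - mu[j]) ** 2) / math.sqrt(2.0 * math.pi)
                 for j in range(2)]
        lik = joint[0] + joint[1]
        out.append(math.log(lik))
        pi = [joint[0] / lik, joint[1] / lik]
    return out


def numerical_scores(theta, ys, x0, h=1e-5):
    """Central-difference gradient of each log-density term with respect to theta."""
    n, d = len(ys), len(theta)
    scores = [[0.0] * d for _ in range(n)]
    for i in range(d):
        up, dn = list(theta), list(theta)
        up[i] += h
        dn[i] -= h
        lu, ld = log_densities(up, ys, x0), log_densities(dn, ys, x0)
        for k in range(n):
            scores[k][i] = (lu[k] - ld[k]) / (2.0 * h)
    return scores


def opg(scores):
    """Average outer product of the per-observation scores."""
    n, d = len(scores), len(scores[0])
    return [[sum(s[i] * s[j] for s in scores) / n for j in range(d)] for i in range(d)]


# Simulated data from assumed parameter values (illustration only).
rng = random.Random(0)
theta_true = (-1.0, 1.0, 0.9, 0.8)
x, ys = 0, []
for _ in range(500):
    stay = theta_true[2] if x == 0 else theta_true[3]
    x = x if rng.random() < stay else 1 - x
    ys.append(theta_true[x] + rng.gauss(0.0, 1.0))

I_hat = opg(numerical_scores(theta_true, ys, x0=0))
```

In an application, `theta_true` would be replaced by the estimate $\hat{\theta}$, and the resulting matrix could be inverted to obtain standard errors.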

The following proposition shows the consistency of the OPG estimate. Its proof is similar to that of Proposition 3 and hence omitted.

Proposition 5.

Assume Assumptions 1–8. Then, $\sup_{x_{0}\in\mathcal{X}}|\hat{I}_{x_{0}}(\hat{\theta})-I(\theta^{*})|\to_{p}0$ and $\hat{I}_{\xi}(\hat{\theta})\to_{p}I(\theta^{*})$ for any $\hat{\theta}$ such that $\hat{\theta}\to_{p}\theta^{*}$ and any $\xi$ .

6 Simulation

As an illustration, we provide a small simulation study based on Hamilton’s model (3) and the MS-CD model (4) with the Weibull distribution. The simulation was conducted with an R package we developed for Markov regime switching models. (Footnote 4: The R package is available at https://github.com/chiyahn/rMSWITCH.)

6.1 Hamilton’s model

We generate 1000 data sets of sample sizes $n=200,400$ , and $800$ from model (3) with $p=5$ , using the parameter value taken from Table I of Hamilton (1989) with $\theta=(\mu_{1},\mu_{2},\gamma_{1},\gamma_{2},\gamma_{3},\gamma_{4},\sigma,p_{11},p_{22})^{\prime}=(1.522,-0.3577,0.014,-0.058,-0.247,-0.213,0.7690,0.9049,0.7550)^{\prime}$ . (Footnote 5: We simulate $(800+n)$ periods and use the last $n$ observations as our sample, so that the initial value for our data set is approximately drawn from the stationary distribution.) For each data set, we estimate the parameter $\theta$ together with the initial distribution of $X_{0}$ , $\xi$ . Panel A of Table 1 reports the frequency at which the 95 percent confidence interval constructed from (18) contains the true parameter value. The asymptotic 95 percent confidence intervals slightly undercover the true parameter at $n=200$ , but the actual coverage probability approaches 95 percent as the sample size increases from $n=200$ to $400$ , and then to $800$ . Panel B of Table 1 presents the coverage probabilities when we use the estimator (17) by setting $x_{0}=2$ rather than (18). Consistent with our theoretical derivation, the results in Panel B of Table 1 are similar to those in Panel A of Table 1, suggesting that the choice of the initial value of $x_{0}$ in constructing the covariance matrix estimate does not affect the coverage probabilities.
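The data-generating step with a burn-in period can be sketched as follows. This is a hedged illustration rather than the code used for the study: it assumes that model (3) has the familiar form $Y_{k}=\mu_{X_{k}}+\sum_{t=1}^{4}\gamma_{t}(Y_{k-t}-\mu_{X_{k-t}})+\sigma\varepsilon_{k}$ with standard normal $\varepsilon_{k}$, starts the chain in the first regime with zero lagged deviations, and uses a hypothetical function name and seed. Only the parameter values and the 800-period burn-in are taken from the text.

```python
import random

# Parameter values quoted in the text.
MU = (1.522, -0.3577)                        # mu_1, mu_2
GAMMA = (0.014, -0.058, -0.247, -0.213)      # gamma_1, ..., gamma_4
SIGMA = 0.7690
P_STAY = (0.9049, 0.7550)                    # p_11, p_22


def simulate_hamilton(n, burn_in=800, seed=0):
    """Simulate burn_in + n periods and keep the last n observations."""
    rng = random.Random(seed)
    x = 0                                # index 0/1 stands for regime 1/2 (assumed start)
    dev_hist = [0.0] * len(GAMMA)        # past deviations Y_t - mu_{X_t}, most recent first
    ys = []
    for _ in range(burn_in + n):
        # Stay in the current regime with probability p_xx, otherwise switch.
        x = x if rng.random() < P_STAY[x] else 1 - x
        dev = sum(g * d for g, d in zip(GAMMA, dev_hist)) + SIGMA * rng.gauss(0.0, 1.0)
        ys.append(MU[x] + dev)
        dev_hist = [dev] + dev_hist[:-1]
    return ys[-n:]


sample = simulate_hamilton(2000)
print(len(sample), sum(sample) / len(sample))
```

Discarding the first 800 draws reduces the influence of the arbitrary starting regime and zero initial deviations on the retained sample.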

6.2 MS-CD model

We generate 1000 data sets of sample sizes $n=200,400$ , and $800$ from the MS-CD model (4), using the parameter value $\theta=(\mu_{1},\mu_{2},\beta,\gamma,p_{11},p_{22},\Pr(X_{0}=1))^{\prime}=(0.5,1.2,0.05,0.95,0.95,0.95,0.5)$ , and examine the coverage probabilities of the asymptotic 95 percent confidence intervals. Panel A of Table 2 presents the coverage probabilities based on (18) and Panel B presents those based on (17) by setting $x_{0}=2$ . The coverage probability improves as the sample size increases from $n=200$ to $n=800$ in both Panel A and Panel B. The results in Panels A and B are similar, indicating that the choice of the initial value of $x_{0}$ does not affect the confidence intervals.

7 Empirical application: Duration between stock price changes

We estimate the MS-CD model (4) by using duration data taken from De Luca and Gallo (2004) on the FIAT stock traded on the Milan Stock Exchange between May 2, 2000 and May 15, 2000, where the duration is defined as the time between every price change. We use their “adjusted durations,” which remove the daily seasonal component as well as exclude overnight durations between the first price change of a day and the last price change of the previous day. See De Luca and Gallo (2004) for more details on the construction of their adjusted durations.

We estimate the MS-CD model for $M=2$ and $3$ . The regimes are ordered from the smallest to the largest in terms of the estimated values of $\mu_{X_{k}}$ . For the model with $M=3$ , we restrict some elements of the transition probability matrix so that $\Pr(X_{k}=3|X_{k-1}=1)=\Pr(X_{k}=1|X_{k-1}=3)=0$ . (Footnote 6: When the model is estimated without restricting the transition probabilities, the estimated transition probabilities between the first and third regimes are close to zero.)

Table 3 reports the parameter estimates and their standard errors constructed from (18) for models with $M=2$ and $3$ . For both $M=2$ and $3$ , the estimated values of $\mu_{X_{k}}$ are well separated across regimes given the relatively small standard errors. The estimated values of $\gamma$ are $0.987$ and $1.005$ for the models with $M=2$ and $3$ , respectively, providing some evidence that the density function is unbounded for the model with $M=2$ .
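The link between $\gamma<1$ and an unbounded density can be seen by evaluating a Weibull density near zero. The sketch below assumes that $\gamma$ plays the role of the Weibull shape parameter and uses the textbook parametrization with unit scale, which may differ from the parametrization in model (4); the evaluation grid is an arbitrary choice.

```python
import math


def weibull_pdf(x, shape, scale=1.0):
    """Weibull density (shape/scale) * (x/scale)^(shape-1) * exp(-(x/scale)^shape), x > 0."""
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))


grid = (1e-10, 1e-100, 1e-200)
density_m2 = [weibull_pdf(x, 0.987) for x in grid]  # shape estimate reported for M = 2
density_m3 = [weibull_pdf(x, 1.005) for x in grid]  # shape estimate reported for M = 3
print(density_m2)
print(density_m3)
```

With a shape below one the density grows without bound as the argument approaches zero, although very slowly when the shape is as close to one as 0.987; with a shape above one it shrinks toward zero instead.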

The upper panel of Figure 1 shows the posterior probabilities of being in each regime for the model with $M=2$ for the first 3000 observations, where the solid red line represents the “more frequent price changes” regime (Regime 1), while the dotted blue line represents the “less frequent price changes” regime (Regime 2). Reflecting the high persistence of latent regimes, the posterior probabilities of being in each regime are either close to zero or one continuously over a prolonged period; the FIAT stock is in Regime 2 from the 1200th to 1900th observations and then switches to Regime 1 until the 2700th observation. As reported in the lower panel of Figure 1, when the number of regimes is specified as $M=3$ , the FIAT stock is the least frequently traded (Regime 3) from the 1200th to 1800th observations and most frequently traded (Regime 1) from the 1900th to 2100th observations as well as from the 2200th to 2500th observations.
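As a rough illustration of how regime probabilities of this kind can be computed, the sketch below runs a forward filter for a three-regime model whose transition matrix has zeros between the first and third regimes. It is a simplified stand-in under stated assumptions: durations are taken to be exponential with regime-specific means rather than following the MS-CD model (4), all parameter values are made up, and the function returns filtered (not smoothed) probabilities. The names `exp_density` and `filtered_probabilities` are hypothetical.

```python
import math


def exp_density(y, mean):
    """Exponential density with the given mean (toy regime-specific density)."""
    return math.exp(-y / mean) / mean


def filtered_probabilities(durations, means, trans, init):
    """Forward filter: P(regime at k | observations up to k) and the log-likelihood."""
    n_regimes = len(means)
    prior = list(init)  # predictive distribution of the current regime
    path = []
    loglik = 0.0
    for y in durations:
        joint = [prior[j] * exp_density(y, means[j]) for j in range(n_regimes)]
        total = sum(joint)
        posterior = [v / total for v in joint]
        path.append(posterior)
        loglik += math.log(total)
        # One-step-ahead prediction through the transition matrix.
        prior = [
            sum(posterior[i] * trans[i][j] for i in range(n_regimes))
            for j in range(n_regimes)
        ]
    return path, loglik


# Illustrative values only; zeros encode the restriction between regimes 1 and 3.
trans = [
    [0.95, 0.05, 0.00],
    [0.025, 0.95, 0.025],
    [0.00, 0.05, 0.95],
]
means = [0.5, 1.2, 3.0]
probs, loglik = filtered_probabilities(
    [0.1, 0.1, 5.0, 6.0], means, trans, [1 / 3, 1 / 3, 1 / 3]
)
```

Because the filter only multiplies by entries of the transition matrix, zero entries require no special handling at this step.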

Appendix A Proofs

Throughout these proofs, define $\overline{\bf V}_{b}^{a}:=(\overline{\bf{Y}}_{b}^{a},{\bf W}_{b}^{a})$ .

Proof of Lemma 1.

This lemma is an immediate consequence of Lemmas 6 and 7 when $-m+p\leq k\leq n$ . When $k<-m+p$ , this lemma holds because $\|\mu_{1}-\mu_{2}\|_{TV}\leq 1$ for any probability measures $\mu_{1}$ and $\mu_{2}$ . ∎

Proof of Lemma 2.

In view of (8), the stated result holds if there exist constants $\rho\in(0,1)$ and $M<\infty$ and a random sequence $\{b_{k}\}$ with $\mathbb{P}_{\theta^{*}}(b_{k}\geq M\text{ i.o.})=0$ such that, for $k=1,\ldots,n$ ,

[TABLE]

because $b_{+}(\overline{\bf Y}_{k-1}^{k},W_{k})/b_{-}(\overline{\bf Y}_{k-1}^{k},W_{k})<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s. from Assumption 4.

First, it follows from $p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k},x_{0})=\int g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})\mathbb{P}_{\theta}(dx_{k}|x_{0},\overline{\bf Y}^{k-1}_{0},{\bf W}_{0}^{k})$, $p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k})=\int g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})\mathbb{P}_{\theta}(dx_{k}|\overline{\bf Y}^{k-1}_{0},{\bf W}_{0}^{k})$, and Assumption 4(a) that $p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k},x_{0}),p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k})\in[b_{-}(\overline{\bf Y}_{k-1}^{k},W_{k}),b_{+}(\overline{\bf Y}_{k-1}^{k},W_{k})]$ uniformly in $\theta\in\Theta$ and $x_{0}\in\mathcal{X}$. Hence, from the inequality $|\log x-\log y|\leq|x-y|/(x\wedge y)$, we have, for $k=1,\ldots,n$,

[TABLE]

This gives the first bound in (19).
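The elementary inequality used above can be verified in one line. As a brief sketch, take $x\geq y>0$ without loss of generality and use $\log(1+u)\leq u$ for $u\geq 0$:

$$|\log x-\log y|=\log\left(1+\frac{x-y}{y}\right)\leq\frac{x-y}{y}=\frac{|x-y|}{x\wedge y}.$$

The case $y>x>0$ follows by symmetry.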

We proceed to derive the second bound in (19). Using a derivation similar to (49) and noting that $X_{k}$ is independent of $W_{k}$ given $X_{k-1}$ gives, for any $-m+p\leq k\leq n$ ,

[TABLE]

Consequently, for any $-m+p\leq k\leq n$ ,

[TABLE]

Furthermore,

[TABLE]

Combining (22), (23), and (24) for $m=0$ and applying Lemma 1 and the property of the total variation distance gives that, for any $p\leq k\leq n$ and uniformly in $x_{0}\in\mathcal{X}$ ,

[TABLE]

Furthermore, (22) and (23) imply that, for any $k\geq p$, $(p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k},x_{0})\wedge p_{\theta}(Y_{k}|\overline{\bf Y}_{0}^{k-1},{\bf W}_{0}^{k}))\geq\inf_{x_{k}^{\prime},x_{k-p}\in\mathcal{X}}p_{\theta}(x_{k}^{\prime}|x_{k-p},\overline{\bf Y}^{k-1}_{k-p},{\bf W}_{k-p}^{k-1})\int g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})\mu(dx_{k})$. Therefore, it follows from $|\log x-\log y|\leq|x-y|/(x\wedge y)$, (25), and (51) and the subsequent argument that, for $p\leq k\leq n$,

[TABLE]

We first bound $\prod_{i=1}^{\lfloor(k-p)/p\rfloor}(1-\omega(\overline{\bf V}_{pi-p}^{pi-1}))$ on the right hand side of (26). Fix $\epsilon\in(0,1/8]$ . Because $\omega(\overline{\bf V}_{t-p}^{t-1})>0$ for all $\overline{\bf V}_{t-p}^{t-1}\in\mathcal{Y}^{p+s-1}\times\mathcal{W}^{p}$ from Assumption 3 (note that $\omega(\overline{\bf V}_{t-p}^{t-1})=\sigma_{-}/\sigma_{+}>0$ when $p=1$ ), there exists $\rho\in(0,1)$ such that $\mathbb{P}_{\theta^{*}}(1-\omega(\overline{\bf V}_{t-p}^{t-1})\geq\rho)\leq\epsilon$ . Define $I_{i}:=\mathbb{I}\{1-\omega(\overline{\bf V}_{pi-p}^{pi-1})\geq\rho\}$ ; then, we have $\mathbb{E}_{\theta^{*}}[I_{i}]\leq\epsilon$ and $1-\omega(\overline{\bf V}_{pi-p}^{pi-1})\leq\rho^{1-I_{i}}$ . Consequently, with $a_{k}:=\rho^{-\sum_{i=1}^{\lfloor(k-p)/p\rfloor}I_{i}}$ ,

[TABLE]
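The indicator bound behind this display can be checked case by case. The following is a short sketch in the notation above, writing $N:=\lfloor(k-p)/p\rfloor$ as a shorthand that does not appear in the original. If $I_{i}=0$, then $1-\omega(\overline{\bf V}_{pi-p}^{pi-1})<\rho=\rho^{1-I_{i}}$; if $I_{i}=1$, then $1-\omega(\overline{\bf V}_{pi-p}^{pi-1})\leq 1=\rho^{1-I_{i}}$ because $\omega(\cdot)>0$. Multiplying over $i=1,\ldots,N$ gives

$$\prod_{i=1}^{N}(1-\omega(\overline{\bf V}_{pi-p}^{pi-1}))\leq\rho^{N-\sum_{i=1}^{N}I_{i}}=\rho^{N}a_{k}.$$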

Because $\overline{\bf V}^{t-1}_{t-p}$ is stationary and ergodic, it follows from the strong law of large numbers that $(\lfloor(k-p)/p\rfloor)^{-1}\sum_{i=1}^{\lfloor(k-p)/p\rfloor}I_{i}\to\mathbb{E}_{\theta^{*}}[I_{i}]\leq\epsilon$ $\mathbb{P}_{\theta^{*}}$ -a.s. as $k\to\infty$ . Therefore, $a_{k}$ is bounded as

[TABLE]

We then bound $1/\omega(\overline{\bf V}_{k-p}^{k-1})$ on the right hand side of (26). First, we consider the case $p\geq 2$ . Define $b_{-}^{+}(\overline{\bf Y}^{i}_{i-1},W_{i}):=b_{+}(\overline{\bf Y}^{i}_{i-1},W_{i})/b_{-}(\overline{\bf Y}^{i}_{i-1},W_{i})$ and $C_{3}:=(\sigma_{-}/\sigma_{+})C_{1}^{-2(p-1)}$ with $C_{1}$ defined in Assumption 5. It follows from the definition of $\omega(\cdot)$ that $\omega(\overline{\bf V}_{k-p}^{k-1})\geq(\sigma_{-}/\sigma_{+})\prod_{i=k-p+1}^{k-1}b^{+}_{-}(\overline{\bf Y}^{i}_{i-1},W_{i})^{-2}=C_{3}C_{1}^{2(p-1)}\prod_{i=k-p+1}^{k-1}b^{+}_{-}(\overline{\bf Y}^{i}_{i-1},W_{i})^{-2}$ . In view of $\rho\in(0,1)$ , there exists a finite and positive constant $C_{4}$ such that $\rho^{\epsilon}=e^{-2\alpha(p-1)C_{4}}$ with $\alpha$ defined in Assumption 5. Then,

[TABLE]

Observe that, if $X_{1},\ldots,X_{\ell}$ are identically distributed, we have $P(X_{1}\cdots X_{\ell}\geq A)\leq P(\{X_{1}\geq A^{1/\ell}\}\cup\{X_{2}\geq A^{1/\ell}\}\cup\cdots\cup\{X_{\ell}\geq A^{1/\ell}\})\leq\sum_{i=1}^{\ell}P(X_{i}\geq A^{1/\ell})=\ell P(X_{i}\geq A^{1/\ell})$ . Therefore, (29) is bounded by $(p-1)\mathbb{P}_{\theta^{*}}(b^{+}_{-}(\overline{\bf Y}_{k-1}^{k},W_{k})\geq C_{1}e^{\alpha C_{4}\lfloor(k-p)/p\rfloor})$ . From Assumption 5, this is no larger than $(p-1)C_{2}(C_{4}\lfloor(k-p)/p\rfloor)^{-\beta}$ for $k\geq 2p$ , and $\mathbb{P}_{\theta^{*}}(\omega(\overline{\bf V}_{k-p}^{k-1})\leq C_{3}\rho^{\epsilon\lfloor(k-p)/p\rfloor}\text{ i.o.})=0$ follows from the Borel-Cantelli lemma. When $p=1$ , we have $\mathbb{P}_{\theta^{*}}(\omega(\overline{\bf V}_{k-p}^{k-1})\leq C_{3}\rho^{\epsilon\lfloor(k-p)/p\rfloor}\text{ i.o.})=0$ because $\omega(\overline{\bf V}_{k-p}^{k-1})=\sigma_{-}/\sigma_{+}$ . Substituting this bound and (27) and (28) into (26) gives, for $p\leq k\leq n$ ,

[TABLE]

where $\mathbb{P}_{\theta^{*}}(b_{k}\geq M\text{ i.o.})=0$ for a constant $M<\infty$ .
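The union-bound step for products used in this argument, $P(X_{1}\cdots X_{\ell}\geq A)\leq\ell P(X_{i}\geq A^{1/\ell})$, can be checked exactly on a small discrete example. The sketch below is illustrative only: it assumes nonnegative variables that are independent as well as identically distributed (independence is not needed for the bound but makes enumeration easy), and the helper name `product_tail_vs_union_bound` and the chosen distribution are hypothetical.

```python
from fractions import Fraction
from itertools import product


def product_tail_vs_union_bound(support, ell, c):
    """Return (P(X_1 * ... * X_ell >= c**ell), ell * P(X >= c)) exactly.

    support maps each nonnegative value to its probability (a Fraction).
    The threshold is passed as c = A**(1/ell) to avoid floating-point roots.
    """
    threshold = c ** ell
    lhs = Fraction(0)
    for combo in product(support.items(), repeat=ell):
        prob = Fraction(1)
        value = 1
        for v, p in combo:
            prob *= p
            value *= v
        if value >= threshold:
            lhs += prob
    rhs = ell * sum(p for v, p in support.items() if v >= c)
    return lhs, rhs


support = {1: Fraction(1, 2), 2: Fraction(3, 10), 4: Fraction(1, 5)}
lhs, rhs = product_tail_vs_union_bound(support, ell=3, c=4)
```

The bound rests on the observation that, for nonnegative factors, the product can reach $A$ only if at least one factor reaches $A^{1/\ell}$.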

The right hand side of (30) gives the second bound in (19) because $(1-3\epsilon)\lfloor(k-p)/p\rfloor\geq\lfloor(k-p)/p\rfloor/2\geq\lfloor(k-p)/2p\rfloor\geq\lfloor k/3p\rfloor$ , where the last inequality holds because, for any numbers $a,b>0$ and $k\geq 0$ ,

[TABLE]

Therefore, (19) holds, and the stated result is proven.
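The chain of floor inequalities used in the last step can be sanity-checked numerically. The sketch below is a brute-force check over a finite grid, not a proof; it reads $(k-p)/2p$ as $(k-p)/(2p)$ and $k/3p$ as $k/(3p)$, and the function name `floor_chain_holds` is hypothetical.

```python
from fractions import Fraction


def floor_chain_holds(k, p, eps=Fraction(1, 8)):
    """Check (1-3*eps)*N >= N/2 >= floor((k-p)/(2p)) >= floor(k/(3p)), N = floor((k-p)/p)."""
    n_blocks = (k - p) // p
    middle = (k - p) // (2 * p)
    right = k // (3 * p)
    first = (1 - 3 * eps) * n_blocks >= Fraction(n_blocks, 2)
    second = Fraction(n_blocks, 2) >= middle
    third = middle >= right
    return first and second and third


# Exhaustive check on a finite grid with k >= p, matching the range p <= k <= n.
all_hold = all(
    floor_chain_holds(k, p) for p in range(1, 7) for k in range(p, 500)
)
```

The last inequality is the only delicate one: for $k<3p$ the right side is zero, and for $k\geq 3p$ it follows from $(k-p)/(2p)\geq k/(3p)$.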

∎

Proof of Lemma 3.

The proof uses a similar argument to the proof of Lemma 3 in DMR and the proof of Lemma 2. We first show part (a) for $-m+p\leq k\leq n$ . Using a similar argument to (22) and (25) in conjunction with Lemma 1 gives

[TABLE]

where $\delta(\cdot)$ denotes the Dirac delta function, and the first equality uses the fact $\mathbb{P}_{\theta}(X_{k-p}\in\cdot|X_{-m},\overline{\bf Y}^{k-1}_{-m^{\prime}},{\bf W}_{-m^{\prime}}^{k})=\mathbb{P}_{\theta}(X_{k-p}\in\cdot|X_{-m},\overline{\bf Y}^{k-1}_{-m},{\bf W}_{-m}^{k})$ , which is proven as (21).

Furthermore, (22) and (23) imply that, for any $k\geq-m+p$ , $(p_{\theta}(Y_{k}|\overline{\bf Y}_{-m}^{k-1},{\bf W}_{-m}^{k},x_{-m})\wedge p_{\theta}(Y_{k}|\overline{\bf Y}_{-m^{\prime}}^{k-1},{\bf W}_{-m^{\prime}}^{k},x_{-m^{\prime}}))\geq\inf_{x_{k}^{\prime},x_{k-p}\in\mathcal{X}}p_{\theta}(x_{k}^{\prime}|x_{k-p},\overline{\bf Y}^{k-1}_{k-p},{\bf W}_{k-p}^{k-1})\int g_{\theta}(Y_{k}|\overline{\bf Y}_{k-1},x_{k},W_{k})\mu(dx_{k})$ . Therefore, it follows from the inequality $|\log x-\log y|\leq|x-y|/(x\wedge y)$ that

[TABLE]

Proceeding as in (27)–(30) in the proof of Lemma 2, we find that there exist $\rho\in(0,1)$ and $\epsilon\in(0,1/8]$ such that the right hand side of (33) is bounded by $\rho^{(1-2\epsilon)\lfloor(k-p+m)/p\rfloor}\rho^{-\epsilon\lfloor(k-p)/p\rfloor}B_{k,m}$ , where $\mathbb{P}_{\theta^{*}}(B_{k,m}\geq M\text{ i.o.})=0$ for a constant $M<\infty$ . Therefore, part (a) is proven for $-m+p\leq k\leq n$ by noting that $\rho^{-\epsilon\lfloor(k-p)/p\rfloor}\leq\rho^{-\epsilon\lfloor(k-p+m)/p\rfloor}$ and using the argument following (30). Part (a) holds for $1\leq k\leq-m+p-1$ because $|\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-m},{\bf W}_{-m}^{k},X_{-m}=x)-\log p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-m^{\prime}},{\bf W}_{-m^{\prime}}^{k},X_{-m^{\prime}}=x^{\prime})|$ is bounded by $b_{-}^{+}(\overline{\bf Y}_{k-1}^{k},W_{k})$ , which is finite $\mathbb{P}_{\theta^{*}}$ -a.s. Part (b) follows from replacing $\mathbb{P}_{\theta}(dx_{-m}|X_{-m^{\prime}}=x^{\prime},\overline{\bf Y}^{k-1}_{-m^{\prime}},{\bf W}_{-m^{\prime}}^{k})$ in (32) with $\mathbb{P}_{\theta}(dx_{-m}|\overline{\bf Y}^{k-1}_{-m},{\bf W}_{-m}^{k})$ . Part (c) follows from $b_{-}(\overline{\bf Y}_{k-1}^{k},W_{k})\leq p_{\theta}(Y_{k}|\overline{\bf Y}^{k-1}_{-m},{\bf W}_{-m}^{k},X_{-m}=x)\leq b_{+}(\overline{\bf Y}_{k-1}^{k},W_{k})$ and Assumption 4. ∎

Proof of Proposition 1.

The proof follows the argument of the proof of Proposition 2 and Theorem 1 in DMR. From Property 24.2 of Gourieroux and Monfort (1995, page 385), the stated result holds if (i) $\Theta$ is compact, (ii) $l_{n}(\theta,x_{0})$ is continuous uniformly in $x_{0}\in\mathcal{X}$ , (iii) $\sup_{x_{0}\in\mathcal{X}}\sup_{\theta\in\Theta}|n^{-1}l_{n}(\theta,x_{0})-l(\theta)|\to 0$ $\mathbb{P}_{\theta^{*}}$ -a.s., and (iv) $l(\theta)$ is uniquely maximized at $\theta^{*}$ .

(i) follows from Assumption 1(a). (ii) follows from Assumption 6(a). In view of Lemma 2 and the compactness of $\Theta$ , (iii) holds if, for all $\theta\in\Theta$ ,

[TABLE]

Noting that $l_{n}(\theta)=\sum_{k=1}^{n}\Delta_{k,0}(\theta)$ , the left hand side of (34) is bounded by $A+B+C$ , where

[TABLE]

Fix $x\in\mathcal{X}$. Setting $m=0$ and letting $m^{\prime}\to\infty$ in Lemma 3(a)(b) shows that $\sup_{\theta\in\Theta}|\Delta_{k,0}(\theta)-\Delta_{k,\infty}(\theta)|\leq\sup_{\theta\in\Theta}|\Delta_{k,0}(\theta)-\Delta_{k,0,x}(\theta)|+\sup_{\theta\in\Theta}|\Delta_{k,0,x}(\theta)-\Delta_{k,\infty}(\theta)|\leq 2A_{k,0}\rho^{\lfloor k/3p\rfloor}$, while $\sup_{\theta\in\Theta}|\Delta_{k,0}(\theta)-\Delta_{k,0,x}(\theta)|+\sup_{\theta\in\Theta}|\Delta_{k,0,x}(\theta)-\Delta_{k,\infty}(\theta)|\leq 4B_{k}$ follows from Lemma 3(c). Consequently, $A=0$ $\mathbb{P}_{\theta^{*}}$-a.s. $B$ is bounded by, from the ergodic theorem and Lemma 8,

[TABLE]

$C=0$ $\mathbb{P}_{\theta^{*}}$-a.s. by the ergodic theorem, and hence (iii) holds. For (iv), observe that $\mathbb{E}_{\theta^{*}}|\log p_{\theta}(Y_{1}|\overline{\bf{Y}}_{-m}^{0},{\bf W}_{-m}^{1})|<\infty$ from Lemma 3(c). Therefore, for any $m$, $\mathbb{E}_{\theta^{*}}[\log p_{\theta}(Y_{1}|\overline{\bf{Y}}_{-m}^{0},{\bf W}_{-m}^{1})]$ is uniquely maximized at $\theta^{*}$ from Lemma 2.2 of Newey and McFadden (1994) and Assumption 6(b). Then, (iv) follows because $\mathbb{E}_{\theta^{*}}[\log p_{\theta}(Y_{1}|\overline{\bf{Y}}_{-m}^{0},{\bf W}_{-m}^{1})]$ converges to $l(\theta)$ uniformly in $\theta$ as $m\to\infty$ from Lemma 3 and the dominated convergence theorem. Hence, the stated result is proven. ∎

Proof of Corollary 1.

Observe that $|n^{-1}l_{n}(\theta,\xi)-l(\theta)|\leq\sup_{x_{0}\in\mathcal{X}}|n^{-1}l_{n}(\theta,x_{0})-l(\theta)|$ because $\inf_{x_{0}\in\mathcal{X}}l_{n}(\theta,x_{0})\leq l_{n}(\theta,\xi)\leq\sup_{x_{0}\in\mathcal{X}}l_{n}(\theta,x_{0})$. Furthermore, $l_{n}(\theta,\xi)$ is continuous in $\theta$ from the continuity of $l_{n}(\theta,x_{0})$. Therefore, the stated result follows from the proof of Proposition 1. ∎

Proof of Lemma 4.

The proof follows the argument of the proof of Lemma 13 in DMR. When $(k,m)=(1,0)$ , the stated result follows from $\Psi_{1,0,x}^{j}(\theta)=\mathbb{E}_{\theta^{*}}[\phi^{j}_{\theta 1}|\overline{\bf V}_{0},X_{0}=x]$ , $\Psi_{1,0}^{j}(\theta)=\mathbb{E}_{\theta^{*}}[\phi^{j}_{\theta 1}|\overline{\bf V}_{0}]$ , $\sup_{\theta\in G}|\phi^{j}_{\theta k}|\leq|\phi^{j}_{k}|_{\infty}$ , and Assumption 7. Henceforth, assume $(k,m)\neq(1,0)$ so that $k+m\geq 2$ .

For part (a), it follows from Lemma 11(a)–(e) that

[TABLE]

where $\Omega_{t-1,-m}:=\prod_{i=1}^{\lfloor(t-1+m)/p\rfloor}(1-\omega(\overline{\bf V}_{-m+pi-p}^{-m+pi-1}))$ and $\tilde{\Omega}_{t,k-1}:=\prod_{i=1}^{\lfloor(k-1-t)/p\rfloor}(1-\omega(\overline{\bf V}_{k-2-pi+1}^{k-2-pi+p}))$ as defined in the paragraph preceding Lemma 11. As shown on page 2294 of DMR, we have $\max_{-m\leq t^{\prime}\leq k}|\phi_{t^{\prime}}^{j}|_{\infty}\leq\sum_{t=-m}^{k}(|t|\vee 1)^{2}|\phi_{t}^{j}|_{\infty}/(|t|\vee 1)^{2}\leq 2(k\vee m)^{2}[\sum_{t=-\infty}^{\infty}|\phi_{t}^{j}|_{\infty}/(|t|\vee 1)^{2}]\leq(k+m)^{2}K_{j}$ with $K_{j}\in L^{3-j}(\mathbb{P}_{\theta^{*}})$.

We proceed to bound $\sum_{t=-m+1}^{k}(\Omega_{t-1,-m}\wedge\tilde{\Omega}_{t,k-1})$ on the right hand side of (35). Similar to the proof of Lemma 2, fix $\epsilon\in(0,1/8p(p+1)]$ ; then, there exists $\rho\in(0,1)$ such that $\mathbb{P}_{\theta^{*}}(1-\omega(\overline{\bf V}_{k-p}^{k-1})\geq\rho)\leq\epsilon$ . Define $I_{p,i}:=\sum_{t=0}^{(p-2)_{+}}\mathbb{I}\{1-\omega(\overline{\bf V}_{t+i}^{t+i+p-1})\geq\rho\}$ and $\nu_{b}^{a}:=\sum_{i=b}^{a}I_{p,i}$ . Observe that (recall we define $\prod_{i=c}^{d}x_{i}=1$ when $c>d$ )

[TABLE]

where the second inequality follows from $\lfloor x\rfloor-\lfloor y\rfloor\geq\lfloor x-y\rfloor$ , $(\lfloor x/p\rfloor)_{+}=\lfloor x_{+}/p\rfloor$ , $s+p(\lfloor(b-s)/p\rfloor+1)-p\geq b-p$ , and $s+p\lfloor(a-s)/p\rfloor-1\leq a-1$ . Similarly, we obtain

[TABLE]

because $k-2-p\lfloor(k-1-b)/p\rfloor+1\geq b$ and $k-2-p(\lfloor(k-1-a)/p\rfloor+1)+p\leq a+p-1$ . By applying (36) to $\Omega_{t-1,-m}$ with $a=t-1,b=s=-m$ , applying (37) to $\tilde{\Omega}_{t,k-1}$ with $a=k-1$ and $b=t$ , and using (31) and $-m+1\leq t\leq k$ , we obtain

[TABLE]

Observe that, for any $\rho\in(0,1)$ , $c>0$ and any integers $a<b$ ,

[TABLE]

From (39), $\sum_{t=-m+1}^{k}(\Omega_{t-1,-m}\wedge\tilde{\Omega}_{t,k-1})$ is bounded by $2(p+1)\rho^{\lfloor(k+m)/2(p+1)\rfloor-\nu_{-m-p}^{k}}/(1-\rho)$ . Because $\overline{\bf V}_{i}^{i+p-1}$ is stationary and ergodic, it follows from the strong law of large numbers that $(\lfloor(k+m)/2(p+1)\rfloor)^{-1}\nu_{-m-p}^{k}\to 2(p+1)\mathbb{E}_{\theta^{*}}[I_{p,i}]\leq 2p(p+1)\epsilon$ $\mathbb{P}_{\theta^{*}}$ -a.s. as $k+m\to\infty$ . In view of $\epsilon<1/8p(p+1)$ , we have $\mathbb{P}_{\theta^{*}}(\rho^{\lfloor(k+m)/2(p+1)\rfloor-\nu_{-m-p}^{k}}\geq\rho^{\lfloor(k+m)/2(p+1)\rfloor/2}\text{ i.o.})=0$ . Henceforth, let $\{b_{k,m}\}_{k\geq 1,m\leq 0}$ denote a generic nonnegative random sequence such that $\mathbb{P}_{\theta^{*}}(b_{k,m}\geq M\text{ i.o.})=0$ for a finite constant $M$ . With this notation and the fact that $\lfloor(k+m)/2(p+1)\rfloor/2\geq\lfloor(k+m)/4(p+1)\rfloor$ , $\sum_{t=-m+1}^{k}(\Omega_{t-1,-m}\wedge\tilde{\Omega}_{t,k-1})$ is bounded by

[TABLE]

and part (a) is proven.

For part (b), it follows from (13) and Lemma 11(a)–(e) that

[TABLE]

The first term on the right hand side is bounded by $(k+m)^{2}K_{j}\rho^{\lfloor(k+m)/4(p+1)\rfloor}b_{k,m}$ with $K_{j}\in L^{3-j}(\mathbb{P}_{\theta^{*}})$ from the same argument as the proof of part (a). For the second term on the right hand side, write $\tilde{\Omega}_{t,k-1}$ as $\tilde{\Omega}_{t,k-1}=\tilde{\Omega}_{-m,k-1}\tilde{\Omega}_{t,k-1}^{-m}$ , where $\tilde{\Omega}_{t,k-1}^{-m}:=\prod_{i=\lfloor(k-1+m)/p\rfloor+1}^{\lfloor(k-1-t)/p\rfloor}(1-\omega(\overline{\bf V}_{k-2-pi+1}^{k-2-pi+p}))$ . By applying (37) to $\tilde{\Omega}_{t,k-1}^{-m}$ with $a=-m$ and $b=t$ , we obtain $\tilde{\Omega}_{t,k-1}^{-m}\leq\rho^{\lfloor(-m-t)/p\rfloor-\nu_{t}^{-m}}$ . In conjunction with $\tilde{\Omega}_{t,k-1}^{-m}\leq 1$ , the second term on the right hand side is bounded by $2\tilde{\Omega}_{-m,k-1}R_{m,m^{\prime}}$ , where

[TABLE]

From a similar argument to (38)–(40), we can bound $\tilde{\Omega}_{-m,k-1}$ as $\tilde{\Omega}_{-m,k-1}\leq\rho^{\lfloor(k+m)/4(p+1)\rfloor}b_{k,m}$ . It follows from $(-m-t)^{-1}\nu_{t}^{-m}\to\mathbb{E}_{\theta^{*}}[I_{p,i}]\leq p\epsilon$ $\mathbb{P}_{\theta^{*}}$ -a.s. as $t+m\to-\infty$ that $\mathbb{P}_{\theta^{*}}(d_{t,m}\geq\rho^{\lfloor(-m-t)/p\rfloor/2}\text{ i.o.})=0$ . Furthermore, $|\phi_{t}^{j}|_{\infty}$ satisfies $\mathbb{P}_{\theta^{*}}(|\phi_{t}^{j}|_{\infty}\geq\rho^{-\lfloor(-m-t)/p\rfloor/4}\text{ i.o.})=0$ from Markov’s inequality and the Borel-Cantelli lemma. Therefore, $\mathbb{P}_{\theta^{*}}(d_{t,m}|\phi_{t}^{j}|_{\infty}\geq\rho^{\lfloor(-m-t)/p\rfloor/4}\text{ i.o.})=0$ . In conjunction with $0\leq d_{t,m}|\phi_{t}^{j}|_{\infty}<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s., we obtain $\overline{R}_{m}:=\sup_{m^{\prime}\geq m}R_{m,m^{\prime}}<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s., and the distribution of $\overline{R}_{m}$ does not depend on $m$ because $\overline{\bf V}_{t}$ is stationary. Therefore, part (b) is proven by setting $B_{m}=\overline{R}_{m}$ . ∎

Proof of Proposition 2.

By setting $m=0$ and letting $m^{\prime}\to\infty$ in Lemma 4, we obtain $\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Psi_{k,0,x}^{1}(\theta)-\Psi_{k,\infty}^{1}(\theta)|\leq(K_{1}+B_{0})k^{2}\rho^{\lfloor k/4(p+1)\rfloor}A_{k,0}$. Furthermore, the sum over finitely many $\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Psi_{k,0,x}^{1}(\theta)-\Psi_{k,\infty}^{1}(\theta)|$ is $o(n^{1/2})$ $\mathbb{P}_{\theta^{*}}$-a.s. because $\mathbb{E}_{\theta^{*}}[\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Psi_{k,0,x}^{1}(\theta)|]<\infty$ and $\mathbb{E}_{\theta^{*}}[\sup_{\theta\in G}|\Psi_{k,\infty}^{1}(\theta)|]<\infty$ from Assumption 8. Therefore, we have $n^{-1/2}\nabla_{\theta}l_{n}(\theta^{*},x_{0})=n^{-1/2}\sum_{k=1}^{n}\Psi_{k,0,x_{0}}^{1}(\theta^{*})=n^{-1/2}\sum_{k=1}^{n}\Psi_{k,\infty}^{1}(\theta^{*})+o_{p}(1)$.

Because $\{\Psi_{k,\infty}^{1}(\theta^{*})\}_{k=-\infty}^{\infty}$ is a stationary, ergodic, and square integrable martingale difference sequence, it follows from a martingale difference central limit theorem (McLeish, 1974, Theorem 2.3) that $n^{-1/2}\sum_{k=1}^{n}\Psi_{k,\infty}^{1}(\theta^{*})\to_{d}N(0,I(\theta^{*}))$ , and part (a) follows. For part (b), let $p_{n\theta}(x_{0})$ denote $p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x_{0})$ , and observe that

[TABLE]

Therefore, $\min_{x_{0}}\nabla_{\theta}l_{n}(\theta^{*},x_{0})\leq\nabla_{\theta}l_{n}(\theta^{*},\xi)\leq\max_{x_{0}}\nabla_{\theta}l_{n}(\theta^{*},x_{0})$ holds, and part (b) follows. ∎

Proof of Lemma 5.

The proof follows the argument of the proof of Lemma 17 in DMR and the proof of Lemma 4. Fix $\epsilon\in(0,1/32p(p+1)]$ and choose $\rho\in(0,1)$ as in the proof of Lemma 4. When $(k,m)=(1,0)$ , the stated result follows from $\sup_{\theta\in G}|\phi_{\theta k}|\leq|\phi_{k}|_{\infty}$ . Henceforth, assume $(k,m)\neq(1,0)$ so that $k+m\geq 2$ . For $a\leq b$ , define $S_{a}^{b}:=\sum_{t=a}^{b}\phi_{\theta t}$ . Let $\{b_{k,m}\}_{k\geq 1,m\leq 0}$ denote a generic nonnegative random sequence such that $\mathbb{P}_{\theta^{*}}(b_{k,m}\geq M\text{ i.o.})=0$ for a finite constant $M$ . We prove part (a) first. Write $\Gamma_{k,m,x}(\theta)-\Gamma_{k,m}(\theta)=A+2B+C$ , where

[TABLE]

From Lemma 11(f)–(k), $A$ is bounded as

[TABLE]

From equation (46) of DMR on page 2299, we have $\max_{-m+1\leq s\leq t\leq k-1}|\phi_{t}|_{\infty}|\phi_{s}|_{\infty}\leq(m^{3}+k^{3})\sum_{t=-\infty}^{\infty}|\phi_{t}|_{\infty}^{2}/(|t|\vee 1)^{2}\leq(k+m)^{3}K$ for $K\in L^{1}(\mathbb{P}_{\theta^{*}})$.

We proceed to bound $\Omega_{s-1,-m}\wedge\Omega_{t-1,s}\wedge\tilde{\Omega}_{t,k-1}$ . By using the argument in (36)-(38), we obtain

[TABLE]

Furthermore, a derivation similar to DMR (page 2299) gives, for $n\geq 2$ ,

[TABLE]

From (39), the right hand side is bounded by, for a generic positive constant $\mathcal{C}$ that may take different values at different places,

[TABLE]

where the inequality holds because $\sum_{t=a}^{\infty}\rho^{\lfloor t/b\rfloor}\leq b\rho^{\lfloor a/b\rfloor}/(1-\rho)$ for any integers $a\geq 0$ and $b>0$ . Hence, $A$ is bounded by $K(k+m)^{3}\rho^{\lfloor(k+m)/4(p+1)\rfloor}b_{k,m}$ by setting $n=k+m$ in (42) and noting that $(\lfloor(k+m)/4(p+1)\rfloor)^{-1}\nu_{-m-p}^{k}\to 4(p+1)\mathbb{E}_{\theta^{*}}[I_{p,i}]\leq 4p(p+1)\epsilon<1/2$ $\mathbb{P}_{\theta^{*}}$ -a.s. as $k+m\to\infty$ . For $B$ , from Lemma 11(f)–(i), (36), (38), $t\geq-m$ , and (39), $B$ is bounded as, with $M_{k}:=\max_{-m+1\leq t\leq k-1}|\phi_{k}|_{\infty}|\phi_{t}|_{\infty}$ ,

[TABLE]

which is written as $K(k+m)^{3}\rho^{\lfloor(k+m)/4(p+1)\rfloor}b_{k,m}$ for $K\in L^{1}(\mathbb{P}_{\theta^{*}})$ . $C$ is bounded by $6\Omega_{k-1,-m}|\phi_{k}|_{\infty}^{2}$ from Lemma 11(h), and part (a) is proven.
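The geometric-sum inequality invoked in bounding $A$, namely $\sum_{t=a}^{\infty}\rho^{\lfloor t/b\rfloor}\leq b\rho^{\lfloor a/b\rfloor}/(1-\rho)$ for integers $a\geq 0$ and $b>0$, admits a short counting argument, sketched here for completeness. Each value $j=\lfloor t/b\rfloor$ is attained by at most $b$ integers $t\geq a$, and $t\geq a$ implies $j\geq\lfloor a/b\rfloor$. Hence

$$\sum_{t=a}^{\infty}\rho^{\lfloor t/b\rfloor}\leq\sum_{j=\lfloor a/b\rfloor}^{\infty}b\rho^{j}=\frac{b\rho^{\lfloor a/b\rfloor}}{1-\rho}.$$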

We proceed to prove part (b). Write $\Gamma_{k,m^{\prime},x^{\prime}}(\theta)=A+2B+2C+D$ , where

[TABLE]

$|\Gamma_{k,m,x}(\theta)-A|$ is bounded similarly to $|\Gamma_{k,m,x}(\theta)-\Gamma_{k,m}(\theta)|$ in part (a) by using Lemma 11. From Lemma 11(g), $B$ is bounded by $2\sum_{t=-m^{\prime}+1}^{-m}\Omega_{k-1,t}|\phi_{k}|_{\infty}|\phi_{t}|_{\infty}=B_{1}\times B_{2}$ , where

[TABLE]

$B_{1}$ is bounded by $|\phi_{k}|_{\infty}\rho^{\lfloor(k+m)/2(p+1)\rfloor}b_{k,m}$ from the same argument as part (a). Because $\mathbb{P}_{\theta^{*}}(|\phi_{k}|_{\infty}\geq\rho^{-\lfloor(k+m)/2(p+1)\rfloor/2}\text{ i.o.})=0$ , $B_{1}$ is bounded by $\rho^{\lfloor(k+m)/4(p+1)\rfloor}b_{k,m}$ . For $B_{2}$ , because $\prod_{i=1}^{\lfloor(-m-t)/p\rfloor}(1-\omega(\overline{\bf V}_{t+pi-p}^{t+pi-1}))$ is bounded by $\rho^{\lfloor(-m-t)/p\rfloor-\nu_{t-p}^{-m}}$ from (36), we can use the same argument as the one for $R_{m,m^{\prime}}$ defined in (41) to show that $\overline{B}_{2m}:=\sup_{m^{\prime}\geq m}B_{2}<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s. and $\overline{B}_{2m}$ is stationary. Therefore, $B$ is bounded by $\rho^{\lfloor(k+m)/4(p+1)\rfloor}b_{k,m}\overline{B}_{2m}$ .

$|C|+|D|$ is bounded by, with $\Delta_{t,s}:=|\text{cov}_{\theta}[\phi_{\theta t},\phi_{\theta s}|\overline{\bf V}^{k}_{-m^{\prime}},X_{-m^{\prime}}=x^{\prime}]-\text{cov}_{\theta}[\phi_{\theta t},\phi_{\theta s}|\overline{\bf V}^{k-1}_{-m^{\prime}},X_{-m^{\prime}}=x^{\prime}]|$ ,

[TABLE]

Similar to (38), we obtain

[TABLE]

Therefore, the right hand side of (43) is bounded by

[TABLE]

DMR (page 2300) show that the following holds for $k\geq 1$ , $m\geq 0$ and $t,s\leq 0$ :

[TABLE]

Consequently, (44) is bounded by

[TABLE]

where $E:=\sum_{t=-\infty}^{\infty}\rho^{\lfloor(|t|-1)/4(p+1)\rfloor}|\phi_{t}|_{\infty}$ , and $F_{m,m^{\prime}}:=\sum_{s=-m^{\prime}+1}^{-m}\rho^{\lfloor(-m-s)/8(p+1)\rfloor-\nu_{s-p}^{-m}}|\phi_{s}|_{\infty}$ . Because $E\in L^{1}(\mathbb{P}_{\theta^{*}})$ , $\overline{F}_{m}:=\sup_{m^{\prime}\geq m}F_{m,m^{\prime}}<\infty$ $\mathbb{P}_{\theta^{*}}$ -a.s., and $\overline{F}_{m}$ is stationary, (44) is bounded by $\rho^{\lfloor(k+m)/16(p+1)\rfloor}b_{k,m}E\overline{F}_{m}$ , and part (b) is proven. ∎

Proof of Proposition 3.

Define $\Upsilon_{k,m,x}(\theta):=\Psi_{k,m,x}^{2}(\theta)+\Gamma_{k,m,x}(\theta)$ and $\Upsilon_{k,\infty}(\theta):=\Psi_{k,\infty}^{2}(\theta)+\Gamma_{k,\infty}(\theta)$ , so that $\nabla_{\theta}^{2}l_{n}(\theta,x)=\sum_{k=1}^{n}\Upsilon_{k,0,x}(\theta)$ . By setting $m=0$ and letting $m^{\prime}\to\infty$ in Lemmas 4 and 5, we obtain $\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Upsilon_{k,0,x}(\theta)-\Upsilon_{k,\infty}(\theta)|\leq(K_{2}+B_{0})k^{2}\rho^{\lfloor k/4(p+1)\rfloor}A_{k,0}+K(k^{3}+D_{0})\rho^{\lfloor k/16(p+1)\rfloor}C_{k,0}$ . Furthermore, the sum over finitely many $\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Upsilon_{k,0,x}(\theta)-\Upsilon_{k,\infty}(\theta)|$ is $o(n)$ $\mathbb{P}_{\theta^{*}}$ -a.s. because $\mathbb{E}_{\theta^{*}}\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Upsilon_{k,0,x}(\theta)|<\infty$ and $\mathbb{E}_{\theta^{*}}\sup_{\theta\in G}|\Upsilon_{k,\infty}(\theta)|<\infty$ from Assumption 8. Therefore, we have $\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|n^{-1}\nabla_{\theta}^{2}l_{n}(\theta,x)-n^{-1}\sum_{k=1}^{n}\Upsilon_{k,\infty}(\theta)|=o_{p}(1)$ .

Consequently, it suffices to show that

[TABLE]

Because $G$ is compact, (45) holds if, for all $\theta\in G$ ,

[TABLE]

(46) holds by the ergodic theorem. Note that the left hand side of (47) is bounded by $\lim_{\delta\to 0}\lim_{n\to\infty}n^{-1}\sum_{k=1}^{n}\sup_{|\theta^{\prime}-\theta|\leq\delta}|\Upsilon_{k,\infty}(\theta^{\prime})-\Upsilon_{k,\infty}(\theta)|$, which equals $\lim_{\delta\to 0}\mathbb{E}_{\theta^{*}}\sup_{|\theta^{\prime}-\theta|\leq\delta}|\Upsilon_{0,\infty}(\theta^{\prime})-\Upsilon_{0,\infty}(\theta)|$ $\mathbb{P}_{\theta^{*}}$-a.s. from the ergodic theorem. Therefore, (47) holds if

[TABLE]

Fix a point $x_{0}\in\mathcal{X}$ . The left hand side of (48) is bounded by $2A_{m}+C_{m}$ , where

[TABLE]

From Lemmas 4 and 5, $\sup_{\theta\in G}|\Upsilon_{0,m,x_{0}}(\theta)-\Upsilon_{0,\infty}(\theta)|\to_{p}0$ as $m\to\infty$ . Furthermore, we have $\mathbb{E}_{\theta^{*}}\sup_{m\geq 1}\sup_{\theta\in G}|\Upsilon_{0,m,x_{0}}(\theta)|<\infty$ and $\mathbb{E}_{\theta^{*}}\sup_{\theta\in G}|\Upsilon_{0,\infty}(\theta)|<\infty$ from Assumption 8. Therefore, $A_{m}\to 0$ as $m\to\infty$ by the dominated convergence theorem (Durrett, 2010, Exercise 2.3.7). $C_{m}=0$ from Lemma 12 if $m\geq p$ . Therefore, (48) holds, and the stated result is proven. ∎

Proof of Proposition 4.

In view of (11) and Propositions 1, 2, and 3, part (a) holds if (i) $\mathbb{E}_{\theta^{*}}[\Psi_{0,\infty}^{2}(\theta)+\Gamma_{0,\infty}(\theta)]$ is continuous in $\theta\in G$ and (ii) $\mathbb{E}_{\theta^{*}}[\Psi_{0,\infty}^{2}(\theta^{*})+\Gamma_{0,\infty}(\theta^{*})]=-I(\theta^{*})$ . (i) follows from (48). For (ii), it follows from the Louis information principle and information matrix equality that, for all $m\geq 1$ , $\mathbb{E}_{\theta^{*}}[\Psi_{0,m}^{1}(\theta^{*})(\Psi_{0,m}^{1}(\theta^{*}))^{\prime}]=-\mathbb{E}_{\theta^{*}}[\Psi_{0,m}^{2}(\theta^{*})+\Gamma_{0,m}(\theta^{*})]$ . From Lemmas 4 and 5, Assumption 8, and the dominated convergence theorem, the left hand side converges to $\mathbb{E}_{\theta^{*}}[\Psi_{0,\infty}^{1}(\theta^{*})(\Psi_{0,\infty}^{1}(\theta^{*}))^{\prime}]=I(\theta^{*})$ , and the right hand side converges to $-\mathbb{E}_{\theta^{*}}[\Psi_{0,\infty}^{2}(\theta^{*})+\Gamma_{0,\infty}(\theta^{*})]$ . Therefore, (ii) holds, and part (a) is proven.

For part (b), an elementary calculation gives, with $p_{n\theta}(x)$ denoting $p_{\theta}({\bf Y}_{1}^{n}|\overline{\bf Y}_{0},{\bf W}_{0}^{n},x)$ ,

[TABLE]

The sum of the last two terms is $o_{p}(1)$ because $\sup_{x\in\mathcal{X}}\sup_{\theta\in G}|n^{-1/2}\nabla_{\theta}\log p_{n\theta}(x)-n^{-1/2}\sum_{k=1}^{n}\Psi_{k,\infty}^{1}(\theta)|=o_{p}(1)$ . Therefore, for any $\xi$ on $\mathcal{B}(\mathcal{X})$ , we have $\sup_{x_{0}\in\mathcal{X}}\sup_{\theta\in G}|n^{-1}\nabla_{\theta}^{2}l_{n}(\theta,\xi)-n^{-1}\nabla_{\theta}^{2}l_{n}(\theta,x_{0})|=o_{p}(1)$ holds, and part (b) follows. ∎

Appendix B Auxiliary results

Lemma 1 of DMR derives the minorization condition (Rosenthal, 1995) on the conditional hidden Markov chain when $p=1$ and the covariate $W_{k}$ is absent. The following lemma generalizes Lemma 1 of DMR to accommodate $p\geq 2$ and the covariate $W_{k}$. (Footnote 7: We replace the conditioning variable $\overline{\bf Y}_{m}^{n}$ in DMR with $\overline{\bf Y}_{-m}^{n}$, because the subsequent analysis uses $\overline{\bf Y}_{-m}^{n}$.) When $p\geq 2$, the minorization coefficient $\omega(\cdot)$ depends on $(\overline{\bf Y}_{k-p}^{k-1},{\bf W}_{k-p}^{k-1})$ because $\overline{\bf Y}_{k-p}^{k-1}$ provides information on $X_{k}$ in addition to the information provided by $X_{k-p}$.

Lemma 6.

Assume Assumptions 1–3. Let $m,n\in\mathbb{Z}$ with $-m\leq n$ . Then, the following holds for all $\theta\in\Theta$ ; (a) under $\mathbb{P}_{\theta}$ , conditionally on $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n})$ , $\{X_{k}\}_{k=-m}^{n}$ is an inhomogeneous Markov chain, and (b) for all $-m+p\leq k\leq n$ , there exists a function $\mu_{k,\theta}(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n},A)$ such that

(i) For any $A\in\mathcal{B}(\mathcal{X})$, $\mu_{k,\theta}(\cdot,\cdot,A)$ is a Borel measurable function defined on $\mathcal{Y}^{n-k+s+1}\times\mathcal{W}^{n-k+1}$;

(ii) For any $(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n})$, $\mu_{k,\theta}(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n},\cdot)$ is a probability measure on $\mathcal{B}(\mathcal{X})$. Furthermore, $\mu_{k,\theta}(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n},\cdot)$ is absolutely continuous with respect to $\mu$ for all $(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n})$, and, for all $(\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})$,

[TABLE]

with $\omega(\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1})$ defined in (10).

Proof.

The proof uses a similar argument to the proof of Lemma 1 in DMR. Because $\{Z_{k}\}_{k=-m}^{n}$ is a Markov chain given $\{W_{k}\}_{k=-m}^{n}$ , we have, for $-m<k\leq n$ ,

[TABLE]

Therefore, $\{X_{k}\}_{k=-m}^{n}$ conditional on $(\overline{\bf Y}^{n}_{-m},{\bf W}_{-m}^{n})$ is an inhomogeneous Markov chain, and part (a) follows.

We proceed to prove part (b). Observe that if $-m+p\leq k\leq n$ ,

[TABLE]

because the left hand side of (49) can be written as

[TABLE]

The equality (49) holds even when the conditioning variable ${\bf w}_{k-p}^{n}$ on the right hand side is replaced with ${\bf w}_{k-p+1}^{n}$ , but we use ${\bf w}_{k-p}^{n}$ for notational simplicity. Write the right hand side of (49) as

[TABLE]

When $p=1$ , we have $p_{\theta}(x_{k}|x_{k-p},\overline{\bf y}^{k-1}_{k-p},{\bf w}_{k-p}^{k-1})=q_{\theta}(x_{k-1},x_{k})\in[\sigma_{-},\sigma_{+}]$ . Therefore, the stated result follows with $\mu_{k,\theta}(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n},A)$ defined as

[TABLE]

Note that $\int_{\mathcal{X}}p_{\theta}({\bf y}_{k}^{n}|X_{k}=x,\overline{\bf y}_{k-1},{\bf w}_{k}^{n})\mu(dx)>0$ from Assumption 3.

When $p\geq 2$ , a lower bound on $p_{\theta}(x_{k}|x_{k-p},\overline{\bf y}^{k-1}_{k-p},{\bf w}_{k-p}^{k-1})$ is obtained as

[TABLE]

Similarly, an upper bound on $p_{\theta}(x_{k}|x_{k-p},\overline{\bf y}_{k-p}^{k-1},{\bf w}_{k-p}^{k-1})$ is given by

[TABLE]

Therefore, the stated result holds with $\mu_{k,\theta}(\overline{\bf y}_{k-1}^{n},{\bf w}_{k}^{n},A)$ defined in (50). ∎

The following lemma provides the convergence rate of a Markov chain $X_{t}$. When $X_{t}$ is time-homogeneous, this result is Theorem 1 of Rosenthal (1995). This lemma extends Rosenthal (1995) to time-inhomogeneous $X_{t}$.

Lemma 7.

Let $\{X_{t}\}_{t\geq 1}$ be a Markov process that lies in $\mathcal{X}$ , and let $P_{t}(x,A):=\mathbb{P}(X_{t}\in A|X_{t-1}=x)$ . Suppose there is a probability measure $Q_{t}(\cdot)$ on $\mathcal{X}$ , a positive integer $p$ , and $\varepsilon_{t}\geq 0$ such that

[TABLE]

for all $x\in\mathcal{X}$ and all measurable subsets $A\subset\mathcal{X}$ . Let $X_{0}$ and $Y_{0}$ be chosen from the initial distributions $\pi_{1}$ and $\pi_{2}$ , respectively, and update them according to $P_{t}(x,A)$ . Then,

[TABLE]

Proof.

The proof follows the line of argument in the proof of Theorem 1 of Rosenthal (1995). Starting from $(X_{0},Y_{0})$, we let $X_{t}$ and $Y_{t}$ for $t\geq 1$ progress as follows. Given the values of $X_{t}$ and $Y_{t}$, flip a coin with the probability of heads equal to $\varepsilon_{t+p}$. If the coin comes up heads, then choose a point $x\in\mathcal{X}$ according to $Q_{t+p}(\cdot)$ and set $X_{t+p}=Y_{t+p}=x$, choose $(X_{t+1},\ldots,X_{t+p-1})$ and $(Y_{t+1},\ldots,Y_{t+p-1})$ independently according to the transition kernels $P_{t+1}(x_{t+1}|x_{t}),\ldots,P_{t+p-1}(x_{t+p-1}|x_{t+p-2})$ conditional on $X_{t+p}=x$ and $Y_{t+p}=x$, and update the processes after $t+p$ so that they remain equal at all future times. If the coin comes up tails, then choose $X_{t+p}$ and $Y_{t+p}$ independently according to the distributions $(P_{t+p}^{p}(X_{t},\cdot)-\varepsilon_{t+p}Q_{t+p}(\cdot))/(1-\varepsilon_{t+p})$ and $(P_{t+p}^{p}(Y_{t},\cdot)-\varepsilon_{t+p}Q_{t+p}(\cdot))/(1-\varepsilon_{t+p})$, respectively, and choose $(X_{t+1},\ldots,X_{t+p-1})$ and $(Y_{t+1},\ldots,Y_{t+p-1})$ independently according to the transition kernels $P_{t+1}(x_{t+1}|x_{t}),\ldots,P_{t+p-1}(x_{t+p-1}|x_{t+p-2})$ conditional on the values of $X_{t+p}$ and $Y_{t+p}$. It is easily checked that $X_{t}$ and $Y_{t}$ are each marginally updated according to the transition kernel $P_{t}(x,A)$.

Furthermore, $X_{t}$ and $Y_{t}$ couple at the first time (call it $T$) at which $X_{t+p}$ and $Y_{t+p}$ are both drawn from $Q_{t+p}(\cdot)$ as described above. It now follows from the coupling inequality that

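$$\left\|\mathbb{P}(X_{k}\in\cdot)-\mathbb{P}(Y_{k}\in\cdot)\right\|_{TV}\leq\mathbb{P}(X_{k}\neq Y_{k})\leq\mathbb{P}(T>k).$$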

By construction, when $t$ is a multiple of $p$ , $X_{t}$ and $Y_{t}$ will couple with probability $\varepsilon_{t}$ . Hence,

$$\mathbb{P}(T>k)\leq\prod_{i=1}^{\lfloor k/p\rfloor}(1-\varepsilon_{ip}),$$

and the stated result follows. ∎
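The coupling argument can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical three-state, time-inhomogeneous chain whose one-step kernels contain zeros, so that no one-step minorization constant is positive while the two-step kernels are strictly positive. It takes $\varepsilon_{t}=\sum_{y}\min_{x}P_{t}^{p}(x,y)$ as one valid minorization constant and compares the total variation distance $\sup_{A}|\mathbb{P}(X_{k}\in A)-\mathbb{P}(Y_{k}\in A)|$ with the product $\prod_{i=1}^{\lfloor k/p\rfloor}(1-\varepsilon_{ip})$. The function names `kernel`, `p_step_kernel`, `minorization_constant`, and `coupling_bound_table` are assumptions made for this illustration.

```python
import numpy as np

def kernel(t):
    """Hypothetical one-step kernel P_t on three states; note the zero entries."""
    a = 0.5 + 0.3 * np.sin(t)
    b = 0.5 + 0.3 * np.cos(t)
    c = 0.5 + 0.3 * np.sin(2 * t)
    return np.array([[a, 1 - a, 0.0],
                     [0.0, b, 1 - b],
                     [1 - c, 0.0, c]])

def p_step_kernel(t, p):
    """P_t^p(x, .): law of X_t given X_{t-p} = x, i.e. P_{t-p+1} ... P_t."""
    out = np.eye(3)
    for s in range(t - p + 1, t + 1):
        out = out @ kernel(s)
    return out

def minorization_constant(t, p):
    """One valid eps_t: the sum over y of min over x of P_t^p(x, y)."""
    return p_step_kernel(t, p).min(axis=0).sum()

def tv_distance(mu, nu):
    """Total variation distance sup_A |mu(A) - nu(A)| on a finite state space."""
    return 0.5 * np.abs(mu - nu).sum()

def coupling_bound_table(p=2, horizon=12):
    """Return (k, TV distance at time k, product bound) for k = 1..horizon."""
    mu = np.array([1.0, 0.0, 0.0])   # pi_1: start in state 0
    nu = np.array([0.0, 0.0, 1.0])   # pi_2: start in state 2
    rows = []
    for k in range(1, horizon + 1):
        mu, nu = mu @ kernel(k), nu @ kernel(k)
        eps = [minorization_constant(i * p, p) for i in range(1, k // p + 1)]
        bound = float(np.prod([1.0 - e for e in eps]))  # empty product is 1
        rows.append((k, tv_distance(mu, nu), bound))
    return rows
```

Calling `coupling_bound_table(p=2)` returns rows in which the distance never exceeds the bound, whereas `minorization_constant(1, 1)` is zero for this kernel, which is why a block length $p\geq 2$ is needed in such a case.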

The following lemma corresponds to Lemma 4 of DMR and implies that $\mathbb{E}_{\theta^{*}}\left[\Delta_{0,\infty}(\theta)\right]$ is continuous in $\theta$ . This lemma is used in the proof of the consistency of the MLE.

Lemma 8.

Assume Assumptions 1–6. Then, for all $\theta\in\Theta$ ,

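$$\lim_{\delta\to 0}\mathbb{E}_{\theta^{*}}\left[\sup_{|\theta^{\prime}-\theta|\leq\delta}\left|\Delta_{0,\infty}(\theta^{\prime})-\Delta_{0,\infty}(\theta)\right|\right]=0.$$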

Proof.

The proof is similar to the proof of Lemma 4 in DMR but requires a small adjustment when $p\geq 2$ . We first show that $\Delta_{0,m,x}(\theta)$ is continuous in $\theta$ for any fixed $x\in\mathcal{X}$ and any $m\geq p+1$ . Recall that $\Delta_{0,m,x}(\theta)=\log p_{\theta}(Y_{0}|\overline{\bf Y}^{-1}_{-m},{\bf W}^{0}_{-m},X_{-m}=x)$ and

[TABLE]

For $j\in\{-1,0\}$ , we have

[TABLE]

Because the integrand is bounded by $(\sigma_{+}^{0})^{m+j}\prod_{i=-m+1}^{j}b_{+}(\overline{\bf Y}_{i-1}^{i},W_{i})$ , $p_{\theta}({\bf Y}^{j}_{-m+1}|\overline{\bf Y}_{-m},X_{-m}=x,{\bf W}^{j}_{-m})$ is continuous in $\theta$ $\mathbb{P}_{\theta^{*}}$ -a.s. by the continuity of $q_{\theta}$ and $g_{\theta}$ and the bounded convergence theorem. Furthermore, when $j\geq-m+p$ , the infimum of the right hand side of (52) in $\theta$ is strictly positive $\mathbb{P}_{\theta^{*}}$ -a.s. from Assumptions 1(d) and 3. Therefore, $\Delta_{0,m,x}(\theta)$ is continuous in $\theta$ $\mathbb{P}_{\theta^{*}}$ -a.s. Because $\{\Delta_{0,m,x}(\theta)\}$ is continuous in $\theta$ and converges uniformly in $\theta\in\Theta$ $\mathbb{P}_{\theta^{*}}$ -a.s., $\Delta_{0,\infty}(\theta)$ is continuous in $\theta\in\Theta$ $\mathbb{P}_{\theta^{*}}$ -a.s. The stated result then follows from $\mathbb{E}_{\theta^{*}}\sup_{\theta\in\Theta}|\Delta_{0,\infty}(\theta)|<\infty$ by Lemma 3(c) and the dominated convergence theorem. ∎

The following lemma corresponds to Lemma 9 of DMR and derives the minorization constant for the time-reversed process $\{X_{n-k}\}_{0\leq k\leq n+m}$ conditional on $(\overline{\bf Y}^{n}_{-m},{\bf W}^{n}_{-m})$.

Lemma 9.

Assume Assumptions 1 and 2. Let $m,n\in\mathbb{Z}$ with $-m\leq n$. Then, the following holds for all $\theta\in\Theta$: (a) under $\mathbb{P}_{\theta}$, conditionally on $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n})$, the time-reversed process $\{X_{n-k}\}_{0\leq k\leq n+m}$ is an inhomogeneous Markov chain, and (b) for all $p\leq k\leq n+m$, there exists a function $\tilde{\mu}_{k,\theta}(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},A)$ such that

(i) For any $A\in\mathcal{B}(\mathcal{X})$, $\tilde{\mu}_{k,\theta}(\cdot,\cdot,A)$ is a Borel measurable function defined on $\mathcal{Y}^{n-k+p+m+s-1}\times\mathcal{W}^{n-k+p+m}$;

(ii) For any $(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},A)$, $\tilde{\mu}_{k,\theta}(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},\cdot)$ is a probability measure on $\mathcal{B}(\mathcal{X})$. Furthermore, $\tilde{\mu}_{k,\theta}(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},\cdot)$ is absolutely continuous with respect to $\mu$ for all $(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1})$, and, for all $(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1})$,

[TABLE]

where $\omega(\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1}):=\sigma_{-}/\sigma_{+}$ when $p=1$ , and, when $p\geq 2$ , $\omega(\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1})$ is defined as in (10) but replacing $k-1$ and $k-p$ in (10) with $n-k+p-1$ and $n-k$ .

Proof.

The proof is similar to the proof of Lemma 6. Because the time-reversed process $\{Z_{n-k}\}_{0\leq k\leq n+m}$ is Markov conditional on ${\bf W}_{-m}^{n}$ and ${\bf Z}_{-m}^{n-k+1}$ is independent of ${\bf W}_{n-k+2}^{n}$ given ${\bf W}_{-m}^{n-k+1}$ , we have, for $1\leq k\leq n+m$ ,

[TABLE]

Therefore, $\{X_{n-k}\}_{0\leq k\leq n+m}$ is an inhomogeneous Markov chain given $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n})$ , and part (a) follows.

For part (b), because (i) the time-reversed process $\{Z_{n-k}\}_{0\leq k\leq n+m}$ is Markov conditional on ${\bf W}_{-m}^{n}$ , (ii) $Y_{n-k+p}$ is independent of ${\bf X}_{-m}^{n-k+p-1}$ given $(X_{n-k+p},\overline{\bf Y}_{-m}^{n-k+p-1},{\bf W}_{-m}^{n})$ , (iii) $X_{n-k+p}$ is independent of the other random variables given $X_{n-k+p-1}$ , and (iv) $W_{n-k+p}$ is independent of ${\bf Z}_{-m}^{n-k+p-1}$ given ${\bf W}_{-m}^{n-k+p-1}$ , we have, for $1\leq k\leq n+m$ ,

[TABLE]

Observe that in view of $n-k\geq-m$ ,

[TABLE]

It follows that

[TABLE]

where $G_{\theta}(x,X_{n-k+p},\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1}):=p_{\theta}(X_{n-k+p}|X_{n-k}=x,\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1})\times p_{\theta}(X_{n-k}=x,{\bf y}_{-m+1}^{n-k+p-1},{\bf w}_{-m+1}^{n-k+p-1}|\overline{\bf y}_{-m},w_{-m})$.

When $p=1$ , we have $p_{\theta}(X_{n-k+p}|X_{n-k}=x,\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1})=p_{\theta}(X_{n-k+1}|X_{n-k}=x)\in[\sigma_{-},\sigma_{+}]$ . Therefore, the stated result follows with $\tilde{\mu}_{k,\theta}(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},A)$ defined as

[TABLE]

Note that $\int_{\mathcal{X}}p_{\theta}(X_{n-k}=x,{\bf y}_{-m+1}^{n-k+p-1},{\bf w}_{-m+1}^{n-k+p-1}|\overline{\bf y}_{-m},w_{-m})\mu(dx)>0$ from Assumption 3.

When $p\geq 2$ , it follows from a derivation similar to (51) that $p_{\theta}(x_{n-k+p}|x_{n-k},\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1})$ is bounded from below by

[TABLE]

where $H:=\inf_{\theta}\inf_{{\bf x}^{n-k+p-1}_{n-k+1}}\prod_{i=n-k+1}^{n-k+p-1}g_{\theta}(y_{i}|\overline{\bf y}_{i-1},x_{i},w_{i})/\sup_{\theta}\sup_{{\bf x}^{n-k+p-1}_{n-k+1}}\prod_{i=n-k+1}^{n-k+p-1}g_{\theta}(y_{i}|\overline{\bf y}_{i-1},x_{i},w_{i})$ , and an upper bound on $p_{\theta}(x_{n-k+p}|x_{n-k},\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1})$ is given by

$\sup_{\theta}\sup_{x_{n-k},x_{n-k+p}}q_{\theta}^{p}(x_{n-k},x_{n-k+p})/H$ . Therefore, the stated result holds with $\tilde{\mu}_{k}$ defined in (54). ∎
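The $p=1$ case of this minorization can be illustrated numerically. The sketch below is a simplification and not the paper's setting: it assumes a plain two-state hidden Markov model with Gaussian observations, no lagged $Y$ in the density, no covariates $W$, and a strictly positive transition matrix `q`, with $\sigma_{-}$ and $\sigma_{+}$ taken to be the smallest and largest entries of `q`. In that case the backward kernel of the time-reversed chain is proportional to the filtered probability of $X_{t}$ times $q(x_{t},x_{t+1})$, so it is bounded below by $(\sigma_{-}/\sigma_{+})$ times the filtering distribution, which plays the role of $\tilde{\mu}_{k,\theta}$. All function names and parameter values are hypothetical.

```python
import numpy as np

def forward_filter(q, lik, init):
    """alpha[t, x] = P(X_t = x | y_0, ..., y_t) for a finite-state HMM."""
    T, K = lik.shape
    alpha = np.empty((T, K))
    alpha[0] = init * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ q) * lik[t]
        alpha[t] /= alpha[t].sum()
    return alpha

def backward_kernel(q, alpha_t):
    """B[x_next, x] = P(X_t = x | X_{t+1} = x_next, y_0..y_t),
    proportional to alpha_t[x] * q[x, x_next]."""
    joint = alpha_t[:, None] * q          # joint[x, x_next]
    return (joint / joint.sum(axis=0)).T

def simulate_example(T=25, seed=0):
    """Hypothetical two-regime model: Gaussian observations with means -1 and 1."""
    rng = np.random.default_rng(seed)
    q = np.array([[0.9, 0.1], [0.2, 0.8]])
    means = np.array([-1.0, 1.0])
    x = np.zeros(T, dtype=int)
    for t in range(1, T):
        x[t] = rng.choice(2, p=q[x[t - 1]])
    y = means[x] + rng.standard_normal(T)
    # Gaussian density up to a constant; the filter renormalizes at each step.
    lik = np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2)
    return q, lik

def minorization_holds(q, lik, init):
    """Check B[x_next, x] >= (sigma_- / sigma_+) * alpha_t[x] at every t."""
    alpha = forward_filter(q, lik, init)
    ratio = q.min() / q.max()
    return all(
        np.all(backward_kernel(q, alpha[t]) >= ratio * alpha[t][None, :] - 1e-12)
        for t in range(len(alpha) - 1)
    )
```

With `q, lik = simulate_example()`, the call `minorization_holds(q, lik, np.array([0.5, 0.5]))` checks the lower bound at every time step. When `q` has zero entries the ratio is zero and the bound is vacuous, which is the situation the $p\geq 2$ argument addresses.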

The following lemma bounds the distance between the distributions of $X_{k}$ given $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n})$ and given $(\overline{\bf Y}_{-m}^{n-1},{\bf W}_{-m}^{n-1})$. It shows that the time-reversed process $\{X_{n-k}\}_{0\leq k\leq n+m}$ conditional on $(\overline{\bf Y}^{n}_{-m},{\bf W}^{n}_{-m})$ forgets its initial conditioning variables (i.e., $Y_{n}$ and $W_{n}$) exponentially fast. Part (b) corresponds to equation (39) on page 2294 of DMR.

Lemma 10.

Assume Assumptions 1 and 2. Let $m,n\in\mathbb{Z}$ with $m,n\geq 0$ and $\theta\in\Theta$. Then,

(a) for all $-m\leq k\leq n$ and all $(\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})$,

[TABLE]

(b) for all $-m+1\leq k\leq n$ and all $(\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n},x)$ ,

[TABLE]

Proof.

When $k\geq n-1$, the stated result holds trivially because $\prod_{i=1}^{j}a_{i}=1$ when $j<1$. We first show part (a) for $k\leq n-2$. Because the time-reversed process $\{Z_{n-k}\}_{0\leq k\leq n+m}$ is Markov conditional on ${\bf W}_{-m}^{n}$ and $W_{n}$ is independent of $Z_{n-1}$ given $W_{n-1}$, we have $\mathbb{P}_{\theta}(X_{k}\in\cdot|\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})=\int\mathbb{P}_{\theta}(X_{k}\in\cdot|x_{n-1},\overline{\bf y}_{-m}^{n-1},{\bf w}_{-m}^{n-1})\mathbb{P}_{\theta}(dx_{n-1}|\overline{\bf y}_{-m}^{n},{\bf w}_{-m}^{n})$. Similarly, we obtain $\mathbb{P}_{\theta}(X_{k}\in\cdot|\overline{\bf y}_{-m}^{n-1},{\bf w}_{-m}^{n-1})=\int\mathbb{P}_{\theta}(X_{k}\in\cdot|x_{n-1},\overline{\bf y}_{-m}^{n-1},{\bf w}_{-m}^{n-1})\mathbb{P}_{\theta}(dx_{n-1}|\overline{\bf y}_{-m}^{n-1},{\bf w}_{-m}^{n-1})$. It follows that

[TABLE]

Therefore, the stated result follows from applying Lemmas 9 and 7 to the time-reversed process $\{X_{n-i}\}_{i=1}^{n-k}$ conditional on $(\overline{\bf Y}_{-m}^{n-1},{\bf W}_{-m}^{n-1})$ .

For part (b) for $k\leq n-2$ , by using a similar argument to the proof of Lemma 9, we can show that (i) conditionally on $(\overline{\bf Y}_{-m}^{n},{\bf W}_{-m}^{n},X_{-m})$ , the time-reversed process $\{X_{n-k}\}_{0\leq k\leq n+m-1}$ is an inhomogeneous Markov chain, and (ii) for all $p\leq k\leq n+m-1$ , there exists a probability measure $\breve{\mu}_{k,\theta}(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},x,A)$ such that, for all $(\overline{\bf y}_{-m}^{n-k+p-1},{\bf w}_{-m}^{n-k+p-1},x)$ ,

[TABLE]

with the same $\omega(\overline{\bf y}_{n-k}^{n-k+p-1},{\bf w}_{n-k}^{n-k+p-1})$ as in Lemma 9. Therefore, the stated result follows from a similar argument to the proof of part (a). ∎
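The forgetting property can be illustrated in the simplest case. The sketch below is an assumption-laden illustration rather than the paper's model: it uses a plain two-state hidden Markov model with $p=1$, no lagged $Y$ in the density, no covariates, and a strictly positive transition matrix, so that $\omega=\sigma_{-}/\sigma_{+}$ is constant. It computes the smoothed distribution of $X_{k}$ with and without the last observation and compares their total variation distance (in the $\sup_{A}$ sense) with $(1-\sigma_{-}/\sigma_{+})^{n-1-k}$, the rate obtained by applying the coupling bound to the time-reversed chain over $n-1-k$ steps. The observation sequence and all function names are hypothetical.

```python
import numpy as np

def smoothed_marginal(q, lik, init, k):
    """P(X_k = . | y_0, ..., y_{T-1}) by forward-backward, where lik has T rows."""
    T, K = lik.shape
    alpha = init * lik[0]
    alpha /= alpha.sum()
    for t in range(1, k + 1):                 # forward pass up to time k
        alpha = (alpha @ q) * lik[t]
        alpha /= alpha.sum()
    beta = np.ones(K)
    for t in range(T - 1, k, -1):             # backward pass down to time k
        beta = q @ (lik[t] * beta)
        beta /= beta.sum()
    post = alpha * beta
    return post / post.sum()

def example_likelihoods(T=15):
    """Hypothetical deterministic observations and Gaussian regime densities."""
    y = 2.0 * np.sin(np.arange(T))
    means = np.array([-1.0, 1.0])
    return np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2)

def forgetting_table(q, lik, init):
    """Rows (k, TV distance, bound), comparing X_k given y_0..y_n and y_0..y_{n-1}."""
    n = lik.shape[0] - 1                      # index of the last observation
    rho = 1.0 - q.min() / q.max()             # 1 - sigma_- / sigma_+
    rows = []
    for k in range(n):                        # k = 0, ..., n - 1
        with_last = smoothed_marginal(q, lik, init, k)
        without_last = smoothed_marginal(q, lik[:-1], init, k)
        tv = 0.5 * np.abs(with_last - without_last).sum()
        rows.append((k, tv, rho ** (n - 1 - k)))
    return rows
```

For example, `forgetting_table(np.array([[0.7, 0.3], [0.4, 0.6]]), example_likelihoods(), np.array([0.5, 0.5]))` returns distances that shrink geometrically as $k$ moves away from $n$, each below the corresponding bound.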

The following lemma is used in the proofs of Lemmas 4 and 5. It provides bounds on the difference in the conditional expectations of $\phi_{\theta t}^{j}=\phi^{j}(\theta,\overline{\bf Z}_{t-1}^{t},W_{t})$ when the conditioning sets are different. Define $\Omega_{\ell,k}:=\prod_{i=1}^{\lfloor(\ell-k)/p\rfloor}(1-\omega(\overline{\bf V}_{k+pi-p}^{k+pi-1}))$ and $\tilde{\Omega}_{\ell,k}:=\prod_{i=1}^{\lfloor(k-\ell)/p\rfloor}(1-\omega(\overline{\bf V}_{k-1-pi+1}^{k-1-pi+p}))$, where we define $\prod_{i=a}^{b}x_{i}:=1$ if $b<a$, $\omega(\cdot)$ is defined in Lemma 6, and $\overline{\bf V}_{a}^{b}:=(\overline{\bf Y}_{a}^{b},{\bf W}_{a}^{b})$.
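To make the block structure of $\Omega_{\ell,k}$ concrete, the following minimal sketch evaluates the product for a user-supplied sequence of $\omega$ values. It is only an illustration: the callable `omega`, which maps the block index $i$ to the value of $\omega$ on the $i$-th block of length $p$, is a hypothetical stand-in for $\omega(\overline{\bf V}_{k+pi-p}^{k+pi-1})$, which in the paper depends on the data.

```python
import math

def omega_product(omega, ell, k, p):
    """Omega_{ell,k}: product over i = 1..floor((ell - k) / p) of (1 - omega(i)).

    `omega` is a hypothetical callable returning a value in [0, 1] for block i.
    The empty product (fewer than p periods between k and ell) equals 1.
    """
    n_blocks = (ell - k) // p          # floor division, as in the definition
    return math.prod(1.0 - omega(i) for i in range(1, n_blocks + 1))
```

With a constant value, as in the $p=1$ case where $\omega=\sigma_{-}/\sigma_{+}$, the product reduces to a geometric rate: `omega_product(lambda i: 0.25, 5, 2, 1)` equals $0.75^{3}$, while `omega_product(lambda i: 0.25, 7, 2, 2)` uses only $\lfloor 5/2\rfloor=2$ blocks and equals $0.75^{2}$.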

Lemma 11.

Assume Assumptions 1–7. Then, for all $m^{\prime}\geq m\geq 0$ , all $-m<s\leq t\leq n$ , all $\theta\in G$ , and all $x,x^{\prime}\in\mathcal{X}$ and $j=1,2$ ,

[TABLE]

and

[TABLE]

Proof of Lemma 11.

To prove parts (a)–(c), we first show that, for all $-m\leq k\leq t-1$ , all probability measures $\mu_{1}$ and $\mu_{2}$ on $\mathcal{B}(\mathcal{X})$ , and all $\overline{\bf V}_{-m}^{n}$ ,

[TABLE]

When $k=t-1$ , (55) holds trivially. When $-m\leq k<t-1$ , equation (49) and the Markov property of $Z_{t}$ imply that $\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|X_{k},\overline{\bf V}_{-m}^{n})=\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|X_{k},\overline{\bf V}_{k}^{n})=\int\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|X_{t-1}=x_{t-1},\overline{\bf V}_{k}^{n})p_{\theta}(x_{t-1}|X_{k},\overline{\bf V}_{k}^{n})\mu(dx_{t-1})$ . Consequently, from the property of the total variation distance, the left hand side of (55) is bounded by

[TABLE]

This is bounded by $\prod_{i=1}^{\lfloor(t-1-k)/p\rfloor}(1-\omega(\overline{\bf V}_{k+pi-p}^{k+pi-1}))$ from Corollary 1, and (55) is proven.

We proceed to show parts (a)–(c). For part (a), observe that

[TABLE]

and $p_{\theta}({\bf x}^{t}_{t-1}|\overline{\bf V}_{-m}^{n})=\int p_{\theta}({\bf x}^{t}_{t-1}|\overline{\bf V}_{-m}^{n},x_{-m})p_{\theta}(x_{-m}|\overline{\bf V}_{-m}^{n})\mu(dx_{-m})$. Note that, for any conditioning set $\mathcal{G}$, we have $\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}|\mathcal{G})=0$ if $q_{\theta}(X_{t-1},X_{t})=0$. Therefore, the right-hand sides of (56) and (57) can be written as

[TABLE]

with $\mathcal{F}=\{\overline{\bf V}_{-m}^{n},x_{-m}\},\{\overline{\bf V}_{-m}^{n}\}$ . Therefore, part (a) follows from the property of the total variation distance and setting $k=-m$ in (55). Parts (b) and (c) are proven similarly.

Part (d) holds if we show that, for all $-m+1\leq t\leq n$ and $\overline{\bf V}_{-m}^{n}$ ,

[TABLE]

When $t\geq n-1$ , (58) holds trivially. When $t\leq n-2$ , observe that the time-reversed process $\{Z_{n-k}\}_{0\leq k\leq n+m}$ is Markov. Hence, for any $-m+1\leq t\leq k$ , we have $\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|X_{-m},\overline{\bf V}_{-m}^{k})=\int\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|X_{t}=x_{t},\overline{\bf V}_{-m}^{t})p_{\theta}(x_{t}|X_{-m},\overline{\bf V}_{-m}^{k})\mu(dx_{t})$ . Therefore, (58) is proven similarly to (55) by using Lemma 10(b). Part (e) is proven similarly by using Lemma 10(a).

We proceed to show parts (f)–(k). In view of (57), part (f) holds if we show that, for all $-m<s\leq t\leq n$ ,

[TABLE]

When $s\geq t-1$, (59) holds trivially because $\prod_{i=1}^{j}a_{i}=1$ when $j<1$. When $s\leq t-2$, observe that $\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A,{\bf X}_{s-1}^{s}\in B|\overline{\bf V}_{-m}^{n})=\int_{B}\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|{\bf X}_{s-1}^{s}={\bf x}_{s-1}^{s},\overline{\bf V}_{-m}^{n})p_{\theta}({\bf x}_{s-1}^{s}|\overline{\bf V}_{-m}^{n})\mu^{\otimes 2}(d{\bf x}_{s-1}^{s})$ and $\mathbb{P}_{\theta}({\bf X}_{s-1}^{s}\in B|\overline{\bf V}_{-m}^{n})=\int_{B}p_{\theta}({\bf x}_{s-1}^{s}|\overline{\bf V}_{-m}^{n})\mu^{\otimes 2}(d{\bf x}_{s-1}^{s})$. Hence, in view of the Markov property of $\{X_{k}\}$ given $\overline{\bf V}_{-m}^{n}$, the left hand side of (59) is bounded by $\sup_{A}\sup_{x_{s}\in\mathcal{X}}|\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|X_{s}=x_{s},\overline{\bf V}_{-m}^{n})-\mathbb{P}_{\theta}({\bf X}_{t-1}^{t}\in A|\overline{\bf V}_{-m}^{n})|$. From (55), this is bounded by $\prod_{i=1}^{\lfloor(t-1-s)/p\rfloor}(1-\omega(\overline{\bf V}_{s+pi-p}^{s+pi-1}))$, and (59) follows. Part (g) is proven similarly by replacing the conditioning variable $\overline{\bf V}_{-m}^{n}$ with $(X_{-m},\overline{\bf V}_{-m}^{n})$. Parts (h)–(k) follow from (55), (58), and the relation $|\mathrm{cov}(X,Y|\mathcal{F}_{1})-\mathrm{cov}(X,Y|\mathcal{F}_{2})|\leq|E(XY|\mathcal{F}_{1})-E(XY|\mathcal{F}_{2})|+|E(X|\mathcal{F}_{1})-E(X|\mathcal{F}_{2})||E(Y|\mathcal{F}_{1})|+|E(X|\mathcal{F}_{2})||E(Y|\mathcal{F}_{1})-E(Y|\mathcal{F}_{2})|$. ∎
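For completeness, the covariance relation used for parts (h)–(k) can be verified directly from the definition $\mathrm{cov}(X,Y|\mathcal{F})=E(XY|\mathcal{F})-E(X|\mathcal{F})E(Y|\mathcal{F})$. The following is a sketch of that standard calculation, supplied here as an illustration rather than quoted from the paper. Adding and subtracting $E(X|\mathcal{F}_{2})E(Y|\mathcal{F}_{1})$ gives

$$
\begin{aligned}
\mathrm{cov}(X,Y|\mathcal{F}_{1})-\mathrm{cov}(X,Y|\mathcal{F}_{2})
&=E(XY|\mathcal{F}_{1})-E(XY|\mathcal{F}_{2})\\
&\quad-\left[E(X|\mathcal{F}_{1})-E(X|\mathcal{F}_{2})\right]E(Y|\mathcal{F}_{1})\\
&\quad-E(X|\mathcal{F}_{2})\left[E(Y|\mathcal{F}_{1})-E(Y|\mathcal{F}_{2})\right].
\end{aligned}
$$

Taking absolute values and applying the triangle inequality bounds the left hand side by the sum of the absolute values of the three terms on the right, so each term can be controlled separately by a difference of conditional expectations under the two conditioning sets.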

The following lemma corresponds to Lemma 14 of DMR and shows that $\mathbb{E}_{\theta^{*}}[\Psi_{0,m,x}^{1}(\theta)]$ , $\mathbb{E}_{\theta^{*}}[\Psi_{0,m,x}^{2}(\theta)]$ , and $\mathbb{E}_{\theta^{*}}[\Gamma_{0,m,x}(\theta)]$ are continuous in $\theta$ .

Lemma 12.

Assume Assumptions 1–8. Then, for $j=1,2$ , all $x\in\mathcal{X}$ and $m\geq p$ , the functions $\Psi_{0,m,x}^{j}(\theta)$ and $\Gamma_{0,m,x}(\theta)$ are continuous in $\theta\in G$ $\mathbb{P}_{\theta^{*}}$ -a.s. In addition,

[TABLE]

Proof.

The proof is similar to the proof of Lemma 14 in DMR. For brevity, we suppress $W_{t}$ and ${\bf W}^{0}_{-m}$ from $\phi^{j}(\theta,\overline{\bf Z}^{t}_{t-1},W_{t})$ and the conditioning set. We prove part (a) first. Note that $\sup_{\theta\in G}\sup_{x\in\mathcal{X}}|\Psi_{0,m,x}^{j}(\theta)|^{3-j}\leq(2\sum_{t=-m+1}^{0}|\phi_{t}^{j}|_{\infty})^{3-j}\in L^{1}(\mathbb{P}_{\theta^{*}})$ . Hence, the stated result holds if, for $m\geq p$ and $-m+1\leq t\leq 0$ ,

[TABLE]

Write

[TABLE]

For all ${\bf x}^{t}_{t-1}$ such that $p_{\theta}({\bf x}^{t}_{t-1}|\overline{\bf Y}^{0}_{-m},X_{-m}=x)>0$ , $\phi^{j}(\theta,{\bf x}^{t}_{t-1},\overline{\bf Y}^{t}_{t-1})$ is continuous in $\theta$ and bounded by $|\phi_{t}^{j}|_{\infty}<\infty$ . Furthermore,

[TABLE]

Here, $p_{\theta}({\bf X}^{t}_{t-1}={\bf x}^{t}_{t-1},{\bf Y}^{0}_{-m+1}|\overline{\bf Y}_{-m},X_{-m}=x)$ is continuous in $\theta$ (see (52)) and bounded from above by $(\sigma_{+}^{0})^{m}\prod_{i=-m+1}^{0}b_{+}(\overline{\bf Y}_{i-1}^{i})$, and $p_{\theta}({\bf Y}^{0}_{-m+1}|\overline{\bf Y}_{-m},X_{-m}=x)$ is continuous in $\theta$ and bounded from below by $\sigma_{-}^{\lfloor m/p\rfloor}\prod_{t=-m+1}^{0}\int\inf_{\theta\in G}g_{\theta}(Y_{t}|\overline{\bf Y}_{t-1},x_{t})\mu(dx_{t})>0$. Consequently, $p_{\theta}({\bf X}^{t}_{t-1}={\bf x}^{t}_{t-1}|\overline{\bf Y}^{0}_{-m},X_{-m}=x)$ is continuous in $\theta$ and bounded from above uniformly in $\theta\in G$ $\mathbb{P}_{\theta^{*}}$-a.s., and the integrand on the right hand side of (60) is continuous in $\theta$ and bounded from above uniformly in $\theta\in G$ $\mathbb{P}_{\theta^{*}}$-a.s. From the dominated convergence theorem, the left hand side of (60) is continuous in $\theta$ $\mathbb{P}_{\theta^{*}}$-a.s., and part (a) is proven.

Part (b) holds if, for $-m+1\leq s\leq t\leq 0$ ,

[TABLE]

This holds by a similar argument to part (a), and part (b) follows. ∎

Bibliography

Ang, A. and Bekaert, G. (2002), “International Asset Allocation with Regime Shifts,” Review of Financial Studies, 15, 1137–1187.
Ang, A. and Timmermann, A. (2012), “Regime Changes and Financial Markets,” Annual Review of Financial Economics, 4, 313–337.
Bickel, P. J., Ritov, Y., and Rydén, T. (1998), “Asymptotic Normality of the Maximum-Likelihood Estimator for General Hidden Markov Models,” Annals of Statistics, 26, 1614–1635.
Boldin, M. D. (1996), “A Check on the Robustness of Hamilton’s Markov Switching Model Approach to the Economic Analysis of the Business Cycle,” Studies in Nonlinear Dynamics and Econometrics, 1, 35–46.
Camacho, M. and Perez-Quiros, G. (2007), “Jump-and-Rest Effect of U.S. Business Cycles,” Studies in Nonlinear Dynamics and Econometrics, 11, 1–39.
Carrasco, M., Hu, L., and Ploberger, W. (2014), “Optimal Test for Markov Switching Parameters,” Econometrica, 82, 765–784.
Cho, J. S. and White, H. (2007), “Testing for Regime Switching,” Econometrica, 75, 1671–1720.
Dahlquist, M. and Gray, S. F. (2000), “Regime-Switching and Interest Rates in the European Monetary System,” Journal of International Economics, 50, 399–419.
