Nonstationary Gauss-Markov Processes: Parameter Estimation and Dispersion

Peida Tian; Victoria Kostina

arXiv:1907.00304·cs.IT·March 29, 2021

TL;DR

This paper analyzes the maximum likelihood estimation error for a nonstationary Gauss-Markov process, providing tight nonasymptotic bounds and applying these results to determine the source dispersion in lossy compression.

Contribution

It introduces a tight nonasymptotic error bound for parameter estimation in nonstationary Gauss-Markov processes and extends dispersion analysis to the nonstationary case.

Findings

01

Bound on estimation error decays exponentially and is tight for hundreds of samples.

02

Dispersion formula for nonstationary sources matches that of stationary sources under certain conditions.

03

New eigenvalue bounding techniques for covariance matrices in nonstationary processes.

Abstract

This paper provides a precise error analysis for the maximum likelihood estimate $\hat{a}_{\text{ML}}(u_{1}^{n})$ of the parameter $a$ given samples $u_{1}^{n}=(u_{1},\ldots,u_{n})^{\prime}$ drawn from a nonstationary Gauss-Markov process $U_{i}=aU_{i-1}+Z_{i},~i\geq 1$ , where $U_{0}=0$ , $a>1$ , and $Z_{i}$ ’s are independent Gaussian random variables with zero mean and variance $\sigma^{2}$ . We show a tight nonasymptotic exponentially decaying bound on the tail probability of the estimation error. Unlike previous works, our bound is tight already for a sample size of the order of hundreds. We apply the new estimation bound to find the dispersion for lossy compression of nonstationary Gauss-Markov sources. We show that the dispersion is given by the same integral formula that we derived previously for the asymptotically stationary Gauss-Markov sources, i.e., $|a|<1$ . New ideas in the nonstationary case…

Tables

Table I: Order dependence in $\eta_{n}$ of the quantities involved in determining $\alpha_{n,\ell}$ in (128) and (129).

$α_{1}$	$r_{1}$	$r_{2}$	$r_{2} - r_{1}$	$q$	$- \frac{α_{1} - r_{1}}{α_{1} - r_{2}}$
$- Θ (η_{n}^{2})$	$- Θ (1)$	$Θ (η_{n}^{2})$	$Θ (1)$	$Θ (1)$	$Θ (1 / η_{n}^{2})$

Equations

$U_{i}=aU_{i-1}+Z_{i},\quad\forall i\geq 1,$

$\hat{a}_{\text{ML}}(u_{1}^{n})=\frac{\sum_{i=1}^{n-1}u_{i}u_{i+1}}{\sum_{i=1}^{n-1}u_{i}^{2}}.$

$h(n)\triangleq\begin{cases}\sqrt{\frac{n}{1-a^{2}}}, & |a|<1,\\ \frac{n}{\sqrt{2}}, & |a|=1,\\ \frac{|a|^{n}}{a^{2}-1}, & |a|>1.\end{cases}$

$\frac{B^{2}(1)-1}{2\int_{0}^{1}B^{2}(t)\,dt},$

$X_{i+1}$

$Y_{i}$

$P^{+}(n,a,\eta)$

$P^{-}(n,a,\eta)$

$P(n,a,\eta)\triangleq-\frac{1}{n}\log\mathbb{P}\left[\left|\hat{a}_{\text{ML}}(U_{1}^{n})-a\right|>\eta\right].$

$P^{+}(n,a,\eta)\ (\text{and }P^{-}(n,a,\eta))\geq\frac{1}{2}\log(1+\eta^{2}).$

$P^{+}(n,a,\eta)\ (\text{and }P^{-}(n,a,\eta))\geq\frac{\eta^{2}}{2(1+y_{\eta})},$

$\mathsf{d}(x_{1}^{n},y_{1}^{n})\triangleq\frac{1}{n}\sum_{i=1}^{n}(x_{i}-y_{i})^{2}.$

$M^{\star}(n,d,\epsilon)$

$R(n,d,\epsilon)$

$\epsilon^{\star}(n,d,M)$

$-\frac{1}{n}\log\epsilon^{\star}(n,d,M)=\min_{P_{\hat{X}}}D(P_{\hat{X}}||P_{X})+O\left(\frac{\log n}{n}\right),$

$R(n,d,\epsilon)=\mathbb{R}_{X}(d)+Q^{-1}(\epsilon)\sqrt{\frac{\mathbb{V}(d)}{n}}+O\left(\frac{\log n}{n}\right),$

$\alpha_{1}$

$\alpha_{\ell}$

$\beta_{1}$

$\beta_{\ell}$

$\mathcal{S}_{n}^{+}\triangleq\left\{s\in\mathbb{R}:s>0,~\alpha_{\ell}<\frac{1}{2\sigma^{2}},~\forall\ell\in[n]\right\},$

$\mathcal{S}_{n}^{-}\triangleq\left\{s\in\mathbb{R}:s>0,~\beta_{\ell}<\frac{1}{2\sigma^{2}},~\forall\ell\in[n]\right\}.$

$P^{+}(n,a,\eta)$

$P^{-}(n,a,\eta)$

$\limsup_{n\to\infty}P^{+}(n,a,\eta)\leq\limsup_{n\to\infty}\sup_{s\in\mathcal{S}_{n}^{+}}\frac{1}{2n}\sum_{\ell=1}^{n-1}\log(1-2\sigma^{2}\alpha_{\ell}),$

$\mathcal{S}_{\infty}^{+}$

$\mathcal{S}_{\infty}^{-}$

$\mathcal{S}_{n+1}^{+}\subseteq\mathcal{S}_{n}^{+},\quad\mathcal{S}_{n+1}^{-}\subseteq\mathcal{S}_{n}^{-}.$

$\mathcal{S}_{\infty}^{+}=\left(0,\frac{2\eta}{\sigma^{2}}\right],$

$\mathcal{S}_{\infty}^{-}\supsetneq\left(0,\frac{2\eta}{\sigma^{2}}\right].$

$\mathcal{S}_{1}^{+}=\mathcal{S}_{1}^{-}=\left\{s\in\mathbb{R}:0<s<\frac{\eta+\sqrt{1+\eta^{2}}}{\sigma^{2}}\right\}.$

$\liminf_{n\to\infty}P^{+}(n,a,\eta)$

$\liminf_{n\to\infty}P^{-}(n,a,\eta)$

$\liminf_{n\to\infty}P(n,a,\eta)$

Full text

Nonstationary Gauss-Markov Processes: Parameter Estimation and Dispersion

Peida Tian, Victoria Kostina

P. Tian and V. Kostina are with the Department of Electrical Engineering, California Institute of Technology (e-mail: {ptian, vkostina}@caltech.edu). This research was supported in part by the National Science Foundation (NSF) under Grant CCF-1751356. A preliminary version [1] of this paper was presented at the 2019 IEEE International Symposium on Information Theory.

Copyright ©2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Abstract

This paper provides a precise error analysis for the maximum likelihood estimate $\hat{a}_{\text{ML}}(u_{1}^{n})$ of the parameter $a$ given samples $u_{1}^{n}=(u_{1},\ldots,u_{n})^{\prime}$ drawn from a nonstationary Gauss-Markov process $U_{i}=aU_{i-1}+Z_{i},~{}i\geq 1$ , where $U_{0}=0$ , $a>1$ , and $Z_{i}$ ’s are independent Gaussian random variables with zero mean and variance $\sigma^{2}$ . We show a tight nonasymptotic exponentially decaying bound on the tail probability of the estimation error. Unlike previous works, our bound is tight already for a sample size of the order of hundreds. We apply the new estimation bound to find the dispersion for lossy compression of nonstationary Gauss-Markov sources. We show that the dispersion is given by the same integral formula that we derived previously for the asymptotically stationary Gauss-Markov sources, i.e., $|a|<1$ . New ideas in the nonstationary case include separately bounding the maximum eigenvalue (which scales exponentially) and the other eigenvalues (which are bounded by constants that depend only on $a$ ) of the covariance matrix of the source sequence, and new techniques in the derivation of our estimation error bound.

Index Terms:

Parameter estimation, maximum likelihood estimator, unstable processes, finite blocklength analysis, lossy compression, sources with memory, rate-distortion theory, system identification, covering in stochastic processes, adaptive control.

I Introduction

I-A Overview

We consider two related problems that concern a scalar Gauss-Markov process $\{U_{i}\}_{i=1}^{\infty}$ , defined by $U_{0}=0$ and

$$U_{i}=aU_{i-1}+Z_{i},\quad\forall i\geq 1,\tag{1}$$

where $Z_{i}$ ’s are independent Gaussian random variables with zero mean and variance $\sigma^{2}$ .

The first problem is parameter estimation: given samples $u_{1}^{n}$ drawn from the Gauss-Markov source, we seek to design and analyse estimators for the unknown system parameter $a$ . The consistency and asymptotic distribution of the maximum likelihood (ML) estimator have been studied in the literature [2, 3, 4, 5, 6, 7]. Our main contribution is a large deviation bound on the estimation error of the ML estimator. Our numerical experiments indicate that our new bound is tighter than previously known results [8, 9, 10].

The second problem is the nonasymptotic performance of the optimal lossy compressor of the Gauss-Markov process. An encoder outputs $nR$ bits for each realization $u_{1}^{n}$ . Once the decoder receives the $nR$ bits, it produces $\hat{u}_{1}^{n}$ as a reproduction of $u_{1}^{n}$ . The distortion between $u_{1}^{n}$ and $\hat{u}_{1}^{n}$ is measured by the mean squared error (MSE). Two commonly used criteria to quantify the distortion of a lossy compression scheme are the average distortion criterion and the excess-distortion probability criterion. The rate-distortion theory, initiated by Shannon [11] and further pioneered in [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], studies the optimal tradeoff between the rate $R$ and the distortion. In the limit of large blocklength $n$ , the minimum rate $R$ required to achieve average distortion $d$ is given by the rate-distortion function. The nonasymptotic version of the rate-distortion problem [21, 23, 24, 25, 26] studies the rate-distortion tradeoff for finite blocklength $n$ . Our main contribution is a coding theorem that characterizes the gap between the rate-distortion function and the minimum rate $R$ at blocklength $n$ for the nonstationary Gauss-Markov source ( $a>1$ ), under the excess-distortion probability criterion. We leverage our result on the ML estimator to analyze lossy compression. Namely, we apply our bound on the estimation error of the ML estimator to construct a typical set of the sequences whose estimated parameter $a$ is close to the true $a$ . We then use the typical set in our achievability proof of the nonasymptotic coding theorem.

Without loss of generality, we assume that $a\geq 0$ in this paper, since, otherwise, we can consider another random process $\{U^{\prime}_{i}\}_{i=1}^{\infty}$ defined by the invertible mapping $U^{\prime}_{i}\triangleq(-1)^{i}U_{i}$ that satisfies $U^{\prime}_{i}=(-a)U^{\prime}_{i-1}+(-1)^{i}Z_{i}$ , where $(-1)^{i}Z_{i}$ ’s are also independent zero-mean Gaussian random variables with variance $\sigma^{2}$ . We distinguish the following three cases:

• $0<a<1$ : the asymptotically stationary case;

• $a=1$ : the unit-root case;

• $a>1$ : the nonstationary case.

In this paper, we mostly focus on the nonstationary case.
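The sign-flip reduction used above to restrict attention to $a\geq 0$ can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `check_sign_flip` and all parameter values are illustrative assumptions. It simulates the process with a negative coefficient and verifies that $U^{\prime}_{i}=(-1)^{i}U_{i}$ satisfies the recursion with coefficient $-a$ and noise $(-1)^{i}Z_{i}$.

```python
import random

def check_sign_flip(n=20, a=-1.5, sigma=1.0, seed=1):
    """Return the largest residual of U'_i = (-a) U'_{i-1} + (-1)^i Z_i."""
    rng = random.Random(seed)
    # z[i] plays the role of Z_i; index 0 is unused so indices match the text.
    z = [rng.gauss(0.0, sigma) for _ in range(n + 1)]
    u = [0.0] * (n + 1)                      # u[0] = U_0 = 0
    for i in range(1, n + 1):
        u[i] = a * u[i - 1] + z[i]           # original recursion with coefficient a
    # Transformed process U'_i = (-1)^i U_i.
    up = [(-1) ** i * u[i] for i in range(n + 1)]
    return max(
        abs(up[i] - ((-a) * up[i - 1] + (-1) ** i * z[i]))
        for i in range(1, n + 1)
    )

residual = check_sign_flip()
```

The residual is zero up to floating-point rounding, which reflects that the mapping is an exact algebraic identity rather than an approximation.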

I-B Motivations

Estimation of parameters of stochastic processes from their realizations has many applications. In the statistical analysis of economic time series [2, 27, 28], the Gauss-Markov process $\{U_{i}\}_{i=1}^{\infty}$ is used to model the varying price of a certain commodity with time, and the ML estimate of the unknown coefficient $a$ is then used to predict future prices. In [29] and [30, Sec. 5], the Gauss-Markov process with $a=1$ is used to model the stochastic structure of the velocity of money. The Gauss-Markov process, also known as the autoregressive process of order 1 (AR(1)), is a special case of the general autoregressive-moving-average (ARMA) model [31, 32], for which various estimation and prediction procedures have been proposed, e.g. the Box-Jenkins method [32]. The Gauss-Markov process is also a special case of the linear state-space model (e.g. [33, Chap. 5]) that is popular in control theory. One of the problems in control is system identification [34], which is the problem of building mathematical models using measured data from unknown dynamical systems. Parameter estimation is one of the common methods used in system identification where the dynamical system is modeled by a state-space model [34, Chap. 7] with unknown parameters. In modern data-driven control systems, where the goal is to control an unknown nonstationary system given measured data, parameter estimation methods are used as a first step in designing controllers [10] [35, Sec. 1.2]. In speech signal processing, the linear predictive coding algorithm [36] relies on parameter estimation (the ordinary least squares estimate, or, equivalently, the maximum likelihood estimate assuming Gaussian noise) to fit a higher-order Gauss-Markov process, see [36, App. C]. A fine-grained analysis of the ML estimate is instrumental in optimizing the design of all these systems. 
Our nonasymptotic analysis leading up to a large deviation bound for the ML estimate in our simple setting can provide insights for analyzing more complex random processes, e.g., higher-order autoregressive processes and vector systems.

Understanding finite-blocklength lossy compression of the Gauss-Markov process fits into a continuing effort by many researchers to advance the rate-distortion theory of information sources with memory, see [13, 14, 15, 17, 18, 20, 22, 37, 38, 39, 40, 41, 19, 42, 43, 44], as well as into a newer push [21, 23, 24, 25, 26, 45, 46, 47, 48, 49, 50] to understand the fundamental limits of low latency communication. There is a tight connection between lossy compression of the nonstationary Gauss-Markov process and control of an unstable linear system under communication constraints [51, 52]. Namely, the minimum channel capacity needed to achieve a given LQG (linear quadratic Gaussian) cost for the plant [51, Eq. (1)] is lower-bounded by the causal rate-distortion function of the Gauss-Markov process [51, Eq. (9)]. See [52, Th. 1] for more details. Being more restrictive on the coding schemes, the causal rate-distortion function is further lower-bounded by the traditional rate-distortion function. The result in this paper on the rate-distortion tradeoff in the finite blocklength regime provides a lower bound on the minimum communication rate required to ensure that the LQG cost stays below a desired threshold with desired probability at the end of a finite horizon. Finally, the aforementioned linear predictive coding algorithm [36] is connected to lossy compression of autoregressive processes, see a recent historical note by Gray [53, p.2].

I-C Notations

For $n\in\mathbb{N}$ , we use $[n]$ to denote the set $\{1,2,\ldots,n\}$ . We use the standard notations for the asymptotic behaviors $O(\cdot),o(\cdot)$ , $\Theta(\cdot)$ , $\Omega(\cdot)$ and $\omega(\cdot)$ . Namely, for two functions $f(n)$ and $g(n)$ of $n$ , $f(n)=O(g(n))$ means that there exist a constant $c>0$ and $n_{0}\in\mathbb{N}$ such that $|f(n)|\leq c|g(n)|$ for any $n\geq n_{0}$ ; $f(n)=o(g(n))$ means $\lim_{n\rightarrow\infty}f(n)/g(n)=0$ ; $f(n)=\Theta(g(n))$ means there exist positive constants $c_{1},c_{2}$ and $n_{0}\in\mathbb{N}$ such that $c_{1}g(n)\leq f(n)\leq c_{2}g(n)$ for any $n\geq n_{0}$ ; $f(n)=\Omega(g(n))$ if and only if $g(n)=O(f(n))$ ; and $f(n)=\omega(g(n))$ if and only if $\lim_{n\rightarrow\infty}f(n)/g(n)=+\infty$ . For a matrix $\mathsf{M}$ , we denote by $\mathsf{M}^{\prime}$ its transpose, by $\|\mathsf{M}\|$ its operator norm (the largest singular value) and by $\mu_{1}(\mathsf{M})\leq\ldots\leq\mu_{n}(\mathsf{M})$ its eigenvalues listed in nondecreasing order. We use $\mathcal{S}^{c}$ to denote the complement of a set $\mathcal{S}$ . All logarithms and exponentials are base $e$ .

II Previous Works

II-A Parameter Estimation

The maximum likelihood (ML) estimate $\hat{a}_{\text{ML}}(u_{1}^{n})$ of the parameter $a$ given samples $u_{1}^{n}=(u_{1},\ldots,u_{n})^{\prime}$ drawn from the Gauss-Markov source is given by

$$\hat{a}_{\text{ML}}(u_{1}^{n})=\frac{\sum_{i=1}^{n-1}u_{i}u_{i+1}}{\sum_{i=1}^{n-1}u_{i}^{2}}.\tag{2}$$

The derivation of (2) is straightforward, e.g. [48, App. F-A]. The problem is to provide performance guarantees of $\hat{a}_{\text{ML}}(u_{1}^{n})$ . This simply formulated problem has been widely studied in the literature. Our main contribution in this paper is a nonasymptotic fine-grained large deviations analysis of the estimation error.
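As a concrete illustration of the estimator, the sketch below simulates the process $U_{i}=aU_{i-1}+Z_{i}$ with $U_{0}=0$ and evaluates the ratio of sums $\sum_{i=1}^{n-1}u_{i}u_{i+1}/\sum_{i=1}^{n-1}u_{i}^{2}$. The function names, the seed, and the values $n=100$ , $a=1.2$ are assumptions chosen for illustration only.

```python
import random

def simulate_gauss_markov(n, a, sigma=1.0, seed=0):
    """Draw u_1, ..., u_n from U_i = a * U_{i-1} + Z_i with U_0 = 0."""
    rng = random.Random(seed)
    u, prev = [], 0.0
    for _ in range(n):
        prev = a * prev + rng.gauss(0.0, sigma)
        u.append(prev)
    return u

def ml_estimate(u):
    """ML estimate of a: sum_{i<n} u_i u_{i+1} divided by sum_{i<n} u_i^2."""
    num = sum(x * y for x, y in zip(u[:-1], u[1:]))
    den = sum(x * x for x in u[:-1])
    return num / den

u = simulate_gauss_markov(n=100, a=1.2)
a_hat = ml_estimate(u)   # expected to be very close to 1.2 in the nonstationary case
```

On a noiseless geometric sequence such as `[1.0, 2.0, 4.0, 8.0]` the estimator returns the ratio exactly, which is a quick sanity check of the index ranges.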

The estimate $\hat{a}_{\text{ML}}(u_{1}^{n})$ in (2) has been extensively studied in the statistics [4, 6] and economics [2, 3] communities. Mann and Wald [2] and Rubin [3] showed that the estimation error $\hat{a}_{\text{ML}}(U_{1}^{n})-a$ converges to 0 in probability for any $a\in\mathbb{R}$ . Rissanen and Caines [6] later proved that $\hat{a}_{\text{ML}}(U_{1}^{n})-a$ converges to 0 almost surely for $0<a<1$ . To better understand the finer scaling of the error $\hat{a}_{\text{ML}}(U_{1}^{n})-a$ , researchers turned to study the limiting distribution of the normalized estimation error $h(n)(\hat{a}_{\text{ML}}(U_{1}^{n})-a)$ for a careful choice of the standardizing function $h(n)$ :

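$$h(n)\triangleq\begin{cases}\sqrt{\frac{n}{1-a^{2}}}, & |a|<1,\\ \frac{n}{\sqrt{2}}, & |a|=1,\\ \frac{|a|^{n}}{a^{2}-1}, & |a|>1.\end{cases}\tag{3}$$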

With the above choices of $h(n)$ , Mann and Wald [2] and White [4] showed that the distribution of the normalized estimation error $h(n)(\hat{a}_{\text{ML}}(U_{1}^{n})-a)$ converges to $\mathcal{N}(0,1)$ for $|a|<1$ ; to the standard Cauchy distribution for $|a|>1$ ; and for $|a|=1$ , to the distribution of

$$\frac{B^{2}(1)-1}{2\int_{0}^{1}B^{2}(t)\,dt},\tag{4}$$

where $\{B(t):t\in[0,1]\}$ is a Brownian motion.
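The scaling by $h(n)$ can be explored with a short simulation. The sketch below assumes the standardizing function has the form $\sqrt{n/(1-a^{2})}$ for $|a|<1$ , $n/\sqrt{2}$ for $|a|=1$ , and $|a|^{n}/(a^{2}-1)$ for $|a|>1$ ; the square roots are an assumption about the original formula. It then draws normalized errors for $a=1.5$ and records the fraction falling in $[-1,1]$ , which equals $1/2$ for a standard Cauchy random variable. The helper names and all parameter values are illustrative.

```python
import math
import random

def h(n, a):
    """Standardizing function, in the form assumed in the lead-in above."""
    if abs(a) < 1:
        return math.sqrt(n / (1 - a * a))
    if abs(a) == 1:
        return n / math.sqrt(2)
    return abs(a) ** n / (a * a - 1)

def normalized_error(n, a, rng, sigma=1.0):
    """One draw of h(n) * (a_hat_ML - a) for U_i = a U_{i-1} + Z_i, U_0 = 0."""
    u, prev = [], 0.0
    for _ in range(n):
        prev = a * prev + rng.gauss(0.0, sigma)
        u.append(prev)
    num = sum(x * y for x, y in zip(u[:-1], u[1:]))
    den = sum(x * x for x in u[:-1])
    return h(n, a) * (num / den - a)

rng = random.Random(0)
draws = [normalized_error(40, 1.5, rng) for _ in range(2000)]
# For a standard Cauchy variable X, P(|X| <= 1) = 1/2.
frac_within_one = sum(abs(x) <= 1.0 for x in draws) / len(draws)
```

A fraction near one half is consistent with the Cauchy limit for $|a|>1$ , although a single summary statistic is of course not a test of the full distribution.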

Generalizations of the above results in several directions have also been investigated. In [2, Sec. 4], the maximum likelihood estimator for the $p$ -th order stationary autoregressive processes with i.i.d. zero-mean $Z_{i}$ ’s that have bounded moments (not necessarily Gaussian) was shown to be weakly consistent, and the scaled estimation errors $\sqrt{n}(\hat{a}_{j}-a_{j})$ for $j=1,\ldots,p$ were shown to converge in distribution to Gaussian random variables as $n$ tends to infinity. Anderson [5, Sec. 3] studied the limiting distribution of the maximum likelihood estimator for a nonstationary vector version of the process (1). Chan and Wei [7] studied the performance of the estimation error when $a$ is not a constant but approaches 1 from below at a rate of order $1/n$ . Estimating $a$ from a block of outcomes of the Gauss-Markov source (1) is one of the simplest versions of the problem of system identification, where the goal is to learn system parameters of a dynamical system from the observations [54, 55, 56, 57, 10]. One objective of those studies is to obtain tight performance bounds on the least-squares estimates of the system parameters $\mathsf{A},\mathsf{B},\mathsf{C},\mathsf{D}$ from a single input / output trajectory $\{W_{i},Y_{i}\}_{i=1}^{n}$ in the following state-space model, e.g. [55, Eq. (1)–(2)]:

[TABLE]

where $X_{i},W_{i},Z_{i},V_{i}$ ’s are random vectors of certain dimensions and the system parameters $\mathsf{A},\mathsf{B},\mathsf{C},\mathsf{D}$ are matrices of appropriate dimensions. The Gauss-Markov process in (1) can be written as the state-space model by choosing $\mathsf{A}=a$ being a scalar, $\mathsf{B}=\mathsf{D}=0$ , $\mathsf{C}=1$ and $V_{i}=0$ . For stable vector systems, that is, $\|\mathsf{A}\|<1$ , Oymak and Ozay [55, Thm. 3.1] showed that the estimation error in spectral norm is $O(1/\sqrt{n})$ with high probability, where $n$ is the number of samples. For the subclass of the regular unstable systems [57, Def. 3], Faradonbeh et al. [57, Thm. 1] proved that the probability of estimation error exceeding a positive threshold in spectral norm decays exponentially in $n$ . For the Gauss-Markov processes considered in the present paper, Simchowitz et al. [54, Thm. B.1] and Sarkar and Rakhlin [56, Prop. 4.1] presented tail bounds on the estimation error of the ML estimate.

Another line of work closely related to this paper is the large deviation principle (LDP) [58, Ch. 1.2] on $\hat{a}_{\text{ML}}(U_{1}^{n})-a$ . Given an error threshold $\eta>0$ , define $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ as follows:

[TABLE]

We also define $P(n,a,\eta)$ as

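$$P(n,a,\eta)\triangleq-\frac{1}{n}\log\mathbb{P}\left[\left|\hat{a}_{\text{ML}}(U_{1}^{n})-a\right|>\eta\right].\tag{9}$$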

The large deviation theory studies the rate functions, defined as the limits of $P^{+}(n,a,\eta)$ , $P^{-}(n,a,\eta)$ and $P(n,a,\eta)$ , as $n$ goes to infinity. Bercu et al. [8, Prop. 8] found the rate function for the case of $0<a<1$ . For $a\geq 1$ , Worms [9, Thm. 1] proved that the rate functions can be bounded from below implicitly by the optimal value of an optimization problem.
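For small $n$ , the two-sided quantity $P(n,a,\eta)=-\frac{1}{n}\log\mathbb{P}[|\hat{a}_{\text{ML}}(U_{1}^{n})-a|>\eta]$ can be estimated by plain Monte Carlo. The sketch below is only an illustration: the function name `tail_exponent`, the trial count, and the parameter values are assumptions, and the approach becomes impractical once the tail probability is too small to observe in the chosen number of trials.

```python
import math
import random

def tail_exponent(n, a, eta, trials=20000, sigma=1.0, seed=0):
    """Monte Carlo estimate of -(1/n) log P[|a_hat_ML - a| > eta], for n >= 2."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        prev, num, den = 0.0, 0.0, 0.0       # prev starts at U_0 = 0
        for i in range(n):
            cur = a * prev + rng.gauss(0.0, sigma)
            if i > 0:                        # accumulate sums over i = 1, ..., n-1
                num += prev * cur
                den += prev * prev
            prev = cur
        if abs(num / den - a) > eta:
            exceed += 1
    if exceed == 0:
        return math.inf                      # event never observed in this run
    return -math.log(exceed / trials) / n

p_hat = tail_exponent(n=5, a=1.2, eta=0.5)
```

Because the error probability decays exponentially in $n$ , direct simulation of this kind is informative only for short blocks, which is one reason analytical bounds are of interest.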

These studies of the limiting distribution and the LDP of the estimation error are asymptotic. In this paper, we develop a nonasymptotic analysis of the estimation error. Two nonasymptotic lower bounds on $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ are available in the literature. For any $a\in\mathbb{R}$ , Rantzer [10, Th. 4] showed that

$$P^{+}(n,a,\eta)\ (\text{and }P^{-}(n,a,\eta))\geq\frac{1}{2}\log(1+\eta^{2}).\tag{10}$$

Bercu and Touati [59, Cor. 5.2] proved that

$$P^{+}(n,a,\eta)\ (\text{and }P^{-}(n,a,\eta))\geq\frac{\eta^{2}}{2(1+y_{\eta})},\tag{11}$$

where $y_{\eta}$ is the unique positive solution to $(1+x)\log(1+x)-x-\eta^{2}=0$ in $x$ . Neither (10) nor (11) captures the dependence on $a$ and $n$ , and each gives the same bound for $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ . The bounds in [54, 55, 56, 57, 10] are either optimal only order-wise or involve implicit constants. Our main result on parameter estimation is a tight nonasymptotic lower bound on $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ . For larger $a$ , the lower bound becomes larger, which suggests that unstable systems are easier to estimate than stable ones, an observation consistent with [54]. The proof is inspired by Rantzer [10, Lem. 5], but our result improves on Rantzer’s result (10) and Bercu and Touati’s result (11), see Fig. 1 for a comparison. Most of our results generalize to the case where $Z_{i}$ ’s are i.i.d. sub-Gaussian random variables, see Theorem 4 in Section III-D below.
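The two earlier bounds are simple to evaluate numerically. The sketch below takes (10) to be $\frac{1}{2}\log(1+\eta^{2})$ and (11) to be $\frac{\eta^{2}}{2(1+y_{\eta})}$ , and finds $y_{\eta}$ by bisection on $(1+x)\log(1+x)-x-\eta^{2}$ , which is negative at $x=0$ and increasing for $x>0$ . The function names and the tolerance are illustrative assumptions.

```python
import math

def rantzer_bound(eta):
    """Right side of (10): 0.5 * log(1 + eta^2)."""
    return 0.5 * math.log(1.0 + eta ** 2)

def y_eta(eta, tol=1e-12):
    """Unique positive root of (1 + x) log(1 + x) - x - eta^2 = 0, by bisection."""
    def f(x):
        return (1.0 + x) * math.log(1.0 + x) - x - eta ** 2
    lo, hi = 0.0, 1.0
    while f(hi) < 0.0:                 # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bercu_touati_bound(eta):
    """Right side of (11): eta^2 / (2 * (1 + y_eta))."""
    return eta ** 2 / (2.0 * (1.0 + y_eta(eta)))

for eta in (0.1, 0.5, 1.0, 2.0):
    print(eta, rantzer_bound(eta), bercu_touati_bound(eta))
```

For $\eta=1$ the root can be verified by hand: $x=e-1$ gives $(1+x)\log(1+x)-x=e-(e-1)=1$ , so $y_{1}=e-1$ .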

II-B Nonasymptotic Rate-distortion Theory

The rate-distortion theory studies the problem of compressing a generic random process $\{X_{i}\}_{i=1}^{\infty}$ with minimum distortion. Given a distortion threshold $d>0$ , an excess-distortion probability $\epsilon\in(0,1)$ and the number of codewords $M\in\mathbb{N}$ , an $(n,M,d,\epsilon)$ lossy compression code for a random vector $X_{1}^{n}$ consists of an encoder $\mathsf{f}_{n}\colon\mathbb{R}^{n}\rightarrow[M]$ , and a decoder $\mathsf{g}_{n}\colon[M]\rightarrow\mathbb{R}^{n}$ , such that $\mathbb{P}\left[\mathsf{d}\left(X_{1}^{n},\mathsf{g}_{n}\left(\mathsf{f}_{n}(X_{1}^{n})\right)\right)>d\right]\leq\epsilon$ , where $\mathsf{d}(\cdot,\cdot)$ is the distortion measure. This paper considers the mean squared error (MSE) distortion: $\forall~{}x_{1}^{n},~{}y_{1}^{n}\in\mathbb{R}^{n}$ ,

$$\mathsf{d}(x_{1}^{n},y_{1}^{n})\triangleq\frac{1}{n}\sum_{i=1}^{n}(x_{i}-y_{i})^{2}.\tag{12}$$

The minimum achievable code size and source coding rate are defined respectively by

[TABLE]

In this paper, we approximate the nonasymptotic coding rate $R(n,d,\epsilon)$ for the nonstationary Gauss-Markov source.

Another related and widely studied setting is compression under the average distortion criterion. Given a distortion threshold $d>0$ and the number of codewords $M\in\mathbb{N}$ , an $(n,M,d)$ lossy compression code for a random vector $X_{1}^{n}$ consists of an encoder $\mathsf{f}_{n}\colon\mathbb{R}^{n}\rightarrow[M]$ , and a decoder $\mathsf{g}_{n}\colon[M]\rightarrow\mathbb{R}^{n}$ , such that $\mathbb{E}\left[\mathsf{d}\left(X_{1}^{n},\mathsf{g}_{n}\left(\mathsf{f}_{n}(X_{1}^{n})\right)\right)\right]\leq d$ . Similarly, one can define $M^{\star}(n,d)$ and $R(n,d)$ as the minimum achievable code size and source coding rate, respectively, under the average distortion criterion. The traditional rate-distortion theory [11, 12, 17, 18, 15, 16] showed that the limit of the operational source coding rate $R(n,d)$ as $n$ tends to infinity equals the informational rate-distortion function for a wide class of sources. For discrete memoryless sources, Zhang, Yang and Wei in [23] showed that $R(n,d)$ approaches the rate-distortion function as $\log n/2n+o(\log n/n)$ . For abstract alphabet memoryless sources, Yang and Zhang in [24, Th. 2] showed a similar convergence rate.

Under the excess-distortion probability criterion, one can also study the nonasymptotic behavior of the minimum achievable excess-distortion probability $\epsilon^{\star}(n,d,M)$ :

[TABLE]

Marton’s excess distortion exponent [21, Th. 1, Eq. (2)-(3), (20)] showed that for discrete memoryless sources $P_{X}$ , it holds that

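$$-\frac{1}{n}\log\epsilon^{\star}(n,d,M)=\min_{P_{\hat{X}}}D(P_{\hat{X}}||P_{X})+O\left(\frac{\log n}{n}\right),\tag{16}$$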

where the minimization is over all probability distributions $P_{\hat{X}}$ such that $\mathbb{R}_{\hat{X}}(d)\geq\frac{\log M}{n}$ , where $M$ is such that $\frac{\log M}{n}$ is a constant, $\mathbb{R}_{\hat{X}}(d)$ denotes the rate-distortion function of a discrete memoryless source with single-letter distribution $P_{\hat{X}}$ , and $D(\cdot||\cdot)$ denotes the Kullback-Leibler divergence. As pointed out by [25, p. 2], for fixed $d>0$ and $\epsilon\in(0,1)$ , even the limit of $R(n,d,\epsilon)$ as $n$ goes to infinity is unanswered by Marton’s bound in (16). Ingber and Kochman [25] (for finite-alphabet and Gaussian sources) and Kostina and Verdú [26] (for abstract sources) showed that the minimum achievable source coding rate $R(n,d,\epsilon)$ admits the following expansion, known as the Gaussian approximation [60]:

$$R(n,d,\epsilon)=\mathbb{R}_{X}(d)+Q^{-1}(\epsilon)\sqrt{\frac{\mathbb{V}(d)}{n}}+O\left(\frac{\log n}{n}\right),\tag{17}$$

where $\mathbb{V}(d)$ is the dispersion of the source (defined as the variance of the tilted information random variable, details later) and $Q^{-1}$ denotes the inverse Q-function. In this paper, by extending our previous analysis [48, Th. 1] of the stationary Gauss-Markov source to the nonstationary one, we establish the Gaussian approximation in the form of (17) for the nonstationary Gauss-Markov sources. One of the key ideas behind this extension is to construct a typical set using the ML estimate of $a$ , and to use our estimation error bound to probabilistically characterize that set.
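To give a feel for the size of the second-order term, the sketch below evaluates $\mathbb{R}_{X}(d)+Q^{-1}(\epsilon)\sqrt{\mathbb{V}(d)/n}$ with the $O(\log n/n)$ remainder dropped. The square-root form of the dispersion term is assumed here. The numbers plugged in are for a memoryless unit-variance Gaussian source with $d=0.25$ , for which the rate-distortion function is $\frac{1}{2}\log(1/d)$ nats and the dispersion is $\frac{1}{2}$ nats squared; they are an illustrative stand-in, not the Gauss-Markov dispersion derived in this paper. The function names are assumptions.

```python
import math
from statistics import NormalDist

def q_inv(eps):
    """Inverse Q-function: Q^{-1}(eps) = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1.0 - eps)

def gaussian_approx_rate(n, rate, dispersion, eps):
    """rate + sqrt(dispersion / n) * Q^{-1}(eps), in nats per sample.

    The O(log n / n) remainder of the expansion is ignored.
    """
    return rate + math.sqrt(dispersion / n) * q_inv(eps)

# Illustrative memoryless Gaussian example (not the Gauss-Markov source).
d = 0.25
rate = 0.5 * math.log(1.0 / d)
dispersion = 0.5
for n in (100, 1000, 10000):
    print(n, gaussian_approx_rate(n, rate, dispersion, eps=0.01))
```

The gap to the asymptotic rate shrinks like $1/\sqrt{n}$ , so a hundredfold increase in blocklength reduces the dispersion penalty by a factor of ten.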

III Parameter Estimation

III-A Nonasymptotic Lower Bounds

We first present our nonasymptotic bounds on $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ , defined in (7) and (8) above, respectively. We define two sequences $\{\alpha_{\ell}\}_{\ell\in\mathbb{N}}$ and $\{\beta_{\ell}\}_{\ell\in\mathbb{N}}$ as follows. Let $\sigma^{2}>0$ and $a>1$ be fixed constants. For $\eta>0$ and a parameter $s>0$ , let $\alpha_{\ell}$ be the following sequence

[TABLE]

Similarly, let $\beta_{\ell}$ be the following sequence

[TABLE]

Note the subtle difference between (19) and (21): there is a negative sign in the numerator in (21). Both sequences depend on $\eta$ and $s$ . We derive closed-form expressions and analyze the convergence properties of $\alpha_{\ell}$ and $\beta_{\ell}$ in Appendices A-B and A-C below. For $\eta>0$ and $n\in\mathbb{N}$ , we define the following sets

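$$\mathcal{S}_{n}^{+}\triangleq\left\{s\in\mathbb{R}:s>0,~\alpha_{\ell}<\frac{1}{2\sigma^{2}},~\forall\ell\in[n]\right\},\tag{22}$$

$$\mathcal{S}_{n}^{-}\triangleq\left\{s\in\mathbb{R}:s>0,~\beta_{\ell}<\frac{1}{2\sigma^{2}},~\forall\ell\in[n]\right\}.\tag{23}$$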

Theorem 1.

For any constant $\eta>0$ , the estimator (2) satisfies for any $n\geq 2$ ,

[TABLE]

where $\alpha_{\ell}$ and $\beta_{\ell}$ are defined in (19) and (21), respectively, and $\mathcal{S}_{n}^{+}$ and $\mathcal{S}_{n}^{-}$ are defined in (22) and (23), respectively.

Theorem 1 is a useful result for numerically computing lower bounds on $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ . In Fig. 1, we plot our lower bounds in Theorem 1, previous results in (10) by Rantzer and (11) by Bercu and Touati, and a simulation result. As one can see, our bound in Theorem 1 is much tighter than previous results.

The proof of Theorem 1, presented in Appendix A-A below, is a detailed analysis of the Chernoff bound using the tower property of conditional expectations. The proof is motivated by [10, Lem. 5], but our analysis is more accurate and the result is significantly tighter, see Fig. 1 and Fig. 3 for comparisons. One recovers Rantzer’s lower bound (10) by setting $s=\eta/\sigma^{2}$ and bounding $\alpha_{\ell}$ as $\alpha_{\ell}\leq\alpha_{1}$ (due to the monotonicity of $\alpha_{\ell}$ shown in Appendix A-B below) in Theorem 1. We explicitly state where we diverge from [10, Lem. 5] in the proof in Appendix A-A below.

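Remark 1.

In view of the Gärtner-Ellis theorem [58, Th. 2.3.6], we conjecture that the bounds (24) and (25) can be reversed in the limit of large $n$ :

$$\limsup_{n\to\infty}P^{+}(n,a,\eta)\leq\limsup_{n\to\infty}\sup_{s\in\mathcal{S}_{n}^{+}}\frac{1}{2n}\sum_{\ell=1}^{n-1}\log(1-2\sigma^{2}\alpha_{\ell}),\tag{26}$$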

and similarly for (25).

III-B Asymptotic Lower Bounds

We next present our bounds on the error exponents, that is, the limits of $P^{+}(n,a,\eta)$ , $P^{-}(n,a,\eta)$ and $P(n,a,\eta)$ as $n$ tends to infinity. To take limits using (24) and (25), we need to understand the two sequences of sets $\mathcal{S}_{n}^{+}$ and $\mathcal{S}_{n}^{-}$ . Define the limits of the sets as

[TABLE]

We have the following properties.

Lemma 1.

Fix any constant $\eta>0$ .

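• (Monotone decreasing sets) For any $n\geq 1$ , we have

$$\mathcal{S}_{n+1}^{+}\subseteq\mathcal{S}_{n}^{+},\quad\mathcal{S}_{n+1}^{-}\subseteq\mathcal{S}_{n}^{-}.\tag{29}$$

• (Limits of the sets) It holds that

$$\mathcal{S}_{\infty}^{+}=\left(0,\frac{2\eta}{\sigma^{2}}\right],\tag{30}$$

$$\mathcal{S}_{\infty}^{-}\supsetneq\left(0,\frac{2\eta}{\sigma^{2}}\right].\tag{31}$$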

The proof of Lemma 1 is presented in Appendix A-D below. The exact characterization of $\mathcal{S}_{n}^{+}$ and $\mathcal{S}_{n}^{-}$ for each $n$ using $\eta$ is involved. One can see from the definitions (22) and (23) that

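$$\mathcal{S}_{1}^{+}=\mathcal{S}_{1}^{-}=\left\{s\in\mathbb{R}:0<s<\frac{\eta+\sqrt{1+\eta^{2}}}{\sigma^{2}}\right\}.\tag{32}$$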

To obtain the set $\mathcal{S}_{n+1}^{+}$ from $\mathcal{S}_{n}^{+}$ , we need to solve $\alpha_{n+1}<\frac{1}{2\sigma^{2}}$ , which is equivalent to solving an additional inequality involving a polynomial of degree $n+2$ in $s$ (using the closed-form expression for $\alpha_{n+1}$ in (128) in Appendix A-B below). Fig. 2 presents a plot of $\mathcal{S}_{n}^{+}$ for $n=1,...,5$ . Despite the complexity of the sets $\mathcal{S}_{n}^{+}$ and $\mathcal{S}_{n}^{-}$ , Lemma 1 shows their monotonicity property and limits.
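As a quick consistency check between the first set and the limit set, the following worked inequality assumes that $\mathcal{S}_{1}^{+}$ is the interval $0<s<(\eta+\sqrt{1+\eta^{2}})/\sigma^{2}$ and that $\mathcal{S}_{\infty}^{+}=(0,2\eta/\sigma^{2}]$ ; the square root in the first expression is an assumption about the original formula. Under these forms the limit set sits strictly inside $\mathcal{S}_{1}^{+}$ , as the monotonicity in Lemma 1 requires.

```latex
\eta<\sqrt{1+\eta^{2}}
\;\Longrightarrow\;
\frac{2\eta}{\sigma^{2}}<\frac{\eta+\sqrt{1+\eta^{2}}}{\sigma^{2}}
\;\Longrightarrow\;
\left(0,\frac{2\eta}{\sigma^{2}}\right]\subseteq
\left\{s\in\mathbb{R}:0<s<\frac{\eta+\sqrt{1+\eta^{2}}}{\sigma^{2}}\right\}.
```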

Combining Theorem 1 and Lemma 1, we obtain the following lower bounds on the error exponents. The proof is given in Appendix A-E below.

Theorem 2.

Fix any constant $\eta>0$ . For the ML estimator (2), the following three inequalities hold:

[TABLE]

where

[TABLE]

with the thresholds $\eta_{1}$ and $\eta_{2}$ given by

[TABLE]

Remark 2.

The results in (30)-(31) and (33)-(34) indicate the asymmetry between $P^{+}(n,a,\eta)$ and $P^{-}(n,a,\eta)$ : the set $\mathcal{S}_{\infty}^{-}$ has a larger range than $\mathcal{S}_{\infty}^{+}$ , and $I^{+}(a,\eta)>I^{-}(a,\eta)$ , which suggests that the maximum likelihood estimator $\hat{a}_{\text{ML}}(U_{1}^{n})$ is more likely to underestimate $a$ than to overestimate it.

Fig. 3 presents a comparison of (35), Rantzer’s bound (10) and Bercu and Touati (11). Our bound (35) is tighter than both of them for any $\eta>0$ .

III-C Decreasing Error Thresholds

As the number of samples $n$ increases, it is natural to let the error threshold $\eta$ decrease. In this section, we consider the regime where the error threshold $\eta=\eta_{n}>0$ is a sequence decreasing to 0. In this setting, Theorem 1 still holds and the proof stays the same, except that we replace $\alpha_{\ell}$ and $\beta_{\ell}$ by the length- $n$ sequences $\alpha_{n,\ell}$ and $\beta_{n,\ell}$ for $\ell=1,\ldots,n$ , respectively, where $\alpha_{n,\ell}$ and $\beta_{n,\ell}$ now depend on $\eta_{n}$ instead of a constant $\eta$ :

[TABLE]

The sequence $\beta_{n,\ell}$ is defined in a similar way. For Theorem 2 to remain valid, we require $\eta_{n}$ no smaller than $1/\sqrt{n}$ to ensure that the right sides of (24)-(25) still converge to the right sides of (33)-(34), respectively. Let $\eta_{n}$ be a positive sequence such that

[TABLE]

Theorem 3.

For any $\sigma^{2}>0$ and $a>1$ , let $\eta_{n}>0$ be a positive sequence satisfying (41). Then, Theorem 1 holds with $\alpha_{\ell}$ replaced by $\alpha_{n,\ell}$ , and $\beta_{\ell}$ by $\beta_{n,\ell}$ , and Theorem 2 holds with (33) and (34) replaced, respectively, by

[TABLE]

The proof of Theorem 3 is presented in Appendix A-F below. Theorem 3 is a quite strong result as it states that even if the error threshold is a sequence decreasing to zero, as long as (41) is satisfied, the probability of estimation error exceeding such decreasing thresholds is still exponentially small, with exponent being at least $\log a$ .

Corollary 1.

For any $\sigma^{2}>0$ and any $a>1$ , there exists a constant $c\geq\frac{1}{2}\log(a)$ such that for all $n$ large enough,

[TABLE]

Corollary 1 is used in Section IV-E below to derive the dispersion of nonstationary Gauss-Markov sources. The proof of Corollary 1 is by applying Theorem 3 with $\eta_{n}$ chosen as

[TABLE]
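The exponential decay of the estimation error can be illustrated with a small Monte Carlo sketch. It assumes that the estimator (2) is the usual least-squares ratio $\sum_{i=1}^{n-1}u_{i}u_{i+1}/\sum_{i=1}^{n-1}u_{i}^{2}$; the function names, the seed, the number of trials, and the value $a=1.2$ are illustrative choices and are not taken from the paper.

```python
import numpy as np

def simulate_gauss_markov(a, sigma, n, rng):
    """Draw u_1..u_n from U_i = a * U_{i-1} + Z_i with U_0 = 0."""
    z = rng.normal(0.0, sigma, size=n)
    u = np.empty(n)
    prev = 0.0
    for i in range(n):
        prev = a * prev + z[i]
        u[i] = prev
    return u

def ml_estimate(u):
    """Assumed form of (2): sum_i u_i u_{i+1} / sum_i u_i^2, i = 1..n-1."""
    return float(np.dot(u[:-1], u[1:]) / np.dot(u[:-1], u[:-1]))

def median_abs_error(a, sigma, n, trials=200, seed=0):
    """Median of |a_hat - a| over independent sample paths of length n."""
    rng = np.random.default_rng(seed)
    errors = [abs(ml_estimate(simulate_gauss_markov(a, sigma, n, rng)) - a)
              for _ in range(trials)]
    return float(np.median(errors))

if __name__ == "__main__":
    for n in (20, 50, 100):
        print(n, median_abs_error(a=1.2, sigma=1.0, n=n))
```

In such a run the typical error should shrink rapidly with $n$, which is the qualitative behavior that Theorem 3 and Corollary 1 quantify; the sketch does not evaluate the bounds themselves.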

III-D Generalization to sub-Gaussian $Z_{i}$ ’s

In this section, we generalize the above results to the case where $Z_{i}$ ’s in (1) are zero-mean sub-Gaussian random variables. This general result is of independent interest and will not be used in the rest of the paper.

Definition 1 (sub-Gaussian random variable, e.g. [61, Def. 2.7]).

Fix $\sigma>0$ . A random variable $Z\in\mathbb{R}$ with mean $\mu$ is said to be $\sigma$ -sub-Gaussian with variance proxy $\sigma^{2}$ if its moment-generating function (MGF) satisfies

[TABLE]

for all $s\in\mathbb{R}$ .
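As a quick numerical illustration of Definition 1, the sketch below checks the condition on a grid of $s$ for two bounded examples. It assumes that the elided condition has the standard form $\mathbb{E}[\exp(s(Z-\mu))]\leq\exp(\sigma^{2}s^{2}/2)$; the examples (a Rademacher variable with $\sigma=1$ and a Bernoulli variable with $\sigma=1/2$, both justified by Hoeffding's lemma) and all function names are illustrative.

```python
import numpy as np

def variance_proxy_bound(s, sigma):
    """Right side of the assumed sub-Gaussian condition: exp(sigma^2 s^2 / 2)."""
    return np.exp(0.5 * (sigma * s) ** 2)

def rademacher_mgf(s):
    """E[exp(s Z)] for Z uniform on {-1, +1}, which has mean 0."""
    return np.cosh(s)

def centered_bernoulli_mgf(s, p):
    """E[exp(s (Z - p))] for Z ~ Bernoulli(p)."""
    return (1.0 - p) * np.exp(-s * p) + p * np.exp(s * (1.0 - p))

def holds_on_grid(mgf_values, bound_values):
    """True if the MGF stays below the bound (up to rounding) on the grid."""
    return bool(np.all(mgf_values <= bound_values * (1.0 + 1e-12)))

s_grid = np.linspace(-8.0, 8.0, 1601)
rademacher_ok = holds_on_grid(rademacher_mgf(s_grid),
                              variance_proxy_bound(s_grid, 1.0))
bernoulli_ok = holds_on_grid(centered_bernoulli_mgf(s_grid, 0.3),
                             variance_proxy_bound(s_grid, 0.5))
```

A grid check is not a proof; it only shows how the variance proxy enters the condition.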

One important property of $\sigma$ -sub-Gaussian random variables is the following well-known bound on the MGF of quadratic functions of $\sigma$ -sub-Gaussian random variables.

Lemma 2 ([10, Prop. 2]).

Let $Z$ be a $\sigma$ -sub-Gaussian random variable with mean $\mu$ . Then

[TABLE]

for any $s<\frac{1}{2\sigma^{2}}$ .

Equality holds in (46) and (47) when $Z$ is Gaussian. In particular, the right side of (47) is the MGF of the noncentral $\chi^{2}$ -distributed random variable $Z^{2}$ .

Theorem 4 (Generalization to sub-Gaussian case).

Theorems 1–3 and Lemma 1 remain valid for the estimator (2) when $Z_{i}$ ’s in (1) are i.i.d. zero-mean $\sigma$ -sub-Gaussian random variables.

The generalizations of Theorems 1–3 and Lemma 1 from Gaussian to sub-Gaussian $Z_{i}$ ’s only require minor changes in the corresponding proofs. See Appendix A-G for the details.

IV The Dispersion of a Nonstationary Gauss-Markov Source

IV-A Rate-distortion functions

For a generic random process $\{X_{i}\}_{i=1}^{\infty}$ , the $n$ -th order (informational) rate-distortion function $\mathbb{R}_{X_{1}^{n}}(d)$ is defined as

[TABLE]

where $X_{1}^{n}\triangleq(X_{1},\ldots,X_{n})^{\prime}$ is the $n$ -dimensional random vector determined by the random process, $I(X_{1}^{n};Y_{1}^{n})$ is the mutual information between $X_{1}^{n}$ and $Y_{1}^{n}$ , $d$ is a given distortion threshold, and $\mathsf{d}\left(\cdot,\cdot\right)$ is the distortion measure defined in (12) in Sec. II-B above. The rate-distortion function $\mathbb{R}_{X}(d)$ is defined as

[TABLE]

For a wide class of sources, $\mathbb{R}_{X}(d)$ has been shown to be equal to the minimum achievable source coding rate under the average distortion criterion, in the limit of $n\to\infty$ , see [11] for discrete memoryless sources and [12] for general ergodic sources. In particular, Gray’s coding theorem [17, Th. 2] for the Gaussian autoregressive processes directly implies that for the Gauss-Markov source $\{U_{i}\}_{i=1}^{\infty}$ in (1) for any $a\in\mathbb{R}$ , its rate-distortion function $\mathbb{R}_{U}(d)$ equals the minimum achievable source coding rate under the average distortion criterion as $n$ tends to infinity. The $n$ -th order rate-distortion function $\mathbb{R}_{U_{1}^{n}}(d)$ of the Gauss-Markov source is given by the $n$ -th order reverse waterfilling, e.g. [17, Eq. (22)]:

[TABLE]

where $\theta_{n}>0$ is the $n$ -th order water level, and $\mu_{n,i}$ ’s for $i\in[n]$ (sorted in nondecreasing order) are the eigenvalues of the $n\times n$ matrix $\mathsf{F}^{\prime}\mathsf{F}$ with $\mathsf{F}$ being an $n\times n$ lower triangular matrix defined as

[TABLE]

One can check that $\sigma^{2}(\mathsf{F}^{\prime}\mathsf{F})^{-1}$ is the covariance matrix of $U_{1}^{n}$. To use (50)-(51), one first solves (51) for the $n$-th order water level $\theta_{n}$ at a given distortion threshold $d$, and then plugs that water level into (50) to obtain $\mathbb{R}_{U_{1}^{n}}(d)$. The rate-distortion function $\mathbb{R}_{U}(d)$ of the Gauss-Markov source is given by the limiting reverse waterfilling:

[TABLE]

where $\theta>0$ is the limiting water level and $g(w)$ is a function from $[-\pi,\pi]$ to $\mathbb{R}$ given by

[TABLE]

The rate-distortion function of the Gaussian memoryless source $\{Z_{i}\}_{i=1}^{\infty}$ (the special case when $a$ is set to 0 in the Gauss-Markov model) is [11]

[TABLE]

One can obtain (56) from (53)-(54) by noting that $g(w)=1$ for $a=0$ , which further simplifies (54) to $d=\theta$ , and (53) to (56). See Fig. 4 for a plot of $\mathbb{R}_{U}(d)$ and $\mathbb{R}_{Z}(d)$ .
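The limiting reverse waterfilling can be evaluated numerically, as in the following sketch. Since the displayed formulas are not reproduced here, the code assumes the forms $g(w)=1+a^{2}-2a\cos w$, $\mathbb{R}_{U}(d)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{1}{2}\log\max\left[g(w),\sigma^{2}/\theta\right]dw$ and $d=\frac{1}{2\pi}\int_{-\pi}^{\pi}\min\left[\theta,\sigma^{2}/g(w)\right]dw$, which are consistent with the remark that $g(w)=1$ for $a=0$. Rates are in nats; the function names and the grid size are illustrative.

```python
import numpy as np

def g(w, a):
    """Assumed form of (55): g(w) = 1 + a^2 - 2 a cos(w)."""
    return 1.0 + a * a - 2.0 * a * np.cos(w)

def limiting_rd(theta, a, sigma2=1.0, num=200_000):
    """Return (d, R) at water level theta under the assumed forms of (53)-(54).

    Averaging over a uniform midpoint grid on [-pi, pi] approximates
    (1 / (2 pi)) times the integral over that interval.
    """
    w = -np.pi + (np.arange(num) + 0.5) * (2.0 * np.pi / num)
    gw = g(w, a)
    d = np.mean(np.minimum(theta, sigma2 / gw))
    rate = np.mean(0.5 * np.log(np.maximum(gw, sigma2 / theta)))
    return float(d), float(rate)

if __name__ == "__main__":
    print(limiting_rd(theta=0.3, a=0.0))   # memoryless case
    print(limiting_rd(theta=1.0, a=2.0))   # largest water level for a = 2
```

Under these assumptions, setting $a=0$ reproduces $d=\theta$ and the rate $\frac{1}{2}\log(\sigma^{2}/d)$, and for $a=2$ the rate at the largest water level $\sigma^{2}/(a-1)^{2}$ evaluates to about $\log a$.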

IV-B Operational Dispersion

To characterize the convergence rate of the minimum achievable source coding rate $R(n,d,\epsilon)$ (defined in (14) in Section II-B above) to the rate-distortion function, we define the operational dispersion $V_{U}(d)$ for the Gauss-Markov source as

[TABLE]

where $Q^{-1}$ denotes the inverse Q-function. The main result in the second part of this paper gives $V_{U}(d)$ for the nonstationary Gauss-Markov source.

IV-C Informational Dispersion

The $\mathsf{d}$ -tilted information [26, Def. 6] is the key random variable in our nonasymptotic analysis of $R(n,d,\epsilon)$ . Under other names, the $\mathsf{d}$ -tilted information has also been studied by Blahut [62, Th. 4] and Kontoyiannis [37, Sec. III-A]. Using the definition in [26, Def. 6], the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(u_{1}^{n},d)$ in $u_{1}^{n}$ is

[TABLE]

where $\lambda_{n}^{\star}$ is the negative slope of $\mathbb{R}_{U_{1}^{n}}(d)$ at the distortion level $d$ and $V_{1}^{\star n}$ is the random variable that achieves the infimum in (48) for $U_{1}^{n}$ . In [48, Lem. 7, Eq. (228)], by a decorrelation argument, we obtained the following expression for the $\mathsf{d}$ -tilted information for the Gauss-Markov source: for any $a\in\mathbb{R}$ and any $n\in\mathbb{N}$ ,

[TABLE]

where $\theta_{n}>0$ is given by (51), $x_{1}^{n}\triangleq\mathsf{S}^{\prime}u_{1}^{n}$ with $\mathsf{S}$ being an $n\times n$ orthonormal matrix that diagonalizes $(\mathsf{F}^{\prime}\mathsf{F})^{-1}$ , and

[TABLE]

with $\mu_{n,i}$ ’s being the eigenvalues of the $n\times n$ matrix $\mathsf{F}^{\prime}\mathsf{F}$ . We refer to the random variable $X_{1}^{n}$ , defined by

[TABLE]

as the decorrelation of $U_{1}^{n}$ . Note that the decorrelation $X_{1}^{n}$ has independent coordinates and

[TABLE]

Using (50)-(51) and (62), one can show [48, Eq. (55) and (228)] that the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(u_{1}^{n},d)$ in $u_{1}^{n}$ for the Gauss-Markov source satisfies $\jmath_{U_{1}^{n}}(u_{1}^{n},d)=\jmath_{X_{1}^{n}}(x_{1}^{n},d)$ . The minimum achievable source coding rates (defined in (14)) for lossy compression of $U_{1}^{n}$ and $X_{1}^{n}$ are equal, as are their rate-distortion functions: $\mathbb{R}_{U_{1}^{n}}(d)=\mathbb{R}_{X_{1}^{n}}(d)$ , see [48, Sec. III.A] for the details. It is known [26, Property 1] that the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(u_{1}^{n},d)$ satisfies (by the Karush-Kuhn-Tucker conditions for the optimization problem (48))

[TABLE]

The informational dispersion $\mathbb{V}_{U}(d)$ is defined as the limit of the variance of the $\mathsf{d}$ -tilted information normalized by $n$ :

[TABLE]

By decorrelating the Gauss-Markov source $U_{1}^{n}$ and analyzing the limiting behavior of the eigenvalues of the covariance matrix of $U_{1}^{n}$ , we obtain the following reverse waterfilling representation for the informational dispersion. The proof is given in Appendix B-A below.

Lemma 3.

The informational dispersion of the nonstationary Gauss-Markov source is given by

[TABLE]

where $\theta>0$ is given in (54), and $g$ is in (55).

Notice that the informational dispersion in the nonstationary case is given by the same expression as in the stationary case [48, Eq. (57)]. It is known, e.g. [26, Eq. (94)] and [25, Sec. IV], that the informational dispersion for the Gaussian memoryless source $\{Z_{i}\}_{i=1}^{\infty}$ is

[TABLE]

See Fig. 5 for a plot of $\mathbb{V}_{U}(d)$ and $\mathbb{V}_{Z}(d)$ .
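A numerical sketch of the informational dispersion is given below. It assumes that the expression in Lemma 3 has the same form as the stationary-case formula, $\mathbb{V}_{U}(d)=\frac{1}{4\pi}\int_{-\pi}^{\pi}\min\left[1,\left(\frac{\sigma^{2}}{\theta g(w)}\right)^{2}\right]dw$ with $g(w)=1+a^{2}-2a\cos w$, and that the memoryless value is $1/2$ (in nats squared). Both forms are assumptions made for illustration, as are the function names.

```python
import numpy as np

def g(w, a):
    """Assumed form of (55): g(w) = 1 + a^2 - 2 a cos(w)."""
    return 1.0 + a * a - 2.0 * a * np.cos(w)

def informational_dispersion(theta, a, sigma2=1.0, num=200_000):
    """Assumed form of Lemma 3, evaluated by a midpoint rule on [-pi, pi]."""
    w = -np.pi + (np.arange(num) + 0.5) * (2.0 * np.pi / num)
    ratio = sigma2 / (theta * g(w, a))
    # (1 / (4 pi)) * integral = 0.5 * average over the uniform grid.
    return float(0.5 * np.mean(np.minimum(1.0, ratio ** 2)))

if __name__ == "__main__":
    a = 2.0
    for theta in (1.0 / 9.0, 0.3, 1.0):
        print(theta, informational_dispersion(theta, a))
```

Under the assumed form, the dispersion equals $1/2$ whenever the water level lies below $\sigma^{2}/\max_{w}g(w)$ and is strictly smaller above it.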

IV-D A Few Remarks

In view of (54), there are two special water levels $\theta_{\min}$ and $\theta_{\max}$ , defined as follows:

[TABLE]

and

[TABLE]

The critical distortion $d_{c}$ is defined as the distortion corresponding to the water level $\theta_{\min}$ . By (54), we have

[TABLE]

The maximum distortion $d_{\max}$ is defined as the distortion corresponding to the water level $\theta_{\max}$ . By (54), we have

[TABLE]

Using similar techniques as in [48, Eq. (169)–(172)], one can compute the integral in (70) as

[TABLE]

In this paper, we always consider a fixed distortion threshold $d$ such that $0<d<d_{\max}$ .
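The special distortion levels can be checked numerically. The sketch below assumes $g(w)=1+a^{2}-2a\cos w$, so that $\theta_{\min}=\sigma^{2}/(a+1)^{2}$ and $\theta_{\max}=\sigma^{2}/(a-1)^{2}$, and it assumes that $d_{\max}=\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{\sigma^{2}}{g(w)}dw$. It compares the numerical integral with the standard identity $\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{dw}{1+a^{2}-2a\cos w}=\frac{1}{a^{2}-1}$ for $a>1$. The names below are illustrative.

```python
import numpy as np

def special_points(a, sigma2=1.0, num=200_000):
    """Return theta_min, theta_max, d_c and two evaluations of d_max (a > 1)."""
    w = -np.pi + (np.arange(num) + 0.5) * (2.0 * np.pi / num)
    gw = 1.0 + a * a - 2.0 * a * np.cos(w)      # assumed form of g(w)
    theta_min = sigma2 / (a + 1.0) ** 2          # sigma^2 / max g
    theta_max = sigma2 / (a - 1.0) ** 2          # sigma^2 / min g
    d_max_numeric = float(np.mean(sigma2 / gw))  # (1/(2 pi)) * integral
    d_max_closed = sigma2 / (a * a - 1.0)        # standard integral identity
    return {
        "theta_min": theta_min,
        "theta_max": theta_max,
        "d_c": theta_min,  # below theta_min the water level equals d
        "d_max_numeric": d_max_numeric,
        "d_max_closed": d_max_closed,
    }

if __name__ == "__main__":
    print(special_points(a=2.0))
```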

Remark 3.

Gray [17, Eq. (24)] showed the following relation between the rate-distortion function $\mathbb{R}_{U}(d)$ of the Gauss-Markov source and $\mathbb{R}_{Z}(d)$ of the Gaussian memoryless source:

[TABLE]

Using Lemma 3 above, one can easily show (in the same way as [48, Cor. 1]) that their dispersions are also comparable:

[TABLE]

The results in (72)-(73) imply that for low distortions $d\in(0,d_{c})$, the minimum achievable source coding rates for compressing the Gauss-Markov source and the Gaussian memoryless source are the same up to second-order terms, a phenomenon we observed in the stationary case as well [48, Cor. 1]. See Fig. 4 and Fig. 5 for a visualization of (72) and (73), respectively.

Remark 4.

For the function $\mathbb{R}_{U}(d)$ , we show that

[TABLE]

This result has an interesting connection to the problem of control under communication constraints: in [63] [64, Th. 1] [65, Prop. 3.1], it was shown that the minimum rate to asymptotically stabilize a linear, discrete-time, scalar system is also $\log a$ . The result in (74) implies that stability cannot be attained with any rate lower than $\log a$ even if an infinite lookahead is allowed. The derivation of (74) is presented in Appendix B-C below.

Remark 5.

Let $P_{1}$ and $P_{2}$ be the two special points on the curve $\mathbb{V}_{U}(d)$ at distortions $d_{c}$ and $d_{\mathrm{max}}$ , respectively. Then, the coordinates of $P_{1}$ and $P_{2}$ are given by

[TABLE]

The derivation for $P_{2}$ is the same as that in the stationary case [48, Eq. (61)] except that we need to compute the residue at $1/a$ instead of at $a$ since we now have $a>1$ , see [48, App. B-A] for details.

IV-E Second-order Coding Theorem

Our main result establishes the equality between the operational dispersion and the informational dispersion.

Theorem 5 (Gaussian approximation).

For the Gauss-Markov source (1) with $a>1$ , any fixed excess-distortion probability $\epsilon\in(0,1)$ , and distortion threshold $d\in(0,d_{\mathrm{max}})$ , it holds that

[TABLE]

Specifically, we have the following converse and achievability.

Theorem 6 (Converse).

For the Gauss-Markov source with $a>1$ , any fixed excess-distortion probability $\epsilon\in(0,1)$ , and distortion threshold $d$ , the minimum achievable source coding rate $R(n,d,\epsilon)$ satisfies

[TABLE]

where $Q^{-1}$ denotes the inverse Q-function, $\mathbb{R}_{U}(d)$ is the rate-distortion function given in (53), and $\mathbb{V}_{U}(d)$ is the informational dispersion given by Lemma 3 above.

The converse proof is similar to that in the asymptotically stationary case in [48, Th. 7]. See Appendix D for the details.

Theorem 7 (Achievability).

In the setting of Theorem 6, the minimum achievable source coding rate $R(n,d,\epsilon)$ satisfies

[TABLE]

Theorem 5 follows immediately from Theorems 6 and 7. Central to the achievability proof of Theorem 7 is the following random coding bound: there exists an $(n,M,d,\epsilon)$ code such that [26, Cor. 11]

[TABLE]

where the infimization is over all random variables defined on $\mathbb{R}^{n}$ and $\mathcal{B}(u_{1}^{n},d)$ denotes the distortion $d$-ball around $u_{1}^{n}$:

[TABLE]

To obtain the achievability in (78) from (79), we need to bound from below the probability $P_{V_{1}^{n}}(\mathcal{B}(U_{1}^{n},d))$ that $V_{1}^{n}$ falls within the distortion $d$ -ball $\mathcal{B}(U_{1}^{n},d)$ , where $V_{1}^{n}$ and $U_{1}^{n}$ are independent, in terms of the informational dispersion. This connection is made via the following second-order refinement of the “lossy AEP” (asymptotic equipartition property [11, Lem. 1] [38, Th. 1] [26, Lem. 2]) that applies to the nonstationary Gauss-Markov sources.

Lemma 4 (Second-order lossy AEP for the nonstationary Gauss-Markov sources).

For the Gauss-Markov source with $a>1$, let $P_{V_{1}^{\star n}}$ be the distribution of the random variable $V_{1}^{\star n}$ that attains the minimum in (48) with $X_{1}^{n}$ there replaced by $U_{1}^{n}$. It holds that

[TABLE]

where

[TABLE]

and $c_{i}$ ’s, $i=1,...,4$ , are positive constants depending only on $a$ and $d$ .

The proof of Lemma 4 is presented in Appendix F-E below. The proof of Theorem 7, which uses the random coding bound (79) and Lemma 4, is presented in Appendix E below.

IV-F The Connection between Lossy AEP and Parameter Estimation

The proof of the lossy AEP in the form of Lemma 4 is technical even for stationary memoryless sources [26, Lem. 2]. A lossy AEP for stationary $\alpha$-mixing processes was derived in [38, Cor. 17]. For stationary memoryless sources with single-letter distribution $P_{X}$, the idea in [26, Lem. 2] is to form a typical set $\mathcal{F}_{n}$ of source outcomes [26, Lem. 4] using the product of the empirical distributions [26, Eq. (270)]: $P_{\hat{X}}\times\ldots\times P_{\hat{X}}$, where $P_{\hat{X}}(x)\triangleq\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}\{x_{i}=x\}$ is the empirical distribution of a given source sequence $x_{1}^{n}$. One then shows that the inequality inside the bracket in (81) holds for $x_{1}^{n}\in\mathcal{F}_{n}$ and that the probability of the complement set $\mathcal{F}_{n}^{c}$ is at most $1/q(n)$, where $p(n)=C\log n+c$ and $q(n)=K/\sqrt{n}$ [26, Lem. 2]. The Gauss-Markov source is not memoryless, and it is nonstationary for $a>1$. To form a typical set of source outcomes, we define the following proxy random variables using the estimator $\hat{a}_{\text{ML}}(u_{1}^{n})$ in (2).

Definition 2 (Proxy random variables).

For each sequence $u_{1}^{n}$ of length $n$ generated by the Gauss-Markov source, define the proxy random variable $\hat{X}_{1}^{n}$ as an $n$ -dimensional Gaussian random vector with independent coordinates, each of which follows the distribution $\mathcal{N}(0,\hat{\sigma}_{n,i}^{2})$ with

[TABLE]

where $\hat{a}_{\text{ML}}(u_{1}^{n})$ is in (2) above.

Remark 6.

The proxy random variable in Definition 2 differs from that in [48, Eq. (119)] for the stationary case in the behavior of the largest variance $\hat{\sigma}_{n,1}^{2}$ . For each realization $u_{1}^{n}$ , we construct the Gaussian random vector $\hat{X}_{1}^{n}$ according to (84)-(85), which is a proxy to the decorrelation $X_{1}^{n}$ in (61) above. The variances of $\hat{X}_{i}$ and $X_{i}$ are very close due to the closeness of $\hat{a}_{\text{ML}}(u_{1}^{n})$ to $a$ (Corollary 1).

Remark 7.

Since the proxy random variable $\hat{X}_{1}^{n}$ depends on the realization of $U_{1}^{n}$ , Definition 2 defines the joint distribution of $(X_{1}^{n},\hat{X}_{1}^{n})$ , where $X_{1}^{n}$ is the decorrelation of $U_{1}^{n}$ in (61) above.

The following convex optimization problem will be instrumental: for two generic random vectors $A_{1}^{n}$ and $B_{1}^{n}$ with distributions $P_{A_{1}^{n}}$ and $P_{B_{1}^{n}}$ , respectively, define

[TABLE]

where $D(P_{F_{1}^{n}|A_{1}^{n}}||P_{B_{1}^{n}}|P_{A_{1}^{n}})$ is the conditional relative entropy. See Appendix F-B for detailed discussions on this optimization problem.

For each realization $u_{1}^{n}$ (equivalently, each $x_{1}^{n}=\mathsf{S}^{\prime}u_{1}^{n}$ with the $n\times n$ matrix $\mathsf{S}$ defined in the text above (60)), we define $n$ random variables $m_{i}(u_{1}^{n})$, $i=1,\ldots,n$, as follows.

• Let $X_{1}^{n}$ be the decorrelation of $U_{1}^{n}$ in (61) above. Let $Y_{1}^{\star n}$ be the random variable that attains the infimum in $\mathbb{R}_{X_{1}^{n}}(d)$.

• For each $u_{1}^{n}$, choose $A_{1}^{n}$ in (86) to be the proxy random variable $\hat{X}_{1}^{n}$, and choose $B_{1}^{n}$ to be $Y_{1}^{\star n}$. Let $\hat{F}_{1}^{\star n}$ be the random variable that attains the infimum in $\mathbb{R}(\hat{X}_{1}^{n},Y_{1}^{\star n},d)$.

Then, for each $i=1,\ldots,n$ , define

[TABLE]

Denote

[TABLE]

The typical set for the Gauss-Markov source is then defined as follows.

Definition 3 (Typical set).

For any $d\in(0,d_{\mathrm{max}})$ , $n\geq 2$ and a constant $p>0$ , define $\mathcal{T}(n,p)$ to be the set of vectors $u_{1}^{n}\in\mathbb{R}^{n}$ that satisfy the following conditions:

[TABLE]

where $x_{1}^{n}=\mathsf{S}^{\prime}u_{1}^{n}$ is the decorrelation (61) and $\sigma_{n,i}^{2}$ ’s are defined in (60) above.

The typical set in Definition 3 is in the same form as that in the stationary case [48, Def. 2], but the definitions of proxy random variables and the analyses are different.

Theorem 8.

For any $d\in(0,d_{\mathrm{max}})$ , there exists a constant $p>0$ such that the probability that the Gauss-Markov source produces a typical sequence satisfies

[TABLE]

Corollary 1 is essential to the proof of Theorem 8. See the details in Appendix F-C.

Let $\mathcal{E}$ denote the event inside the square bracket in (81). To prove Lemma 4, we intersect $\mathcal{E}$ with the typical set $\mathcal{T}(n,p)$ and the complement $\mathcal{T}(n,p)^{c}$ , respectively, and then we bound the probability of the two intersections separately. See Appendix F-E for the details.

V Discussion

V-A Stationary and Nonstationary Gauss-Markov Processes

It took several decades [13, 15, 17, 22, 19] to completely understand the difference in rate-distortion functions between stationary and nonstationary Gaussian autoregressive sources. We briefly summarize this subtle difference here to make the point that generalizing results from the stationary case to the nonstationary one is natural but nontrivial.

Since $\det(\mathsf{F})=1$ , the eigenvalues $\mu_{n,i}$ ’s of $\mathsf{F}^{\prime}\mathsf{F}$ satisfy

[TABLE]

Using (93), we can equivalently rewrite (50) as

[TABLE]

where $\theta_{n}>0$ is in (51) and $\sigma_{n,i}^{2}$ ’s are in (60). Both (50) and (94) are valid expressions for the $n$ -th order rate-distortion function $\mathbb{R}_{U_{1}^{n}}(d)$ , regardless of whether the source is stationary or nonstationary. The classical Kolmogorov reverse waterfilling result [13, Eq. (18)], obtained by taking the limit in (94), implies that the rate-distortion function of the stationary Gauss-Markov source ( $0<a<1$ ) is given by (the subscript K stands for Kolmogorov)

[TABLE]

where $\theta>0$ is given in (54) and $g(w)$ is given in (55). While (53) and (54) are valid for both stationary and nonstationary cases, Hashimoto and Arimoto [22] noticed in 1980 that (95) is incorrect for the nonstationary Gaussian autoregressive source. The reason is the different asymptotic behaviors of the eigenvalues $\mu_{n,i}$ ’s of $\mathsf{F}^{\prime}\mathsf{F}$ (52) in the stationary and nonstationary cases: while in the stationary case, the spectrum is bounded away from zero, in the nonstationary case, the smallest eigenvalue $\mu_{n,1}$ approaches 0, causing a discontinuity. By treating that smallest eigenvalue in a special way, Hashimoto and Arimoto [22, Th. 2] showed that

[TABLE]

is the correct rate-distortion function for both stationary and nonstationary Gauss-Markov sources, where the subscript HA stands for the authors of [22]. For the general higher-order Gaussian autoregressive source, the correction term needed in (96) depends on the unstable roots of the characteristic polynomial of the source, see [22, Th. 2] for the details. In 2008, Gray and Hashimoto [19] showed the equivalence between $\mathbb{R}_{\text{HA}}(d)$ in (96), obtained by taking a limit in (94), and Gray’s result $\mathbb{R}_{U}(d)$ in (53), obtained by taking a limit in (50).

The tool that allows one to take limits in (94) and (50) is the following theorem on the asymptotic eigenvalue distribution of the almost Toeplitz matrix $\mathsf{F}^{\prime}\mathsf{F}$ , which is the (rescaled) inverse of the covariance matrix of $U_{1}^{n}$ . Denote

[TABLE]

and

[TABLE]

Gray [66, Th. 2.4] generalized the result of Grenander and Szegö [67, Th. in Sec. 5.2] on the asymptotic eigenvalue distribution of Toeplitz forms to that of matrices that are asymptotically equivalent to Toeplitz forms, see [66, Chap. 2.3] for the details. Define

[TABLE]

Theorem 9 (Gray [17, Eq. (19)], Hashimoto and Arimoto [22, Th. 1]).

For any continuous function $F(t)$ over the interval

[TABLE]

the eigenvalues $\mu_{n,i}$ ’s of $\mathsf{F}^{\prime}\mathsf{F}$ with $\mathsf{F}$ in (52) satisfy

[TABLE]

where $g(w)$ is defined in (55).

The eigenvalues $\mu_{n,i}$ ’s behave quite differently in the following three cases, leading to the subtle difference in the corresponding rate-distortion functions.

1. For the stationary case $a\in(0,1)$, it can be easily shown [48, Eq. (71)] that $\alpha^{\prime}=\alpha>0$ and all eigenvalues $\mu_{n,i}$'s lie in between $\alpha$ and $\beta$. Kolmogorov's formula (95) is obtained by applying Theorem 9 to (94) using the function

[TABLE]

where $\theta>0$ is given by (54).

2. For the Wiener process ($a=1$), closed-form expressions of $\mu_{n,i}$'s are given by Berger [15, Eq. (2)]. Those results imply that the smallest eigenvalue $\mu_{n,1}$ is of order $\Theta\left(\frac{1}{n^{2}}\right)$, and thus $\alpha^{\prime}=\alpha=0$. Using the same function as in (102), Berger obtained the rate-distortion function for the Wiener process [15, Eq. 4]. (Footnote 1: To be precise, although the rate-distortion function for the Wiener process is correct in [15, Eq. 4], the proof there is not rigorous since in this case $\alpha^{\prime}=\alpha=0$ but $F_{\text{K}}(t)$ is not continuous at $t=0$, as pointed out in [19, Eq. (23)]. Therefore, the limit leading to [15, Eq. 4] needs extra justification.)

3. For the nonstationary case $a>1$, we have $\alpha^{\prime}=0<\alpha$, the smallest eigenvalue $\mu_{n,1}$ is of order $\Theta(a^{-2n})$ and the other $n-1$ eigenvalues lie in between $\alpha$ and $\beta$. This behavior of eigenvalues was shown by Hashimoto and Arimoto [22, Lemma] for higher-order Gaussian autoregressive sources, and we will show a refined version for the Gauss-Markov source in Lemma 5 below. As pointed out in [22, Th. 1], an application of Theorem 9 using the function (102) fails to yield the correct rate-distortion function for nonstationary sources due to the discontinuity of $F_{\text{K}}(t)$ at 0. Gray [17, Eq. (22)] and Hashimoto and Arimoto [22] circumvent this difficulty in two different ways, which lead to (53) and (96), respectively. Gray [17] applied Theorem 9 on (50) using the function

[TABLE]

which is indeed continuous at $t=0$, while Hashimoto and Arimoto [22, Th. 2] still use the function $F_{\text{K}}(t)$ but consider $\mu_{n,1}$ and $\mu_{n,i}$, $i\geq 2$, separately:

[TABLE]

which in the limit yields (96) by plugging $\mu_{n,1}=\Theta(a^{-2n})$ into (102).

V-B New Results on the Spectrum of the Covariance Matrix

The following result on the scaling of the eigenvalues $\mu_{n,i}$ ’s refines [22, Lemma]. Its proof is presented in Appendix B-D.

Lemma 5.

Fix $a>1$ . For any $i=2,\ldots,n$ , the eigenvalues of $\mathsf{F}^{\prime}\mathsf{F}$ (52) are bounded as

[TABLE]

where

[TABLE]

The smallest eigenvalue is bounded as

[TABLE]

where $c_{1}>0$ and $c_{2}$ are constants given by

[TABLE]

Remark 8.

The constant $c_{1}$ in (108) is positive, while $c_{2}$ in (109) can be positive, zero or negative, depending on the value of $a>1$ . Lemma 5 indicates that $a^{-2n}$ is a good approximation to $\mu_{n,1}$ . Using (105)–(106), we deduce that for $i=2,\ldots,n$ ,

[TABLE]
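The eigenvalue behavior described above can be inspected numerically, as in the following sketch. It assumes that $\mathsf{F}$ in (52) has ones on the diagonal and $-a$ on the first subdiagonal (so that $\mathsf{F}U_{1}^{n}=Z_{1}^{n}$), and it compares the eigenvalues with $(a-1)^{2}$ and $(a+1)^{2}$, the minimum and maximum of the assumed $g(w)=1+a^{2}-2a\cos w$. It does not evaluate the constants in Lemma 5; the function names and the values $a=1.5$, $n=20$ are illustrative.

```python
import numpy as np

def gram_matrix(a, n):
    """F'F for the assumed F: ones on the diagonal, -a on the subdiagonal."""
    f = np.eye(n) - a * np.eye(n, k=-1)
    return f.T @ f

def sorted_eigenvalues(a, n):
    """Eigenvalues of F'F in nondecreasing order (mu_{n,1} first)."""
    return np.linalg.eigvalsh(gram_matrix(a, n))

if __name__ == "__main__":
    a, n = 1.5, 20
    mu = sorted_eigenvalues(a, n)
    print("mu_1 * a^(2n):", mu[0] * a ** (2 * n))
    print("range of mu_2..mu_n:", mu[1], mu[-1])
    print("bounds (a-1)^2, (a+1)^2:", (a - 1) ** 2, (a + 1) ** 2)
    print("sum of log eigenvalues:", np.sum(np.log(mu)))
```

Because $\det(\mathsf{F})=1$, the logarithms of the eigenvalues sum to zero, so the single small eigenvalue $\mu_{n,1}$ compensates for the product of the remaining $n-1$ eigenvalues.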

Based on Lemma 5, we obtain a nonasymptotic version of Theorem 9, which is useful in the analysis of the dispersion, in particular, in deriving Proposition 1 in Appendix C-A below.

Theorem 10.

Fix any $a>1$ . For any bounded, $L$ -Lipschitz and nondecreasing function (or nonincreasing function) $F(t)$ over the interval (100) and any $n\geq 1$ , the eigenvalues $\mu_{n,i}$ ’s of $\mathsf{F}^{\prime}\mathsf{F}$ (52) satisfy

[TABLE]

where $g(w)$ is defined in (55) and $C_{L}>0$ is a constant that depends on $L$ and the maximum absolute value of $F$ .

The proof of Theorem 10 is in Appendix B-E.
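The convergence in Theorems 9 and 10 can be observed numerically. The sketch below compares $\frac{1}{n}\sum_{i=1}^{n}F(\mu_{n,i})$ with $\frac{1}{2\pi}\int_{-\pi}^{\pi}F(g(w))dw$ for the bounded, Lipschitz, nondecreasing test function $F(t)=\min(t,2)$. It assumes the same forms of $\mathsf{F}$ and $g(w)$ as described in the docstrings; the test function, the names, and the values of $n$ are illustrative, and the constant $C_{L}$ of Theorem 10 is not computed.

```python
import numpy as np

def eigen_average(func, a, n):
    """(1/n) * sum_i func(mu_{n,i}) for the assumed F (1 on diag, -a below)."""
    f = np.eye(n) - a * np.eye(n, k=-1)
    mu = np.linalg.eigvalsh(f.T @ f)
    return float(np.mean(func(mu)))

def spectral_average(func, a, num=200_000):
    """(1/(2 pi)) * integral of func(g(w)) over [-pi, pi], midpoint rule."""
    w = -np.pi + (np.arange(num) + 0.5) * (2.0 * np.pi / num)
    gw = 1.0 + a * a - 2.0 * a * np.cos(w)  # assumed form of g(w)
    return float(np.mean(func(gw)))

def gap(func, a, n):
    """Absolute difference between the eigenvalue and spectral averages."""
    return abs(eigen_average(func, a, n) - spectral_average(func, a))

def clipped(t):
    """Illustrative test function F(t) = min(t, 2)."""
    return np.minimum(t, 2.0)

if __name__ == "__main__":
    for n in (50, 200, 800):
        print(n, gap(clipped, a=1.5, n=n), n * gap(clipped, a=1.5, n=n))
```

If the gap decays on the order of $1/n$, the printed product $n\cdot\text{gap}$ stays bounded as $n$ grows.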

VI Conclusion

In this paper, we obtain nonasymptotic (Theorem 1) and asymptotic (Theorem 2) bounds on the estimation error of the maximum likelihood estimator of the parameter $a$ of the nonstationary scalar Gauss-Markov process. Numerical simulations in Fig. 1 confirm the tightness of our estimation error bounds compared to previous works. As an application of the estimation error bound (Corollary 1), we find the dispersion for lossy compression of the nonstationary Gauss-Markov sources (Theorems 6 and 7). Future research directions include generalizing the error exponent bounds in this paper, applicable to identification of scalar dynamical systems, to vector systems, and finding the dispersion of the Wiener process.

Appendix A

A-A Proof of Theorem 1

Proof.

We present the proof of (24). The proof of (25) is similar and is omitted. For any $n\geq 2$, denote by $\mathcal{F}_{n}$ the $\sigma$-algebra generated by $Z_{1},\ldots,Z_{n}$. For any $s>0$, $\eta>0$, and $n\geq 2$, we define the random variable

[TABLE]

By the Chernoff bound, we have

[TABLE]

To compute $\mathbb{E}\left[W_{n}\right]$ , we first consider the conditional expectation $\mathbb{E}\left[W_{n}|\mathcal{F}_{n-1}\right]$ . Since $Z_{n}$ is the only term in $W_{n}$ that does not belong to $\mathcal{F}_{n-1}$ , we have

[TABLE]

where $\alpha_{1}$ is the deterministic function of $s$ and $\eta$ defined in (18), and (115) follows from the moment generating function of $Z_{n}$ . To obtain a recursion, we then consider the conditional expectation $\mathbb{E}\left[W_{n-1}\cdot\exp\left(\alpha_{1}U_{n-1}^{2}\right)|\mathcal{F}_{n-2}\right]$ . Since $U_{n-1}^{2}$ and $U_{n-2}Z_{n-1}$ are the only two terms in $W_{n-1}\cdot\exp(\alpha_{1}U_{n-1}^{2})$ that do not belong to $\mathcal{F}_{n-2}$ , we use the relation $U_{n-1}=aU_{n-2}+Z_{n-1}$ and we complete squares in $Z_{n-1}$ to obtain

[TABLE]

Furthermore, using the formula for the moment generating function of the noncentral $\chi^{2}$ -distributed random variable

[TABLE]

with 1 degree of freedom, we obtain

[TABLE]

This is where our method diverges from Rantzer [10, Lem. 5], who chooses $s=\frac{\eta}{\sigma^{2}}$ and bounds $\alpha_{2}\leq\alpha_{1}$ (due to Property A4 in Appendix A-B below) in (118). Instead, by conditioning on $\mathcal{F}_{n-3}$ in (118) and repeating the above recursion another $n-2$ times, we compute $\mathbb{E}\left[W_{n}\right]$ exactly using the sequence $\{\alpha_{\ell}\}$:

[TABLE]

If $s\not\in\mathcal{S}_{n}^{+}$ , then by the definition of the set $\mathcal{S}_{n}^{+}$ we have $\mathbb{E}\left[W_{n}\right]=+\infty$ . Therefore,

[TABLE]

∎

A-B Properties of the Sequence $\alpha_{\ell}$

We derive several important elementary properties of the sequences $\alpha_{\ell}$ and $\beta_{\ell}$ . First, we consider $\alpha_{\ell}$ . We find the two fixed points $r_{1}<r_{2}$ of the recursive relation (19) by solving the following quadratic equation in $x$ :

[TABLE]

Property A1

For any $s>0$ and $\eta>0$ , (121) has two roots $r_{1}<r_{2}$ , and $r_{1}<0$ . The two roots $r_{1}$ and $r_{2}$ are given by

[TABLE]

where $\Delta$ denotes the discriminant of (121):

[TABLE]

Proof.

Note that the discriminant $\Delta$ satisfies

[TABLE]

where we used $a>1$ . Then, (122) implies $r_{1}<0$ . ∎

Property A2

For $\frac{2\eta}{\sigma^{2}}\neq s>0$ and $\eta>0$ , the sequence $\frac{\alpha_{\ell}-r_{1}}{\alpha_{\ell}-r_{2}}$ is a geometric sequence with common ratio

[TABLE]

Furthermore,

[TABLE]

and it follows immediately that

[TABLE]

Proof.

Using the recursion (19) and the fact that $r_{1}$ and $r_{2}$ are the fixed points of (19), one can verify that $\frac{\alpha_{\ell}-r_{1}}{\alpha_{\ell}-r_{2}}$ is a geometric sequence with common ratio $q$ given by (126). The relation (127) is verified by direct computations using (122) and (123). ∎
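The mechanism behind Property A2 is a general fact about linear fractional recursions: if $x_{\ell+1}=\frac{Ax_{\ell}+B}{1-Cx_{\ell}}$ has two distinct fixed points $r_{1}<r_{2}$, then $\frac{x_{\ell}-r_{1}}{x_{\ell}-r_{2}}$ is geometric with common ratio $\frac{A+Cr_{1}}{A+Cr_{2}}$. The sketch below checks this numerically. The coefficients $A$, $B$, $C$ and the starting point are hypothetical placeholders chosen for illustration; they are not the coefficients of the recursion (19), and the resulting ratio is not claimed to equal $q$ in (126).

```python
import math

def fixed_points(A, B, C):
    """Roots r1 < r2 of C x^2 + (A - 1) x + B = 0, assuming C > 0."""
    root = math.sqrt((A - 1.0) ** 2 - 4.0 * C * B)
    return ((1.0 - A) - root) / (2.0 * C), ((1.0 - A) + root) / (2.0 * C)

def step(x, A, B, C):
    """One step of the linear fractional recursion x -> (A x + B) / (1 - C x)."""
    return (A * x + B) / (1.0 - C * x)

def cross_ratios(x1, A, B, C, steps):
    """Values of (x_l - r1) / (x_l - r2) along the first `steps` iterates."""
    r1, r2 = fixed_points(A, B, C)
    x, out = x1, []
    for _ in range(steps):
        out.append((x - r1) / (x - r2))
        x = step(x, A, B, C)
    return out

def iterate(x1, A, B, C, steps):
    """Return the iterate reached after `steps` applications of `step`."""
    x = x1
    for _ in range(steps):
        x = step(x, A, B, C)
    return x

if __name__ == "__main__":
    A, B, C = 2.0, -0.1, 2.0          # hypothetical coefficients
    r1, r2 = fixed_points(A, B, C)
    ratios = cross_ratios(-0.1, A, B, C, 6)
    print([ratios[k + 1] / ratios[k] for k in range(5)])
    print((A + C * r1) / (A + C * r2), iterate(-0.1, A, B, C, 80), r1)
```

When the common ratio has absolute value less than one, the sequence converges geometrically to $r_{1}$, which is the pattern stated in Properties A3 and A4.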

Property A3

For any $\frac{2\eta}{\sigma^{2}}\neq s>0$ and $\eta>0$ , we have

[TABLE]

For $s=\frac{2\eta}{\sigma^{2}}$, we have $\alpha_{\ell}=0=r_{2}>r_{1}$ for all $\ell\geq 1$.

Proof.

The limit (130) follows from (127) and (128). Plugging $s=\frac{2\eta}{\sigma^{2}}$ into (18) yields $\alpha_{1}=0$ , which implies by (19) that $\alpha_{\ell}=0$ for $\ell\geq 1$ . ∎

Property A4

For any $0<s\leq\frac{2\eta}{\sigma^{2}}$ , we have $\alpha_{\ell}<0$ and $\alpha_{\ell}$ decreases to $r_{1}$ geometrically. For $s>\frac{2\eta}{\sigma^{2}}$ , (130) still holds, but the convergence is not monotone: there exists an $\ell^{\star}\geq 1$ such that $\alpha_{\ell}>0$ and increases to $\alpha_{\ell^{\star}}$ for $1\leq\ell\leq\ell^{\star}$ ; and $\alpha_{\ell}<0$ and increases to $r_{1}$ for $\ell>\ell^{\star}$ .

Proof.

Due to (129), the monotonicity of $\alpha_{\ell}$ depends on the signs of $r_{2}-r_{1}$ and $\frac{\alpha_{1}-r_{1}}{\alpha_{1}-r_{2}}$ . Note that $r_{2}-r_{1}>0$ by Property A1. Plugging $x=\alpha_{1}$ into (121), we have

[TABLE]

Since $\alpha_{1}<0$ by (18) for $0<s\leq\frac{2\eta}{\sigma^{2}}$, we must also have $\frac{\alpha_{1}-r_{1}}{\alpha_{1}-r_{2}}<0$ by (131). Due to (128) and (129), this immediately implies that $\alpha_{\ell}$ decreases to $r_{1}$. Therefore, $\alpha_{\ell}\leq\alpha_{1}<0$ for all $\ell\geq 1$. For any $s>\frac{2\eta}{\sigma^{2}}$, we have $\alpha_{1}>0$ and $\frac{\alpha_{1}-r_{1}}{\alpha_{1}-r_{2}}>0$. In fact, since $r_{1}<0$, we have $\alpha_{1}>r_{2}$, which implies $\frac{\alpha_{1}-r_{1}}{\alpha_{1}-r_{2}}>1$. Therefore, the conclusion follows from (129). ∎

Property A5

For any $\eta>0$ , the root $r_{1}$ in (122) is a decreasing function in $s>0$ .

Proof.

Direct computations using (122), (124) and the assumption that $a>1$ . ∎

A-C Properties of the Sequence $\beta_{\ell}$

The sequence $\beta_{\ell}$ is analyzed similarly, although it is slightly more involved than $\alpha_{\ell}$ . We only consider $0<s\leq\frac{2\eta}{\sigma^{2}}$ in the rest of this section. We find the two fixed points $t_{1}<t_{2}$ of the recursive relation (21) by solving the following quadratic equation in $x$ :

[TABLE]

Property B1

For $s=\frac{2\eta}{\sigma^{2}}$, we have $\beta_{\ell}=0$ for all $\ell\geq 1$. For any $\eta>0$ and $0<s\leq\frac{2\eta}{\sigma^{2}}$, (132) has two distinct roots $t_{1}<0<t_{2}$, given by

[TABLE]

where the discriminant $\Gamma$ of (132) is

[TABLE]

Proof.

We verify that $\Gamma>0$ for any $\eta>0$ and $0<s\leq\frac{2\eta}{\sigma^{2}}$. That $\Gamma>0$ is less obvious than (125) because of the subtle difference between (124) and (135) in the negative sign of $a$. Note that $\Gamma$ in (135) is a quadratic function of $s$, and the discriminant of this quadratic is given by

[TABLE]

Hence, in general, (135) has two roots (distinct when $\eta\neq\frac{a^{2}-1}{2a}$ ) and $\Gamma$ could be positive or negative. However, an analysis of two cases $(-a+\eta)^{2}-1\geq 0$ and $(-a+\eta)^{2}-1<0$ reveals that $\Gamma>0$ for any $\eta>0$ and $0<s\leq\frac{2\eta}{\sigma^{2}}$ . Therefore, (132) has two distinct roots $t_{1}<t_{2}$ given in (133) and (134) above. From (132), we have $t_{1}t_{2}=\frac{\beta_{1}}{2\sigma^{2}}$ , which is negative for $0<s\leq\frac{2\eta}{\sigma^{2}}$ . Therefore, we have $t_{1}<0<t_{2}$ . ∎

Property B2

For any $\eta>0$ and $0<s\leq\frac{2\eta}{\sigma^{2}}$ , the sequence $\frac{\beta_{\ell}-t_{1}}{\beta_{\ell}-t_{2}}$ is a geometric sequence with common ratio

[TABLE]

In addition, for any $\eta>0$ and $0<s\leq\frac{2\eta}{\sigma^{2}}$ , we also have

[TABLE]

It follows immediately that

[TABLE]

Proof.

Similar to that of Property A2 above for $\alpha_{\ell}$ . ∎

Property B3

For any $\eta>0$ and $0<s\leq\frac{2\eta}{\sigma^{2}}$ , we have $\beta_{\ell}\leq\beta_{1}<0$ , and $\beta_{\ell}$ decreases to $t_{1}$ geometrically:

[TABLE]

Proof.

This can be verified using (139) and (140) by noticing that $t_{2}-t_{1}>0$ and that for $0<s\leq\frac{2\eta}{\sigma^{2}}$ ,

[TABLE]

∎
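Properties B1 to B3 can be checked numerically. The sketch below is illustrative only: the recursion (21) is not restated in this appendix, so the form used here, $\beta_{1}=\frac{\sigma^{2}s^{2}-2\eta s}{2}$ and $\beta_{\ell}=\frac{[a^{2}+2\sigma^{2}s(\eta-a)]\beta_{\ell-1}+\beta_{1}}{1-2\sigma^{2}\beta_{\ell-1}}$ , is an assumption chosen to be consistent with $t_{1}t_{2}=\frac{\beta_{1}}{2\sigma^{2}}$ above. The function names and the parameter values are hypothetical.

```python
import math

def beta_params(a, sigma2, eta, s):
    """Return (beta_1, linear coefficient) of the ASSUMED recursion for beta_l."""
    beta1 = (sigma2 * s * s - 2.0 * eta * s) / 2.0
    coeff = a * a + 2.0 * sigma2 * s * (eta - a)
    return beta1, coeff

def beta_sequence(a, sigma2, eta, s, length):
    """Iterate beta_l = (coeff * beta_{l-1} + beta_1) / (1 - 2 sigma^2 beta_{l-1})."""
    beta1, coeff = beta_params(a, sigma2, eta, s)
    betas = [beta1]
    for _ in range(length - 1):
        prev = betas[-1]
        betas.append((coeff * prev + beta1) / (1.0 - 2.0 * sigma2 * prev))
    return betas

def fixed_points(a, sigma2, eta, s):
    """Roots t1 < t2 of 2 sigma^2 x^2 + (coeff - 1) x + beta_1 = 0."""
    beta1, coeff = beta_params(a, sigma2, eta, s)
    root = math.sqrt((coeff - 1.0) ** 2 - 8.0 * sigma2 * beta1)
    t1 = (-(coeff - 1.0) - root) / (4.0 * sigma2)
    t2 = (-(coeff - 1.0) + root) / (4.0 * sigma2)
    return t1, t2

# Hypothetical parameters with 0 < s < 2 * eta / sigma2.
a, sigma2, eta, s = 1.2, 1.0, 0.5, 0.5
t1, t2 = fixed_points(a, sigma2, eta, s)
betas = beta_sequence(a, sigma2, eta, s, 8)

# (beta_l - t1) / (beta_l - t2) should be geometric; for a map of this
# form the common ratio is (1 - 2 sigma^2 t2) / (1 - 2 sigma^2 t1).
cross = [(b - t1) / (b - t2) for b in betas]
observed_ratios = [cross[i + 1] / cross[i] for i in range(len(cross) - 1)]
expected_ratio = (1.0 - 2.0 * sigma2 * t2) / (1.0 - 2.0 * sigma2 * t1)

print(t1, t2, expected_ratio)
print(betas)
```

Under these assumptions the printed sequence starts at a negative $\beta_{1}$ and decreases toward the negative root $t_{1}$ , while the consecutive ratios of $\frac{\beta_{\ell}-t_{1}}{\beta_{\ell}-t_{2}}$ stay constant.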

Property B4

For any constant $a>1$ , the two thresholds $\eta_{1}$ and $\eta_{2}$ , defined in (37) and (38), respectively, satisfy the following.

1. When $0<\eta\leq\eta_{1}$ , the root $t_{1}$ in (133) is an increasing function in $s\in\mathcal{I}_{\eta}$ .

2. When $\eta\geq\eta_{2}$ , $t_{1}$ is a decreasing function in $s\in\mathcal{I}_{\eta}$ .

3. When $\eta_{1}<\eta<\eta_{2}$ , $t_{1}$ is a decreasing function in $s\in(0,s^{\star})$ and an increasing function in $s\in\left(s^{\star},\frac{2\eta}{\sigma^{2}}\right)$ , where $s^{\star}$ is the unique solution in the interval $\mathcal{I}_{\eta}$ to

[TABLE]

and $s^{\star}$ is given by

[TABLE]

Proof.

Using (133) and (135), we compute the derivatives of $t_{1}$ as follows:

[TABLE]

To simplify notations, denote by $L(s)$ the first derivative:

[TABLE]

From (145), we have

[TABLE]

and

[TABLE]

where $\eta_{2}^{\prime}$ is given by

[TABLE]

Since $L(s)$ is an increasing function in $s$ due to (146), to determine the monotonicity of $t_{1}$ , we only need to consider the following three cases.

a) When $L(0)\geq 0$ , or equivalently, $0<\eta\leq\eta_{1}$ , we have $L(s)\geq 0$ for any $s\in\mathcal{I}_{\eta}$ . Hence, $t_{1}$ is an increasing function in $s$ .

b) When $L\left(\frac{2\eta}{\sigma^{2}}\right)\leq 0$ , we have $L(s)\leq 0$ for any $s\in\mathcal{I}_{\eta}$ . Hence, $t_{1}$ is a decreasing function in $s$ . We now show that $L\left(\frac{2\eta}{\sigma^{2}}\right)\leq 0$ is equivalent to $\eta\geq\eta_{2}$ . When $\eta\in\left(\frac{a-1}{2},\frac{a+1}{2}\right)$ , we have $L\left(\frac{2\eta}{\sigma^{2}}\right)>0$ by (149) and $\eta>0$ . When $\eta\in\left(0,\frac{a-1}{2}\right)\cup\left(\frac{a+1}{2},+\infty\right)$ , it is easy to see from (149) that $L\left(\frac{2\eta}{\sigma^{2}}\right)\leq 0$ is equivalent to $\eta\in[\eta_{2}^{\prime},a/2]\cup[\eta_{2},+\infty)$ . Hence, the equivalent condition for $L\left(\frac{2\eta}{\sigma^{2}}\right)\leq 0$ is $\eta\in[\eta_{2},+\infty)$ .

c) When $L(0)<0$ and $L\left(\frac{2\eta}{\sigma^{2}}\right)>0$ , or equivalently, $\eta\in(\eta_{1},\eta_{2})$ , solving (143) using (145) yields (144). Since $L(s)$ is monotonically increasing due to (146), we know that $s^{\star}$ given by (144) is the unique solution to (143) in $\mathcal{I}_{\eta}$ , and $L(s)\leq 0$ for $s\in(0,s^{\star}]$ and $L(s)>0$ for $s\in(s^{\star},2\eta/\sigma^{2})$ . ∎

A-D Proof of Lemma 1

Proof.

We first show the monotone decreasing property. The set $\mathcal{S}_{n+1}^{+}$ contains all $s>0$ such that $\alpha_{1},\ldots,\alpha_{n},\alpha_{n+1}$ are all less than $\frac{1}{2\sigma^{2}}$ , while the set $\mathcal{S}_{n}^{+}$ contains all $s>0$ such that $\alpha_{1},\ldots,\alpha_{n}$ are all less than $\frac{1}{2\sigma^{2}}$ ; hence $\mathcal{S}_{n+1}^{+}\subseteq\mathcal{S}_{n}^{+}$ . The same argument yields the conclusion for $\mathcal{S}_{n}^{-}$ .

We then prove that $\mathcal{S}_{\infty}^{+}=\left(0,2\eta/\sigma^{2}\right]$ . Property A4 above in Appendix A-B implies that for any $0<s\leq 2\eta/\sigma^{2}$ , we have $\alpha_{\ell}\leq 0<\frac{1}{2\sigma^{2}}$ . Hence $\left(0,2\eta/\sigma^{2}\right]\subseteq\mathcal{S}_{n}^{+}$ for any $n\geq 1$ . To show the other direction, it suffices to show that for any $s>\frac{2\eta}{\sigma^{2}}$ , there exists $n\in\mathbb{N}$ such that $\alpha_{n}\geq\frac{1}{2\sigma^{2}}$ . Let $\ell^{\star}$ be the integer defined in Property A4 above. Then, $\ell^{\star}$ satisfies the following two conditions

[TABLE]

We show that $\alpha_{\ell^{\star}}\geq\frac{1}{2\sigma^{2}}$ , which would complete the proof. Due to $r_{2}-r_{1}>0$ , using (129) and (152), we have

[TABLE]

where (155) follows by plugging (122), (123) and (126) into (154). (Footnote 2: It is pretty amazing that (155) is in fact an equality.)

Finally, to show (31), for any $0<s\leq 2\eta/\sigma^{2}$ , we have $\beta_{\ell}\leq 0<\frac{1}{2\sigma^{2}},~{}\forall\ell\geq 1$ , hence $\left(0,2\eta/\sigma^{2}\right]\subseteq\mathcal{S}_{\infty}^{-}$ . The reverse inclusion does not hold in general. For example, take $a=1.2$ , $\sigma^{2}=1$ , $\eta=0.15$ and $s=0.35>\frac{2\eta}{\sigma^{2}}$ , for which the sequence $\beta_{\ell}$ increases monotonically to $t_{1}\approx 0.0411<\frac{1}{2\sigma^{2}}$ . Hence, in this case, $0.35\in\mathcal{S}_{\infty}^{-}$ but $0.35\not\in\left(0,\frac{2\eta}{\sigma^{2}}\right]$ . ∎
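The counterexample above can be reproduced with a few lines of code. This is a sketch under an assumption: the recursion (21) is not restated here, and the form used below ( $\beta_{1}=\frac{\sigma^{2}s^{2}-2\eta s}{2}$ , $\beta_{\ell}=\frac{[a^{2}+2\sigma^{2}s(\eta-a)]\beta_{\ell-1}+\beta_{1}}{1-2\sigma^{2}\beta_{\ell-1}}$ ) is a reconstruction that happens to reproduce the quoted limit $t_{1}\approx 0.0411$ . The function name is hypothetical.

```python
import math

def beta_sequence(a, sigma2, eta, s, length):
    """Iterate the ASSUMED recursion for beta_l (reconstruction of (21))."""
    beta1 = (sigma2 * s * s - 2.0 * eta * s) / 2.0
    coeff = a * a + 2.0 * sigma2 * s * (eta - a)
    betas = [beta1]
    for _ in range(length - 1):
        prev = betas[-1]
        betas.append((coeff * prev + beta1) / (1.0 - 2.0 * sigma2 * prev))
    return betas

# Parameters of the counterexample: s = 0.35 exceeds 2 * eta / sigma2 = 0.3.
a, sigma2, eta, s = 1.2, 1.0, 0.15, 0.35
betas = beta_sequence(a, sigma2, eta, s, 60)

# Smaller fixed point of 2 sigma^2 x^2 + (coeff - 1) x + beta_1 = 0.
beta1 = betas[0]
coeff = a * a + 2.0 * sigma2 * s * (eta - a)
t1 = (-(coeff - 1.0) - math.sqrt((coeff - 1.0) ** 2 - 8.0 * sigma2 * beta1)) / (4.0 * sigma2)

threshold = 1.0 / (2.0 * sigma2)
print(f"t1 = {t1:.4f}, beta_60 = {betas[-1]:.4f}, threshold = {threshold}")
```

With these values every $\beta_{\ell}$ stays below $\frac{1}{2\sigma^{2}}=0.5$ , which is the membership condition for $\mathcal{S}_{\infty}^{-}$ used in the argument.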

A-E Proof of Theorem 2

Proof.

Theorem 1 and Lemma 1 imply that for any $s\in\mathcal{I}_{\eta}$ ,

[TABLE]

Recall that $\alpha_{\ell}$ depends on $s$ . By (130), the continuity of the function $x\mapsto\log(1-x)$ and the Cesàro mean convergence, we have

[TABLE]

where $r_{1}$ depends on $s$ via (122). Since (157) holds for any $s\in\mathcal{I}_{\eta}$ , using Property A5 in Appendix A-B above and supremizing (157) over $s\in\mathcal{I}_{\eta}$ , we obtain (33). Specifically, the supremum of (157) over $s\in\mathcal{I}_{\eta}$ is achieved in the limit of $s$ going to the right end point $2\eta/\sigma^{2}$ . Plugging $s=2\eta/\sigma^{2}$ into (122), we obtain the corresponding value for $r_{1}$ :

[TABLE]

which is further substituted into (157) to yield (33).

Similarly, to show (34), using Property B3 in Appendix A-C above, we have

[TABLE]

Then, by Property B4 in Appendix A-C above, the supremizer $s^{\prime}$ in (159) is given by

[TABLE]

where $s^{\star}$ is given by (144). Plugging (160) into (159) yields (34).

Finally, the bound (35) follows from (33) and (34), since

[TABLE]

and

[TABLE]

∎

A-F Proof of Theorem 3

Proof.

For any sequence $\eta_{n}$ , the proof of Theorem 1 in Appendix A-A above remains valid with $\alpha_{\ell}$ replaced by $\alpha_{n,\ell}$ defined in (40) in Section III-C above. We present the proof of (42), and omit that of (43), which is similar. In this regime, for each $n\geq 1$ , the proof of Lemma 1 implies that

[TABLE]

Then, in (24), we choose

[TABLE]

First, using (122)-(123), (126) and the choice (165), we can determine the asymptotic behavior of quantities involved in determining $\alpha_{n,\ell}$ in (128) and (129) (with $\eta$ replaced by $\eta_{n}$ and $s$ replaced by $s_{n}$ ), summarized in TABLE I.

We make two remarks before proceeding further. It can be easily verified from (126) that the common ratio $q$ is a constant belonging to $(0,1)$ and

[TABLE]

Hence, for all large $n$ , $q$ is bounded by positive constants between 0 and 1. Besides, from (122), we have

[TABLE]

Second, from (128), (24) and the choice (165), we have

[TABLE]

where $r_{1},r_{2}$ and $q$ in this regime depend on $\eta_{n}$ with order dependence given in TABLE I above. Using the inequality $\log(1-x)\geq\frac{x}{x-1},~{}\forall x\in(0,1)$ , we have

[TABLE]
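For completeness, the elementary inequality $\log(1-x)\geq\frac{x}{x-1}$ invoked in this step follows from the standard bound $\log y\leq y-1$ for $y>0$ , applied with $y=\frac{1}{1-x}$ :

$$
-\log(1-x)=\log\frac{1}{1-x}\leq\frac{1}{1-x}-1=\frac{x}{1-x}
\quad\Longrightarrow\quad
\log(1-x)\geq-\frac{x}{1-x}=\frac{x}{x-1},\qquad x\in(0,1).
$$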

Since $1-2\sigma^{2}r_{2}>0$ due to (123), we can further bound $P^{+}(n,a,\eta_{n})$ as

[TABLE]

where in the last step we used the results in TABLE I. Due to the assumption (41) on $\eta_{n}$ and (167), we obtain (42). ∎

A-G Proof of Theorem 4

Proof.

We point out the proof changes in generalizing our results to the sub-Gaussian case. There are two changes to be made in the proof of Theorem 1 in Appendix A-A above: the equality from (114) to (115) is replaced by $\leq$ since $Z_{n}$ is $\sigma$ -sub-Gaussian; the equality in (118) is replaced by $\leq$ due to Lemma 2. The rest of the proof for Theorem 1 remains the same for the sub-Gaussian case. Since Lemma 1 and Theorems 2, 3 depend only on the properties of the sequences $\alpha_{\ell}$ and $\beta_{\ell}$ , and (24)-(25) continue to hold for sub-Gaussian $Z_{n}$ ’s, the proofs of Lemma 1 and Theorems 2, 3 remain exactly the same for the sub-Gaussian case. ∎

Appendix B

B-A Proof of Lemma 3

Proof.

In view of (62), we take the variances of both sides of (59) to obtain

[TABLE]

Note that $\lim_{n\rightarrow\infty}\theta_{n}=\theta$ , where $\theta>0$ is the water level given by (54). Applying Theorem 9 in Section V-A to (173) with the function

[TABLE]

which is continuous at $t=0$ , we obtain (65). ∎

B-B An Integral

We present the computation of an interesting integral that is useful in obtaining the value of $\mathbb{R}_{U}(d_{\mathrm{max}})$ .

Lemma 6.

For any constant $r\in[-1,1]$ , it holds that

[TABLE]

Proof.

Denote

[TABLE]

By Leibniz’s rule for differentiation under the integral sign, we have

[TABLE]

With the change of variable $u=\tan\left(w/2\right)$ and partial-fraction decomposition, we obtain the closed-form solution to the integral in (178):

[TABLE]

It can be easily verified by directly taking derivatives that the right-side of (175) is indeed the antiderivative of (179). ∎

B-C Derivation of $\mathbb{R}_{U}(d_{\mathrm{max}})$ in (74)

We present two ways to obtain (74). The first one is to directly use (96) in Section V-A. For $\theta=\theta_{\max}$ , we have $\mathbb{R}_{\text{K}}(d_{\mathrm{max}})=0$ in (95), then (74) immediately follows from (96). The second method relies on (53). For $\theta=\theta_{\max}$ , observe from (53) that

[TABLE]

Then, computing the integral (180) using Lemma 6 in Appendix B-B yields (74).

B-D Proof of Lemma 5

Proof.

The bound (105) is obtained by partitioning $\mathsf{F}^{\prime}\mathsf{F}$ (52) into its leading principal submatrix of order $n-1$ and then applying the Cauchy interlacing theorem to that partition, see [48, Lem. 1] for details. To obtain (107), observe from (93)

[TABLE]

Combining (181) and (105) yields

[TABLE]

where

[TABLE]

Plugging (106) into (183) and then taking the limit, we obtain

[TABLE]

where the last equality is due to Lemma 6 in Appendix B-B above. In the rest of the proof, we obtain the following refinement of (185): for any $n\geq 1$ ,

[TABLE]

where $c_{1}$ and $c_{2}$ are the constants given by (108) and (109) in Lemma 5, respectively. Then, (107) will follow directly from (182), (186) and (187).

The proofs of the refinements (186) and (187) are similar, and both are based on the elementary relations between Riemann sums and their corresponding integrals. We present the proof of (186), and omit that of (187). Note that the function $h(w)\triangleq\frac{1}{\pi}\log(1+a^{2}-2a\cos(w))$ is an increasing function in $w\in[0,\pi]$ , and its derivative is bounded above by $M_{1}\triangleq\frac{2a}{\pi(a^{2}-1)}$ for any fixed $a>1$ . Therefore, from (106) and (183), we have

[TABLE]

and (186) follows immediately. ∎
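The two facts about $h(w)$ used in this proof can be checked numerically. The sketch below is an illustration, not the paper's computation: it uses a generic left Riemann sum on a uniform grid rather than the specific sums in (183), and it relies on the classical identity $\int_{0}^{\pi}\log(1+a^{2}-2a\cos w)\,dw=2\pi\log a$ for $a>1$ . All function names are hypothetical.

```python
import math

def h(w, a):
    """h(w) = (1/pi) * log(1 + a^2 - 2 a cos w)."""
    return math.log(1.0 + a * a - 2.0 * a * math.cos(w)) / math.pi

def h_prime(w, a):
    """Derivative of h with respect to w."""
    return 2.0 * a * math.sin(w) / (math.pi * (1.0 + a * a - 2.0 * a * math.cos(w)))

def left_riemann_sum(a, n):
    """Left Riemann sum of h over [0, pi] with n equal subintervals."""
    step = math.pi / n
    return step * sum(h(i * step, a) for i in range(n))

a = 1.2
exact_integral = 2.0 * math.log(a)            # classical closed form for a > 1
m1 = 2.0 * a / (math.pi * (a * a - 1.0))      # claimed bound on the derivative

# Largest slope of h on a fine grid; it should not exceed M_1.
grid = 100000
max_slope = max(h_prime(k * math.pi / grid, a) for k in range(grid + 1))

# For an increasing integrand the left sum underestimates the integral
# by at most step * (h(pi) - h(0)), which is of order 1/n.
gaps = {n: exact_integral - left_riemann_sum(a, n) for n in (10, 100, 1000)}
gap_bounds = {n: (math.pi / n) * (h(math.pi, a) - h(0.0, a)) for n in gaps}

print(max_slope, m1)
for n in gaps:
    print(n, gaps[n], gap_bounds[n])
```

The gap shrinks roughly tenfold each time $n$ grows tenfold, which is the $O(1/n)$ behavior that the refinements (186) and (187) quantify.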

B-E Proof of Theorem 10

Proof.

From Lemma 5, we know that $\alpha^{\prime}=0<\alpha$ (recall (97) and (99)). Since $g(w)$ is an even function, we have

[TABLE]

Denote the maximum absolute value of $F$ over the interval (100) by $T>0$ . It is easy to check that the function $F(g(w))$ is $2aL$ -Lipschitz since $F(\cdot)$ is $L$ -Lipschitz and the derivative of $g(w)$ is bounded by $2a$ . For the following Riemann sum

[TABLE]

the Lipschitz property implies that

[TABLE]

For $i\geq 2$ , rewrite (106) and (105) as

[TABLE]

Denote the sum in (111) as

[TABLE]

Then, separating $F(\mu_{n,1})$ from $Q_{n}$ and applying (193), we have

[TABLE]

Therefore, there is a constant $C_{L}>0$ depending on $L$ and $T$ such that (111) holds. ∎

Appendix C

We gather the frequently used notations in this section as follows. For any given distortion threshold $d>0$ ,

•

let $\theta>0$ be the water level corresponding to $d$ in the limiting reverse waterfilling (54);

•

for each $n\geq 1$ , let $\theta_{n}$ be the water level corresponding to $d$ in the $n$ -th order reverse waterfilling (51);

•

let $d_{n}$ be the distortion associated to the water level $\theta$ in the $n$ -th order reverse waterfilling (51).

For clarity, we explicitly write down the relations between $d$ and $\theta_{n}$ , and between $d_{n}$ and $\theta$ :

[TABLE]

where $\sigma_{n,i}^{2}$ ’s are given in (60). Note that $d$ and $\theta$ are constants independent of $n$ , while $d_{n}$ and $\theta_{n}$ are functions of $n$ , and there is no direct reverse waterfilling relation between $d_{n}$ and $\theta_{n}$ . Applying Theorem 9 in Section V-A above to the function $t\mapsto\min(\theta,\sigma^{2}/t)$ , we have

[TABLE]

and

[TABLE]

Theorem 10 in Section V-B then implies that the speed of convergence in (199) and (200) is in the order of $1/n$ .
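The distinction between $(d,\theta_{n})$ and $(d_{n},\theta)$ can be made concrete with a small reverse waterfilling routine. This is a minimal sketch assuming the standard form $d=\frac{1}{n}\sum_{i}\min(\theta,\sigma_{i}^{2})$ ; the variances in the example are made-up numbers, not the $\sigma_{n,i}^{2}$ of (60), and the function names are hypothetical.

```python
def distortion_at_level(theta, variances):
    """Distortion produced by water level theta: (1/n) * sum_i min(theta, var_i)."""
    return sum(min(theta, v) for v in variances) / len(variances)

def water_level(d, variances, iterations=200):
    """Water level theta with distortion_at_level(theta) = d, found by bisection.

    Assumes 0 < d <= mean(variances); the map theta -> distortion is
    nondecreasing, so bisection on [0, max(variances)] applies.
    """
    lo, hi = 0.0, max(variances)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if distortion_at_level(mid, variances) < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical variances for n = 3 and a target distortion d.
variances = [4.0, 1.0, 0.25]
d = 0.5
theta_n = water_level(d, variances)                   # plays the role of theta_n
d_from_level = distortion_at_level(0.7, variances)    # plays the role of d_n for theta = 0.7
print(theta_n, d_from_level)
```

In this toy instance the third component lies below the water level, so only the first two are clipped, and solving $(2\theta+0.25)/3=0.5$ by hand gives the same level as the bisection.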

C-A Expectation and Variance of the $\mathsf{d}$ -tilted Information

Proposition 1.

For any $d\in(0,d_{\mathrm{max}})$ and $n\geq 1$ , let $d_{n}$ be defined in (198) above. Then, the expectation and variance of the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(U_{1}^{n},d_{n})$ at distortion level $d_{n}$ satisfy

[TABLE]

where $\mathbb{R}_{U}(d)$ is the rate-distortion function given in (53), $\mathbb{V}_{U}(d)$ is the informational dispersion given in (65) and $C_{1}$ , $C_{2}$ are positive constants.

Proof.

Using the same derivation as that of (59), one can obtain the following representation of the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(U_{1}^{n},d_{n})$ at distortion level $d_{n}$ :

[TABLE]

where $X_{1}^{n}$ is the decorrelation of $U_{1}^{n}$ defined in (61). Note that the difference between (59) and (203) is that $\theta_{n}$ is replaced by $\theta$ . Using (62) and taking expectations and variances of both sides of (203), we arrive at

[TABLE]

Applying Theorem 10 in Section V-B to (204) with the function $F_{\text{G}}(t)$ defined in (103) yields (201). Similarly, applying Theorem 10 to (205) with the function (174) yields (202). ∎

Proposition 1 is one of the key lemmas that will be used in both converse and achievability proofs. Proposition 1 and its proof are similar to those of [48, Eq. (95)–(96)]. The difference is that we apply Theorem 10, which is the nonstationary version of [48, Th. 4], to a different function in (204).

C-B Approximation of the $\mathsf{d}$ -tilted Information

The following proposition gives a probabilistic characterization of the accuracy of approximating the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}\left(U_{1}^{n},d\right)$ at distortion level $d$ using the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}\left(U_{1}^{n},d_{n}\right)$ at distortion level $d_{n}$ .

Proposition 2.

For any $d\in(0,d_{\mathrm{max}})$ , there exists a constant $\tau>0$ (depending on $d$ only) such that for all $n$ large enough

[TABLE]

where $d_{n}$ is defined in (198).

Proof.

The proof in [48, App. D-B] works for the nonstationary case as well, since the proof [48, App. D-B] only relies on the convergences in (199) and (200) being both in the order of $1/n$ , which continues to hold for the nonstationary case. ∎

Remark 9.

The following high probability set is used in our converse and achievability proofs:

[TABLE]

Proposition 2 implies that $\mathbb{P}[\mathcal{A}]\geq 1-1/n$ for all $n$ large enough.

Appendix D Converse Proof

Proof of Theorem 6.

Using the general converse by Kostina and Verdú [26, Th. 7] and our established Propositions 1 and 2 in Appendix C, the proof is the same as the converse proof in the asymptotically stationary case [48, Th. 7, Eq. (97)–(109)]. For completeness, we give a proof sketch. Choosing $\gamma=(\log n)/2$ and setting $X$ to be $U_{1}^{n}$ in [26, Th. 7], we know that any $(n,M,d,\epsilon)$ code for the Gauss-Markov source must satisfy

[TABLE]

By conditioning on the high probability set $\mathcal{A}$ defined in Remark 9 above, we can further bound $\epsilon$ from below by

[TABLE]

From (203), we know that $\jmath_{U_{1}^{n}}(U_{1}^{n},d_{n})$ is a sum of independent random variables, whose mean and variance are bounded (within the order of $1/n$ due to Proposition 1) by the rate-distortion function $\mathbb{R}_{U}(d)$ and the informational dispersion $\mathbb{V}_{U}(d)$ . Choosing $M$ as in [48, Eq. (103)] and applying the Berry-Esseen theorem to $\jmath_{U_{1}^{n}}(U_{1}^{n},d_{n})$ , we obtain the converse in Theorem 6. ∎

Appendix E Achievability Proof

Proof of Theorem 7.

With our lossy AEP for the nonstationary Gauss-Markov source and Propositions 1 and 2, the proof is similar to the one for the stationary Gauss-Markov source in [48, Sec. V-C]. Here, we streamline the proof. As elucidated in Section IV-E above, the standard random coding argument [26, Cor. 11] implies that for any $n$ , there exists an $(n,M,d,\epsilon^{\prime})$ code such that

[TABLE]

Choosing $V_{1}^{n}$ to be $V_{1}^{\star n}$ (the random variable that attains the minimum in (48) with $X_{1}^{n}$ there replaced by $U_{1}^{n}$ ), the bound (210) can be relaxed to

[TABLE]

To simplify notations, in the following, we denote by $C$ a constant that might be different from line to line. Given any constant $\epsilon\in(0,1)$ , define $\epsilon_{n}$ as

[TABLE]

where $q(n)$ is defined in (83) above. Note that for all $n$ large enough, we have $\epsilon_{n}\in(0,1)$ . We choose $M$ as

[TABLE]

where $p(n)$ is defined in (82) and $\tau$ is from Proposition 2 above. We also define the random variable $G_{n}$ as

[TABLE]

where $d_{n}$ is defined in (198) above. Note that all the randomness in $G_{n}$ is from $U_{1}^{n}$ , hence we will also use the notation $G_{n}(u_{1}^{n})$ to indicate one realization of the random variable $G_{n}$ . By bounding the deterministic part, that is, $\log M$ , of $G_{n}$ using Proposition 1, we know that with probability 1,

[TABLE]

where we use $\mathbb{E}$ and $\mathbb{V}$ to denote the expectation and variance of the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(U_{1}^{n},d_{n})$ at distortion level $d_{n}$ . Define the set $\mathcal{G}_{n}$ as

[TABLE]

Then, in view of (203), the $\mathsf{d}$ -tilted information $\jmath_{U_{1}^{n}}(U_{1}^{n},d_{n})$ is a sum of independent random variables with bounded moments, and we apply the Berry-Esseen theorem to obtain

[TABLE]

We define one more set $\mathcal{L}_{n}$ as

[TABLE]

Then, by the lossy AEP in Lemma 4 in Section IV-E above and Proposition 2, we have

[TABLE]

Finally, for any constant $\epsilon\in(0,1)$ and $n$ large enough, we define $\epsilon_{n}$ as in (212) above and set $M$ as in (213). Then, there exists an $(n,M,d,\epsilon^{\prime})$ code such that

[TABLE]

where the last inequality is due to the definition of $\mathcal{L}_{n}$ and (219). By further conditioning on $\mathcal{G}_{n}$ , we conclude that there exists an $(n,M,d,\epsilon^{\prime})$ code such that

[TABLE]

Therefore, by the choice of $M$ in (213), the minimum achievable source coding rate $R(n,d,\epsilon)$ must satisfy

[TABLE]

for all $n$ large enough, where $K_{1}>0$ is a universal constant and $K_{2}$ is a constant depending on $\epsilon$ . Here we change from $Q^{-1}(\epsilon_{n})$ to $Q^{-1}(\epsilon)$ using a Taylor expansion. Therefore, Theorem 7 follows immediately from (224) with the choices of $p(n)$ and $q(n)$ given by (82) and (83), respectively, in the lossy AEP in Lemma 4 in Section IV-E above. We have $O(\cdot)$ in (78) since $K_{2}$ could be positive or negative. ∎

Appendix F Proof of Lossy AEP

F-A Notations

For the optimization problem $\mathbb{R}(A_{1}^{n},B_{1}^{n},d)$ in (86), the generalized tilted information defined in [26, Eq. (28)] in $a_{1}^{n}$ (a realization of $A_{1}^{n}$ ) is given by

[TABLE]

where $\delta>0$ and $d\in(0,d_{\mathrm{max}})$ . For properties of the generalized tilted information, see [26, App. D]. For clarity, we list the notations used throughout this section:

1. $X_{1}^{n}$ denotes the decorrelation of $U_{1}^{n}$ defined in (61);

2. $\hat{X}_{1}^{n}$ is the proxy random variable of $X_{1}^{n}$ defined in Definition 2 in Section IV-F above;

3. For $Y_{1}^{\star n}$ that achieves $\mathbb{R}_{X_{1}^{n}}(d)$ in (48), $\hat{F}_{1}^{\star n}$ is the random vector that achieves $\mathbb{R}\left(\hat{X}_{1}^{n},Y_{1}^{\star n},d\right)$ ;

4. We denote by $\lambda^{\star}_{n}$ the negative slope of $\mathbb{R}_{X_{1}^{n}}(d)$ (the same notation used in (58)):

[TABLE]

It is shown in [48, Lem. 5] that $\lambda^{\star}_{n}$ is related to the $n$ -th order water level $\theta_{n}$ in (51) by

[TABLE]

Given any source outcome $u_{1}^{n}$ , let $x_{1}^{n}$ be the decorrelation of $u_{1}^{n}$ . Define $\hat{\lambda}_{n}$ as the negative slope of $\mathbb{R}(\hat{X}_{1}^{n},Y_{1}^{\star n},d)$ w.r.t. $d$ :

[TABLE]

5. Comparing the definitions of $\mathsf{d}$ -tilted information and the generalized tilted information, one can see that [48, Eq. (18)]

[TABLE]

6. Recalling (62) and applying the reverse waterfilling result [68, Th. 10.3.3], we know that the coordinates of $Y_{1}^{\star n}$ are independent and satisfy

[TABLE]

where

[TABLE]

with $\theta_{n}>0$ given in (197).

F-B Parametric Representation of the Gaussian Conditional Relative Entropy Minimization

Various aspects of the optimization problem (86) have been discussed in [48, Sec. II-B]. In particular, let $B_{1}^{\star n}$ be the optimizer of $\mathbb{R}_{A_{1}^{n}}(d)$ , then we have

[TABLE]

where $\mathbb{R}_{A_{1}^{n}}(d)$ is in (48). Another useful result on the optimization problem (86) is the following: when $A_{1}^{n}$ and $B_{1}^{n}$ are independent Gaussian random vectors, the next theorem gives parametric characterizations for the optimizer and optimal value of (86).

Theorem 11.

Let $A_{1},\ldots,A_{n}$ be independent random variables with

[TABLE]

and $B_{1},\ldots,B_{n}$ be independent random variables with

[TABLE]

For any $d$ such that

[TABLE]

we have the following parametric representation for $\mathbb{R}(A_{1}^{n},B_{1}^{n},d)$ :

[TABLE]

where $\lambda>0$ is the parameter. Furthermore, $\lambda$ equals the negative slope of $\mathbb{R}(A_{1}^{n},B_{1}^{n},d)$ w.r.t. $d$ :

[TABLE]

Similar results to Theorem 11 have appeared previously in the literature [43, 24, 38]. See [38, Example 1 and Th. 2] for the case of $n=1$ . For completeness, we present a proof.

Proof.

Fix any $d$ that satisfies (235), and let $\lambda$ be such that (237) is satisfied. Note from (237) that $d$ is a strictly decreasing function in $\lambda$ (unless $\beta_{i}=0$ for all $i\in[n]$ ), hence such $\lambda$ is unique. The upper bound on $d$ in (235) guarantees that $\lambda>0$ . We first show the $\leq$ direction in (236). For $A_{1}^{n}=a_{1}^{n}\in\mathbb{R}^{n}$ , define the conditional distribution $P_{F_{i}|A_{i}=a_{i}}(f_{i})$ as

[TABLE]

We then define the joint distribution $P_{A_{1}^{n},F_{1}^{n}}$ as

[TABLE]

Using (237), we can check that with such a choice of $P_{A_{1}^{n},F_{1}^{n}}$ , the expected distortion between $A_{1}^{n}$ and $F_{1}^{n}$ equals $d$ . The details follow.

[TABLE]

where (243) is from the relation $\mathbb{E}[(X-t)^{2}]=\text{Var}[X]+(\mathbb{E}[X]-t)^{2}$ and (244) is due to (237). Therefore, the choice of $P_{F_{1}^{n}|A_{1}^{n}}$ in (239) and (240) is feasible for the optimization problem in defining $\mathbb{R}(A_{1}^{n},B_{1}^{n},d)$ . Hence,

[TABLE]

It is straightforward to verify that the Kullback-Leibler divergence between two Gaussian distributions $X\sim\mathcal{N}(\mu_{X},\sigma_{X}^{2})$ and $Y\sim\mathcal{N}(\mu_{Y},\sigma_{Y}^{2})$ is given by

[TABLE]
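As a sanity check on the Gaussian divergence formula, the sketch below compares the standard closed form $D=\frac{1}{2}\left[\log\frac{\sigma_{Y}^{2}}{\sigma_{X}^{2}}+\frac{\sigma_{X}^{2}+(\mu_{X}-\mu_{Y})^{2}}{\sigma_{Y}^{2}}-1\right]$ (in nats) with a direct numerical integration of $\int p\log(p/q)$ . The function names and the parameter values are hypothetical, and the closed form is written from the textbook expression rather than copied from (247).

```python
import math

def kl_gauss(mu_x, var_x, mu_y, var_y):
    """Closed-form D(N(mu_x, var_x) || N(mu_y, var_y)) in nats."""
    return 0.5 * (math.log(var_y / var_x) + (var_x + (mu_x - mu_y) ** 2) / var_y - 1.0)

def log_density(x, mu, var):
    """Log of the N(mu, var) density at x."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def kl_numeric(mu_x, var_x, mu_y, var_y, half_width=12.0, steps=20000):
    """Trapezoidal approximation of the integral of p * log(p / q)."""
    sd = math.sqrt(var_x)
    lo = mu_x - half_width * sd
    step = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * step
        log_p = log_density(x, mu_x, var_x)
        log_q = log_density(x, mu_y, var_y)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(log_p) * (log_p - log_q)
    return total * step

closed = kl_gauss(0.3, 1.5, -0.2, 0.8)
numeric = kl_numeric(0.3, 1.5, -0.2, 0.8)
print(closed, numeric)
```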

Using (247) and (239), we see that (246) equals the right-hand side of (236). To prove the other direction, we use the Donsker-Varadhan representation of the Kullback-Leibler divergence [69, Th. 3.5]:

[TABLE]

where the supremum is over all functions $g$ from the sample space to $\mathbb{R}$ such that both expectations in (248) are finite. Fix any $P_{F_{1}^{n}|A_{1}^{n}}$ such that $\mathbb{E}[\mathsf{d}\left(A_{1}^{n},F_{1}^{n}\right)]\leq d$ . For any $A_{1}^{n}=a_{1}^{n}$ , in (248), we choose $P$ to be $P_{F_{1}^{n}|A_{1}^{n}=a_{1}^{n}}$ , $Q$ to be $P_{B_{1}^{n}}$ and $g$ to be $g(f_{1}^{n})\triangleq-n\lambda\mathsf{d}(f_{1}^{n},a_{1}^{n})$ for any $f_{1}^{n}\in\mathbb{R}^{n}$ , then we have

[TABLE]
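The variational representation behind this step can be illustrated on a finite alphabet, where the supremum is attained at $g=\log\frac{dP}{dQ}$ and any other $g$ gives a lower bound on the divergence. The sketch below is a toy check with made-up distributions; it is not the Gaussian computation of the proof, and the function names are hypothetical.

```python
import math
import random

def kl_discrete(p, q):
    """D(P || Q) in nats for probability vectors on a common finite alphabet."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def dv_objective(g, p, q):
    """Donsker-Varadhan objective E_P[g] - log E_Q[exp(g)]."""
    mean_under_p = sum(pi * gi for pi, gi in zip(p, g))
    log_mgf_under_q = math.log(sum(qi * math.exp(gi) for qi, gi in zip(q, g)))
    return mean_under_p - log_mgf_under_q

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
divergence = kl_discrete(p, q)

# The log-likelihood ratio attains the supremum.
g_star = [math.log(pi / qi) for pi, qi in zip(p, q)]
value_at_optimum = dv_objective(g_star, p, q)

# Arbitrary test functions never exceed the divergence.
rng = random.Random(0)
random_values = [
    dv_objective([rng.uniform(-3.0, 3.0) for _ in p], p, q) for _ in range(1000)
]
print(divergence, value_at_optimum, max(random_values))
```

The proof uses only the lower-bound direction: a single, possibly suboptimal, choice of $g$ already yields an inequality.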

Taking expectations on both sides of (249) with respect to $P_{A_{1}^{n}}$ and then normalizing by $n$ , we have

[TABLE]

Using the formula for the moment generating function for noncentral $\chi^{2}$ distributions, we can compute

[TABLE]
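The per-coordinate identity underlying this computation is $\mathbb{E}\left[e^{-\lambda(B-a)^{2}}\right]=\frac{1}{\sqrt{1+2\lambda\beta^{2}}}\exp\left(-\frac{\lambda a^{2}}{1+2\lambda\beta^{2}}\right)$ for $B\sim\mathcal{N}(0,\beta^{2})$ and $\lambda>0$ , which is the noncentral $\chi^{2}$ moment generating function evaluated at a negative argument. The sketch below checks it by numerical integration; it assumes that $\beta^{2}$ denotes the variance of a zero-mean $B$ as in Theorem 11, and the function names and numerical values are hypothetical.

```python
import math

def mgf_closed_form(lam, a, beta2):
    """E[exp(-lam * (B - a)^2)] for B ~ N(0, beta2), closed form."""
    denom = 1.0 + 2.0 * lam * beta2
    return math.exp(-lam * a * a / denom) / math.sqrt(denom)

def mgf_numeric(lam, a, beta2, half_width=12.0, steps=20000):
    """Trapezoidal approximation of the same expectation."""
    sd = math.sqrt(beta2)
    step = 2.0 * half_width * sd / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width * sd + i * step
        density = math.exp(-x * x / (2.0 * beta2)) / math.sqrt(2.0 * math.pi * beta2)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * density * math.exp(-lam * (x - a) ** 2)
    return total * step

lam, a_value, beta2 = 0.7, 1.3, 2.0
closed = mgf_closed_form(lam, a_value, beta2)
numeric = mgf_numeric(lam, a_value, beta2)
print(closed, numeric)
```

Because the coordinates of $B_{1}^{n}$ are independent, the $n$ -dimensional expectation factorizes into a product of such terms, one per coordinate.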

Plugging (251) into (250) and using $\mathbb{E}[\mathsf{d}\left(A_{1}^{n},F_{1}^{n}\right)]\leq d$ , we conclude that $\mathbb{R}(A_{1}^{n},B_{1}^{n},d)$ is greater than or equal to the right-hand side of (236). Finally, (238) is obtained by taking derivative of (236) w.r.t. $d$ , where we need to use the chain rule for derivatives since $\lambda$ is a function of $d$ given by (237). ∎

Our next result states that for fixed $\beta_{i}^{2}$ ’s satisfying certain mild conditions, if we change the variances from $\alpha_{i}^{2}$ ’s to $\hat{\alpha}_{i}^{2}$ ’s, then the perturbation on the corresponding $\lambda$ ’s is controlled by the perturbation on $\alpha_{i}^{2}$ ’s.

Theorem 12 (Variance perturbation).

Let $\alpha_{i}^{2}$ ’s and $\beta_{i}^{2}$ ’s be in (233) and (234) above, respectively. For a fixed $d$ satisfying (235), let $\lambda$ be given by (237). Suppose that $\alpha_{i}^{2}$ ’s and $\beta_{i}^{2}$ ’s are such that both

[TABLE]

and

[TABLE]

are bounded above by positive constants. Let $\hat{A}_{1},\ldots,\hat{A}_{n}$ be independent random variables with

[TABLE]

Let $\hat{\lambda}$ be such that

[TABLE]

Then, there is a constant $C>0$ such that

[TABLE]

Proof.

We can view (237) as an equation of the form $f(\alpha_{1}^{2},\ldots,\alpha_{n}^{2},\lambda)=0$ . Then, by the implicit function theorem, we know that there exists a unique continuously differentiable function $h$ such that

[TABLE]

and

[TABLE]

Hence,

[TABLE]

By the assumptions (252) and (253), we know that there exists a constant $C>0$ such that

[TABLE]

Hence, we have

[TABLE]

∎

F-C Proof of Theorem 8

The proof is similar to [48, Th. 12]. We streamline the proof and point out the differences. We use the notations defined in Appendix F-A above.

Our Corollary 1 implies that for all $n$ large enough the condition (89) is violated with probability at most $2e^{-cn}$ for a constant $c>\log(a)/2$ . This is much stronger than the bound $\Theta\left(1/\text{poly}\log n\right)$ in the stationary case [48, Th. 6].

In view of (62), the random variables $X_{i}/\sigma_{n,i}$ for $i=1,\ldots,n$ , are i.i.d. standard normal, and their $2k$ -th moments equal $(2k-1)!!$ . The Berry-Esseen theorem implies that the condition (90) is violated with probability at most $\Theta\left(1/\sqrt{n}\right)$ . This is the same as in the stationary case [48, Eq. (279)–(280)].

We use the following procedure to show that the condition (91) is violated with probability at most $\Theta\left(1/\log n\right)$ :

•

We approximate $m_{i}(u_{1}^{n})$ by another random variable $\bar{m}_{i}(u_{1}^{n})$ that is easier to analyze.

•

We show that (91) with $m_{i}(u_{1}^{n})$ replaced by $\bar{m}_{i}(u_{1}^{n})$ holds with probability at least $1-\Theta(1/\log n)$ .

•

We then control the difference between $m_{i}(u_{1}^{n})$ and $\bar{m}_{i}(u_{1}^{n})$ .

To carry out the above program, we first give an expression for $m_{i}(u_{1}^{n})$ by applying [48, Lem. 4] (see also the proof of Theorem 11) on $\mathbb{R}(\hat{X}_{1}^{n},Y_{1}^{\star n},d)$ . Note that $\hat{X}_{1}^{n}$ and $Y_{1}^{\star n}$ are Gaussian random vectors with independent coordinates with variances given by (85) and (230), respectively. Then, [48, Lem. 4] implies that the optimizer $P_{\hat{F}_{1}^{\star n}|\hat{X}_{1}^{n}}$ for $\mathbb{R}(\hat{X}_{1}^{n},Y_{1}^{\star n},d)$ satisfies

[TABLE]

where the conditional distributions $\hat{F}_{i}^{\star}|\hat{X}_{i}=\hat{x}_{i}$ are Gaussian:

[TABLE]

where $\nu_{n,i}^{2}$ ’s are defined in (231), and $\hat{\lambda}_{n}$ is defined in (228). Then, using the definition of $m_{i}(u_{1}^{n})$ in (87) and (264), we obtain

[TABLE]

where $x_{1}^{n}=\mathsf{S}^{\prime}u_{1}^{n}$ . The random variable $m_{i}(u_{1}^{n})$ in the form of (265) is hard to analyze since we do not have a simple expression for $\hat{\lambda}_{n}$ . By replacing $\hat{\lambda}_{n}$ with $\lambda^{\star}_{n}$ , we define another random variable $\bar{m}_{i}(u_{1}^{n})$ that turns out to be easier to analyze:

[TABLE]

Plugging (227) and (231) into (266), we obtain

[TABLE]

where $\theta_{n}$ is the $n$ -th order water level in (51) and $x_{1}^{n}=\mathsf{S}^{\prime}u_{1}^{n}$ . The random variable $\bar{m}_{i}(U_{1}^{n})$ is much easier to analyze since $X_{i}/\sigma_{n,i}$ ’s are i.i.d. standard normal random variables. Moreover, in view of (51), their expectations satisfy

[TABLE]

Since $X_{i}/\sigma_{n,i}$ has bounded moments, from the Berry-Esseen theorem, we know that there exists a constant $\omega>0$ such that for all $n$ large enough

[TABLE]

where $\eta_{n}$ is in (88) above, and $C_{1},C_{2}$ are positive constants. In the last step of the program, we control the difference between $m_{i}(U_{1}^{n})$ and $\bar{m}_{i}(U_{1}^{n})$ . From (265)–(266), we have

[TABLE]

For $i=1$ , we have $\nu_{n,1}^{2}=\sigma_{n,1}^{2}-\theta_{n}=\Theta\left(a^{2n}\right)$ , $\hat{\lambda}_{n}=\Theta(1)$ and $\lambda^{\star}_{n}=\Theta(1)$ . This implies that the summands in (270) for $i=1$ are both of the order $O(1/n)$ for any $x_{1}^{2}=O(a^{4n})$ . For $2\leq i\leq n$ , the condition (89) and the variance perturbation result in Theorem 12 imply that every summand in (270) is of the order of $\eta_{n}$ . Hence, (270) is of the order of $\eta_{n}$ . Finally, combining (269) and (270), we conclude that, conditioned on (89) and (90), the condition (91) is violated with probability at most $\Theta(1/\log n)$ . ∎

F-D Auxiliary Lemmas

Lemma 7 (Lower bound on the probability of distortion balls).

Fix $d\in(0,d_{\mathrm{max}})$ . For any $n$ large enough and any $u_{1}^{n}\in\mathcal{T}(n,p)$ defined in Definition 3 in Section IV-F above, and $\gamma$ defined by

[TABLE]

for a constant $B_{4}>0$ specified in (299) below, it holds that

[TABLE]

where $K_{1}>0$ is a constant and $\hat{F}_{1}^{\star n}$ is in Appendix F-A above.

The proof is in Appendix F-F.

Lemma 8.

Fix $d\in(0,d_{\mathrm{max}})$ and $\epsilon\in(0,1)$ . There exist constants $C$ and $K_{2}>0$ such that for all $n$ large enough,

[TABLE]

where $\lambda^{\star}_{n}$ and $\hat{\lambda}_{n}$ are defined in (226) and (228), respectively.

Proof.

The proof of Lemma 8 is the same as [48, Eq. (314)–(333)] except that we strengthen the right side of [48, Eq. (322)] to be $\Theta(e^{-cn})$ for a constant $c>\log(a)/2$ due to Corollary 1. ∎

F-E Proof of Lemma 4

Using Lemmas 7 and 8 in Appendix F-D above, the proof of Lemma 4 is almost the same as that in the stationary case [48, Eq. (270)-(278)]. For completeness, we sketch the proof here. We weaken the bound [26, Lem. 1] by setting $P_{\hat{X}}$ as $P_{\hat{X}_{1}^{n}}$ and $P_{Y}$ as $P_{Y_{1}^{\star n}}$ to obtain that for any $x_{1}^{n}\in\mathbb{R}^{n}$ ,

[TABLE]

where $\hat{\lambda}_{n}$ in (228) depends on $X_{1}^{n}$ . Let $\mathcal{E}$ denote the event inside the square brackets in (81). Then,

[TABLE]

where

•

(276) is due to (274) and Lemma 7;

•

From (276) to (277), we used the fact that for $u_{1}^{n}\in\mathcal{T}(n,p)$ , $\hat{\lambda}_{n}$ can be bounded by

[TABLE]

where $B_{1}>0$ is a constant and $\theta>0$ is given by (54). The bound (279) is obtained by the same argument as that in the stationary case [48, Eq. (273)]; $\gamma$ is chosen in (271) above; the constants $c_{i}$ , $i=1,\ldots,4$ , in (82) are chosen as

[TABLE]

where $B_{4}>0$ is given in (299) below, and $K_{1}$ and $C$ are the constants in Lemmas 7 and 8, respectively.

•

(278) is due to Lemma 8 and Theorem 8.

∎

F-F Proof of Lemma 7

Proof.

The proof is similar to the stationary case [48, Lem. 10]. We streamline the proof and point out the differences. Conditioned on $\hat{X}_{1}^{n}=x_{1}^{n}$ , the random variable

[TABLE]

follows a noncentral $\chi^{2}$ -distribution with (at most) $n$ degrees of freedom, since it is shown in [48, Eq. (282) and Lem. 4] that conditioned on $\hat{X}_{1}^{n}=x_{1}^{n}$ , the distribution of the random variable $\hat{F}_{i}^{\star}-x_{i}$ is given by

[TABLE]

where the $\nu_{n,i}^{2}$'s are given in (231). Then, the conditional expectation is given by

[TABLE]

where $m_{i}(u_{1}^{n})$ is defined in (87) in Section IV-E above. In view of (284), (286) and (91), we expect that $\mathsf{d}\left(x_{1}^{n},\hat{F}_{1}^{\star n}\right)$ concentrates around $d$ conditioned on $\hat{X}_{1}^{n}=x_{1}^{n}$ for $u_{1}^{n}\in\mathcal{T}(n,p)$. Note that the proof of Theorem 8 related to (91) differs from the one in the stationary case; see Appendix F-C above for the details. To simplify notation, we denote the variances as

[TABLE]
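The concentration heuristic above rests on the first two moments of an average of squared Gaussians. As a minimal sketch, assume the conditional components $\hat{F}_{i}^{\star}-x_{i}$ are independent Gaussians with hypothetical means `mu[i]` and variances `nu2[i]` (illustrative placeholders, not the values in (285)). Under that assumption the average of their squares has mean $\frac{1}{n}\sum_{i}(\mu_{i}^{2}+\nu_{i}^{2})$ and variance $\frac{1}{n^{2}}\sum_{i}(2\nu_{i}^{4}+4\mu_{i}^{2}\nu_{i}^{2})$. The Python snippet below compares these closed forms with a Monte Carlo estimate.

```python
import math
import random


def squared_gaussian_moments(mu, nu2):
    """Mean and variance of (1/n) * sum_i G_i^2 for independent G_i ~ N(mu[i], nu2[i])."""
    n = len(mu)
    mean = sum(m * m + v for m, v in zip(mu, nu2)) / n
    var = sum(2 * v * v + 4 * m * m * v for m, v in zip(mu, nu2)) / (n * n)
    return mean, var


def simulate_average_square(mu, nu2, trials, seed=0):
    """Monte Carlo samples of (1/n) * sum_i G_i^2 under the same independence assumption."""
    rng = random.Random(seed)
    n = len(mu)
    sd = [math.sqrt(v) for v in nu2]
    return [
        sum(rng.gauss(m, s) ** 2 for m, s in zip(mu, sd)) / n
        for _ in range(trials)
    ]


# Hypothetical parameters, chosen only for illustration.
n = 50
mu = [0.1 * (i % 5) for i in range(n)]
nu2 = [1.0 + 0.5 * (i % 3) for i in range(n)]

mean, var = squared_gaussian_moments(mu, nu2)
samples = simulate_average_square(mu, nu2, trials=4000)
emp_mean = sum(samples) / len(samples)
emp_var = sum((s - emp_mean) ** 2 for s in samples) / (len(samples) - 1)

print(f"mean:     closed form {mean:.4f}, simulated {emp_mean:.4f}")
print(f"variance: closed form {var:.5f}, simulated {emp_var:.5f}")
```

The variance of the average shrinks like $1/n$ when the individual means and variances stay bounded, which is the sense in which the distortion is expected to concentrate.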

Due to (285) and (91), the $(\hat{F}_{i}^{\star}-x_{i})^{2}$'s have finite second- and third-order absolute moments. That is, we have

[TABLE]

for $u_{1}^{n}\in\mathcal{T}(n,p)$ . Therefore, we can apply the Berry-Esseen theorem. Hence,

[TABLE]

where

• (291) follows from the Berry-Esseen theorem; $B_{1}>0$ is a constant, and

[TABLE]

is the cumulative distribution function of the standard Gaussian distribution;

• (292) is due to the mean value theorem and

[TABLE]

• In (292), $\xi$ satisfies

[TABLE]

By (91) and (289), we see that there is a constant $B_{2}>0$ such that

[TABLE]

Hence, as long as $\gamma$ in (295) satisfies

[TABLE]

where $\eta_{n}$ is defined in (88), there exists a constant $B_{3}>0$ such that

[TABLE]

Let $B_{4}>0$ be a constant such that

[TABLE]

and choose $\gamma$ as in (271), which satisfies (297). Then, plugging the bounds (289), (298), (299) and (271) into (292), we conclude that there exists a constant $K_{1}>0$ such that (292) is further bounded from below by $\frac{K_{1}}{\sqrt{n}}$ . ∎
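The last step of this argument combines a Berry-Esseen error of order $1/\sqrt{n}$ with a mean value theorem bound on a Gaussian window. The sketch below illustrates that mechanism in isolation and under stated assumptions: the normalized window is taken to be $[t,\,t+w/\sqrt{n}]$ with hypothetical constants `t` and `w`, and `b1` is a hypothetical Berry-Esseen constant; none of these are the constants appearing in (291)-(299). Since the Gaussian density $\phi$ is unimodal, $\Phi(b)-\Phi(a)=\phi(\xi)(b-a)\geq(b-a)\min\{\phi(a),\phi(b)\}$, so the probability of the window is at least $(w\,\phi_{\min}-2b_{1})/\sqrt{n}$, which has the form $K_{1}/\sqrt{n}$ once $w$ is large enough.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian: .cdf is Phi, .pdf is phi


def window_lower_bound(t, w, b1, n):
    """Lower bound on P[t < S_n <= t + w/sqrt(n)], assuming
    sup_x |F_n(x) - Phi(x)| <= b1/sqrt(n) (a Berry-Esseen type bound)."""
    a, b = t, t + w / math.sqrt(n)
    gaussian_mass = STD_NORMAL.cdf(b) - STD_NORMAL.cdf(a)
    return gaussian_mass - 2.0 * b1 / math.sqrt(n)


def mvt_lower_bound(t, w, b1, n):
    """Relaxation via the mean value theorem:
    Phi(b) - Phi(a) >= (b - a) * min(phi(a), phi(b))."""
    a, b = t, t + w / math.sqrt(n)
    phi_min = min(STD_NORMAL.pdf(a), STD_NORMAL.pdf(b))
    return (w * phi_min - 2.0 * b1) / math.sqrt(n)


# Hypothetical constants, for illustration only.
t, w, b1 = -0.5, 4.0, 0.4
for n in (100, 1000, 10000):
    exact = window_lower_bound(t, w, b1, n)
    relaxed = mvt_lower_bound(t, w, b1, n)
    print(
        f"n={n:6d}  Berry-Esseen bound={exact:.5f}  "
        f"MVT bound={relaxed:.5f}  sqrt(n)*MVT={relaxed * math.sqrt(n):.4f}"
    )
```

With these illustrative values the rescaled quantity `sqrt(n) * MVT bound` stays constant in `n`, which mirrors how a window of width proportional to $1/\sqrt{n}$ can yield a lower bound of order $1/\sqrt{n}$ provided the window constant dominates the Berry-Esseen constant.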

Bibliography

[1] P. Tian and V. Kostina, “From parameter estimation to dispersion of nonstationary Gauss-Markov processes,” in Proceedings of 2019 IEEE International Symposium on Information Theory, Paris, France, July 2019, pp. 2044–2048.
[2] H. B. Mann and A. Wald, “On the statistical treatment of linear stochastic difference equations,” Econometrica, Journal of the Econometric Society, vol. 11, no. 3, pp. 173–220, July 1943.
[3] H. Rubin, “Consistency of maximum likelihood estimates in the explosive case,” Statistical Inference in Dynamic Economic Models, pp. 356–364, Jan. 1950.
[4] J. S. White, “The limiting distribution of the serial correlation coefficient in the explosive case,” The Annals of Mathematical Statistics, pp. 1188–1197, Dec. 1958.
[5] T. W. Anderson, “On asymptotic distributions of estimates of parameters of stochastic difference equations,” The Annals of Mathematical Statistics, pp. 676–687, Sep. 1959.
[6] J. Rissanen and P. Caines, “The strong consistency of maximum likelihood estimators for ARMA processes,” The Annals of Statistics, pp. 297–315, Mar. 1979.
[7] N. H. Chan and C.-Z. Wei, “Asymptotic inference for nearly nonstationary AR(1) processes,” The Annals of Statistics, pp. 1050–1063, Sep. 1987.
[8] B. Bercu, F. Gamboa, and A. Rouault, “Large deviations for quadratic forms of stationary Gaussian processes,” Stochastic Processes and their Applications, vol. 71, no. 1, pp. 75–90, Oct. 1997.
