Nonstationary Gauss-Markov Processes: Parameter Estimation and Dispersion
Peida Tian, Victoria Kostina

TL;DR
This paper analyzes the maximum likelihood estimation error for a nonstationary Gauss-Markov process, providing tight nonasymptotic bounds and applying these results to determine the source dispersion in lossy compression.
Contribution
It introduces a tight nonasymptotic error bound for parameter estimation in nonstationary Gauss-Markov processes and extends dispersion analysis to the nonstationary case.
Findings
Bound on estimation error decays exponentially and is tight for hundreds of samples.
Dispersion formula for nonstationary sources matches that of stationary sources under certain conditions.
New eigenvalue bounding techniques for covariance matrices in nonstationary processes.
P. Tian and V. Kostina are with the Department of Electrical Engineering, California Institute of Technology (e-mail: {ptian, vkostina}@caltech.edu). This research was supported in part by the National Science Foundation (NSF) under Grant CCF-1751356. A preliminary version [1] of this paper was presented at the 2019 IEEE International Symposium on Information Theory.
Copyright ©2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
Abstract
This paper provides a precise error analysis for the maximum likelihood estimate $\hat{a}_{\mathrm{ML}}(u)$ of the parameter $a$ given samples $u = (u_1, \ldots, u_n)^\top$ drawn from a nonstationary Gauss-Markov process $U_i = a U_{i-1} + Z_i$, $i \geq 1$, where $U_0 = 0$, $a > 1$, and $Z_i$’s are independent Gaussian random variables with zero mean and variance $\sigma^2$. We show a tight nonasymptotic exponentially decaying bound on the tail probability of the estimation error. Unlike previous works, our bound is tight already for a sample size of the order of hundreds. We apply the new estimation bound to find the dispersion for lossy compression of nonstationary Gauss-Markov sources. We show that the dispersion is given by the same integral formula that we derived previously for the asymptotically stationary Gauss-Markov sources, i.e., $|a| < 1$. New ideas in the nonstationary case include separately bounding the maximum eigenvalue (which scales exponentially) and the other eigenvalues (which are bounded by constants that depend only on $a$) of the covariance matrix of the source sequence, and new techniques in the derivation of our estimation error bound.
Index Terms:
Parameter estimation, maximum likelihood estimator, unstable processes, finite blocklength analysis, lossy compression, sources with memory, rate-distortion theory, system identification, covering in stochastic processes, adaptive control.
I Introduction
I-A Overview
We consider two related problems that concern a scalar Gauss-Markov process $\{U_i\}_{i \geq 1}$, defined by $U_0 = 0$ and
$$U_i = a U_{i-1} + Z_i, \quad \forall i \geq 1, \tag{1}$$
where $Z_i$’s are independent Gaussian random variables with zero mean and variance $\sigma^2$.
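As an illustration of the recursion in (1), the following minimal sketch draws one sample path. It assumes the process starts from zero and is driven by independent zero-mean Gaussian noise; the function name `simulate_gauss_markov` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def simulate_gauss_markov(a, sigma, n, seed=None):
    """Return [U_1, ..., U_n] for U_i = a * U_{i-1} + Z_i with U_0 = 0.

    Z_i are independent N(0, sigma^2) draws (illustrative helper, not from the paper).
    """
    rng = random.Random(seed)
    path = []
    prev = 0.0  # U_0 = 0
    for _ in range(n):
        prev = a * prev + rng.gauss(0.0, sigma)
        path.append(prev)
    return path


# Example: a nonstationary path (a > 1) tends to grow in magnitude like a^n.
path = simulate_gauss_markov(a=1.2, sigma=1.0, n=50, seed=0)
print(len(path), abs(path[-1]))
```

Fixing the seed makes the path reproducible, which is convenient when comparing estimators on the same realization.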
The first problem is parameter estimation: given samples drawn from the Gauss-Markov source, we seek to design and analyse estimators for the unknown system parameter . The consistency and asymptotic distribution of the maximum likelihood (ML) estimator have been studied in the literature [2, 3, 4, 5, 6, 7]. Our main contribution is a large deviation bound on the estimation error of the ML estimator. Our numerical experiments indicate that our new bound is tighter than previously known results [8, 9, 10].
The second problem is the nonasymptotic performance of the optimal lossy compressor of the Gauss-Markov process. An encoder outputs bits for each realization . Once the decoder receives the bits, it produces as a reproduction of . The distortion between and is measured by the mean squared error (MSE). Two commonly used criteria to quantify the distortion of a lossy compression scheme are the average distortion criterion and the excess-distortion probability criterion. The rate-distortion theory, initiated by Shannon [11] and further pioneered in [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], studies the optimal tradeoff between the rate and the distortion. In the limit of large blocklength , the minimum rate required to achieve average distortion is given by the rate-distortion function. The nonasymptotic version of the rate-distortion problem [21, 23, 24, 25, 26] studies the rate-distortion tradeoff for finite blocklength . Our main contribution is a coding theorem that characterizes the gap between the rate-distortion function and the minimum rate at blocklength for the nonstationary Gauss-Markov source (), under the excess-distortion probability criterion. We leverage our result on the ML estimator to analyze lossy compression. Namely, we apply our bound on the estimation error of the ML estimator to construct a typical set of the sequences whose estimated parameter is close to the true . We then use the typical set in our achievability proof of the nonasymptotic coding theorem.
Without loss of generality, we assume that $a \geq 0$ in this paper, since, otherwise, we can consider another random process $\{U_i'\}$ defined by the invertible mapping $U_i' \triangleq (-1)^i U_i$ that satisfies $U_i' = (-a) U_{i-1}' + (-1)^i Z_i$, where $(-1)^i Z_i$’s are also independent zero-mean Gaussian random variables with variance $\sigma^2$. We distinguish the following three cases:
- $0 \leq a < 1$: the asymptotically stationary case;
- $a = 1$: the unit-root case;
- $a > 1$: the nonstationary case.
In this paper, we mostly focus on the nonstationary case.
I-B Motivations
Estimation of parameters of stochastic processes from their realizations has many applications. In the statistical analysis of economic time series [2, 27, 28], the Gauss-Markov process is used to model the varying price of a certain commodity with time, and the ML estimate of the unknown coefficient is then used to predict future prices. In [29] and [30, Sec. 5], the Gauss-Markov process with is used to model the stochastic structure of the velocity of money. The Gauss-Markov process, also known as the autoregressive process of order 1 (AR(1)), is a special case of the general autoregressive-moving-average (ARMA) model [31, 32], for which various estimation and prediction procedures have been proposed, e.g. the Box-Jenkins method [32]. The Gauss-Markov process is also a special case of the linear state-space model (e.g. [33, Chap. 5]) that is popular in control theory. One of the problems in control is system identification [34], which is the problem of building mathematical models using measured data from unknown dynamical systems. Parameter estimation is one of the common methods used in system identification where the dynamical system is modeled by a state-space model [34, Chap. 7] with unknown parameters. In modern data-driven control systems, where the goal is to control an unknown nonstationary system given measured data, parameter estimation methods are used as a first step in designing controllers [10] [35, Sec. 1.2]. In speech signal processing, the linear predictive coding algorithm [36] relies on parameter estimation (the ordinary least squares estimate, or, equivalently, the maximum likelihood estimate assuming Gaussian noise) to fit a higher-order Gauss-Markov process, see [36, App. C]. A fine-grained analysis of the ML estimate is instrumental in optimizing the design of all these systems. 
Our nonasymptotic analysis leading up to a large deviation bound for the ML estimate in our simple setting can provide insights for analyzing more complex random processes, e.g., higher-order autoregressive processes and vector systems.
Understanding finite-blocklength lossy compression of the Gauss-Markov process fits into a continuing effort by many researchers to advance the rate-distortion theory of information sources with memory, see [13, 14, 15, 17, 18, 20, 22, 37, 38, 39, 40, 41, 19, 42, 43, 44], as well as into a newer push [21, 23, 24, 25, 26, 45, 46, 47, 48, 49, 50] to understand the fundamental limits of low latency communication. There is a tight connection between lossy compression of the nonstationary Gauss-Markov process and control of an unstable linear system under communication constraints [51, 52]. Namely, the minimum channel capacity needed to achieve a given LQG (linear quadratic Gaussian) cost for the plant [51, Eq. (1)] is lower-bounded by the causal rate-distortion function of the Gauss-Markov process [51, Eq. (9)]. See [52, Th. 1] for more details. Being more restrictive on the coding schemes, the causal rate-distortion function is further lower-bounded by the traditional rate-distortion function. The result in this paper on the rate-distortion tradeoff in the finite blocklength regime provides a lower bound on the minimum communication rate required to ensure that the LQG cost stays below a desired threshold with desired probability at the end of a finite horizon. Finally, the aforementioned linear predictive coding algorithm [36] is connected to lossy compression of autoregressive processes, see a recent historical note by Gray [53, p.2].
I-C Notations
For , we use to denote the set . We use the standard notations for the asymptotic behaviors , , and . Namely, let and be two functions of , then means that there exists a constant and such that for any ; means ; means there exist positive constants and such that for any ; if and only if ; and if and only if . For a matrix , we denote by its transpose, by its operator norm (the largest singular value) and by its eigenvalues listed in nondecreasing order. We use to denote the complement of a set . All logarithms and exponentials are base .
II Previous Works
II-A Parameter Estimation
The maximum likelihood (ML) estimate $\hat{a}_{\mathrm{ML}}(u)$ of the parameter $a$ given samples $u = (u_1, \ldots, u_n)^\top$ drawn from the Gauss-Markov source is given by
$$\hat{a}_{\mathrm{ML}}(u) = \frac{\sum_{i=1}^{n-1} u_i u_{i+1}}{\sum_{i=1}^{n-1} u_i^2}. \tag{2}$$
The derivation of (2) is straightforward, e.g. [48, App. F-A]. The problem is to provide performance guarantees of $\hat{a}_{\mathrm{ML}}(u)$. This simply formulated problem has been widely studied in the literature. Our main contribution in this paper is a nonasymptotic fine-grained large deviations analysis of the estimation error.
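A minimal sketch of the estimator follows. It assumes that (2) is the usual least-squares ratio for a first-order autoregression, namely the sum of products of consecutive samples divided by the sum of squared samples; the function name `ml_estimate` is hypothetical.

```python
def ml_estimate(u):
    """Estimate a from samples u = [u_1, ..., u_n] of U_i = a * U_{i-1} + Z_i.

    Assumed form: sum_{i<n} u_i * u_{i+1} / sum_{i<n} u_i^2 (least-squares ratio).
    """
    if len(u) < 2:
        raise ValueError("need at least two samples")
    num = sum(u[i] * u[i + 1] for i in range(len(u) - 1))
    den = sum(u[i] ** 2 for i in range(len(u) - 1))
    if den == 0.0:
        raise ValueError("all regressors are zero; the estimate is undefined")
    return num / den


# Noise-free sanity check: a geometric sequence with ratio 2 is recovered exactly.
print(ml_estimate([1.0, 2.0, 4.0, 8.0]))
```

In the noise-free case every consecutive ratio equals the true parameter, so the estimate coincides with it; with noise, the estimation error is the ratio of the noise-regressor cross term to the regressor energy.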
The estimate in (2) has been extensively studied in the statistics [4, 6] and economics [2, 3] communities. Mann and Wald [2] and Rubin [3] showed that the estimation error converges to 0 in probability for any . Rissanen and Caines [6] later proved that converges to 0 almost surely for . To better understand the finer scaling of the error , researchers turned to study the limiting distribution of the normalized estimation error for a careful choice of the standardizing function :
[TABLE]
With the above choices of , Mann and Wald [2] and White [4] showed that the distribution of the normalized estimation error converges to for ; to the standard Cauchy distribution for ; and for , to the distribution of
[TABLE]
where is a Brownian motion.
Generalizations of the above results in several directions have also been investigated. In [2, Sec. 4], the maximum likelihood estimator for the -th order stationary autoregressive processes with ’s being i.i.d. zero-mean and bounded moments random variables (not necessarily Gaussian) was shown to be weakly consistent, and the scaled estimation errors for were shown to converge in distribution to the Gaussian random variables as tends to infinity. Anderson [5, Sec. 3] studied the limiting distribution of the maximum likelihood estimator for a nonstationary vector version of the process (1). Chan and Wei [7] studied the performance of the estimation error when is not a constant but approaches to 1 from below in the order of . Estimating from a block of outcomes of the Gauss-Markov source (1) is one of the simplest versions of the problem of system identification, where the goal is to learn system parameters of a dynamical system from the observations [54, 55, 56, 57, 10]. One objective of those studies is to obtain tight performance bounds on the least-squares estimates of the system parameters from a single input / output trajectory in the following state-space model, e.g. [55, Eq. (1)–(2)]:
[TABLE]
where ’s are random vectors of certain dimensions and the system parameters are matrices of appropriate dimensions. The Gauss-Markov process in (1) can be written as the state-space model by choosing being a scalar, , and . For stable vector systems, that is, , Oymak and Ozay [55, Thm. 3.1] showed that the estimation error in spectral norm is with high probability, where is the number of samples. For the subclass of the regular unstable systems [57, Def. 3], Faradonbeh et al. [57, Thm. 1] proved that the probability of estimation error exceeding a positive threshold in spectral norm decays exponentially in . For the Gauss-Markov processes considered in the present paper, Simchowitz et al. [54, Thm. B.1] and Sarkar and Rakhlin [56, Prop. 4.1] presented tail bounds on the estimation error of the ML estimate.
Another line of work closely related to this paper is the large deviation principle (LDP) [58, Ch. 1.2] on . Given an error threshold , define and as follows:
[TABLE]
We also define as
[TABLE]
The large deviation theory studies the rate functions, defined as the limits of , and , as goes to infinity. Bercu et al. [8, Prop. 8] found the rate function for the case of . For , Worms [9, Thm. 1] proved that the rate functions can be bounded from below implicitly by the optimal value of an optimization problem.
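To give a feel for these quantities, the sketch below estimates the two one-sided error exponents by Monte Carlo. It assumes that the exponents are the normalized negative logarithms of the probabilities of overestimating and underestimating the parameter by more than the threshold, and that the estimator is the least-squares ratio; `empirical_exponents` and its arguments are hypothetical names. Monte Carlo is only informative when the events are not too rare relative to the number of trials.

```python
import math
import random


def empirical_exponents(a, eta, n, trials, sigma=1.0, seed=0):
    """Monte Carlo estimates of -(1/n) log P[err > eta] and -(1/n) log P[err < -eta].

    err is the least-squares estimate minus the true a (assumed definitions).
    """
    if n < 2:
        raise ValueError("need n >= 2")
    rng = random.Random(seed)
    over = under = 0
    for _ in range(trials):
        u = rng.gauss(0.0, sigma)  # U_1 = Z_1 because U_0 = 0
        num = den = 0.0
        for _ in range(n - 1):
            nxt = a * u + rng.gauss(0.0, sigma)
            num += u * nxt
            den += u * u
            u = nxt
        err = num / den - a
        if err > eta:
            over += 1
        elif err < -eta:
            under += 1

    def exponent(count):
        # No observed error event: the empirical exponent is infinite.
        return math.inf if count == 0 else -math.log(count / trials) / n

    return exponent(over), exponent(under)


print(empirical_exponents(a=1.2, eta=0.1, n=10, trials=20000))
```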
These studies of the limiting distribution and the LDP of the estimation error are asymptotic. In this paper, we develop a nonasymptotic analysis of the estimation error. Two nonasymptotic lower bounds on and are available in the literature. For any , Rantzer [10, Th. 4] showed that
[TABLE]
Bercu and Touati [59, Cor. 5.2] proved that
[TABLE]
where is the unique positive solution to in . Both bounds (10) and (11) do not capture the dependence on and , and are the same for and . The bounds in [54, 55, 56, 57, 10] either are optimal only order-wise or involve implicit constants. Our main result on parameter estimation is a tight nonasymptotic lower bound on and . For larger , the lower bound becomes larger, which suggests that unstable systems are easier to estimate than stable ones, an observation consistent with [54]. The proof is inspired by Rantzer [10, Lem. 5], but our result improves Rantzer’s result (10) and Bercu and Touati’s result (11), see Fig. 1 for a comparison. Most of our results generalize to the case where ’s are i.i.d. sub-Gaussian random variables, see Theorem 4 in Section III-D below.
II-B Nonasymptotic Rate-distortion Theory
The rate-distortion theory studies the problem of compressing a generic random process with minimum distortion. Given a distortion threshold , an excess-distortion probability and the number of codewords , an lossy compression code for a random vector consists of an encoder , and a decoder , such that , where is the distortion measure. This paper considers the mean squared error (MSE) distortion: ,
[TABLE]
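As a small illustration, the sketch below computes the per-sample mean squared error between a source vector and its reproduction and checks the excess-distortion event for a given threshold. It assumes the distortion is the squared Euclidean distance normalized by the block length; the function names are hypothetical.

```python
def mse_distortion(u, v):
    """Per-sample squared error (1/n) * sum_i (u_i - v_i)^2 (assumed normalization)."""
    if len(u) != len(v) or not u:
        raise ValueError("u and v must be nonempty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


def exceeds_distortion(u, v, d):
    """True if the reproduction v misses the distortion threshold d for source u."""
    return mse_distortion(u, v) > d


print(mse_distortion([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(exceeds_distortion([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], d=1.0))
```

Under the excess-distortion criterion, a code is judged by the probability of the event tested by `exceeds_distortion`, rather than by the average of `mse_distortion`.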
The minimum achievable code size and source coding rate are defined respectively by
[TABLE]
In this paper, we approximate the nonasymptotic coding rate for the nonstationary Gauss-Markov source.
Another related and widely studied setting is compression under the average distortion criterion. Given a distortion threshold and the number of codewords , an lossy compression code for a random vector consists of an encoder , and a decoder , such that . Similarly, one can define and as the minimum achievable code size and source coding rate, respectively, under the average distortion criterion. The traditional rate-distortion theory [11, 12, 17, 18, 15, 16] showed that the limit of the operational source coding rate as tends to infinity equals the informational rate-distortion function for a wide class of sources. For discrete memoryless sources, Zhang, Yang and Wei in [23] showed that approaches the rate-distortion function as . For abstract alphabet memoryless sources, Yang and Zhang in [24, Th. 2] showed a similar convergence rate.
Under the excess-distortion probability criterion, one can also study the nonasymptotic behavior of the minimum achievable excess-distortion probability :
[TABLE]
Marton’s excess distortion exponent [21, Th. 1, Eq. (2)-(3), (20)] showed that for discrete memoryless sources , it holds that
[TABLE]
where the minimization is over all probability distributions such that , where is such that is a constant, denotes the rate-distortion function of a discrete memoryless source with single-letter distribution , and denotes the Kullback-Leibler divergence. As pointed out by [25, p. 2], for fixed and , even the limit of as goes to infinity is unanswered by Marton’s bound in (16). Ingber and Kochman [25] (for finite-alphabet and Gaussian sources) and Kostina and Verdú [26] (for abstract sources) showed that the minimum achievable source coding rate admits the following expansion, known as Gaussian approximation [60].
[TABLE]
where is the dispersion of the source (defined as the variance of the tilted information random variable, details later) and denotes the inverse Q-function. In this paper, by extending our previous analysis [48, Th. 1] of the stationary Gauss-Markov source to the nonstationary one, we establish the Gaussian approximation in the form of (17) for the nonstationary Gauss-Markov sources. One of the key ideas behind this extension is to construct a typical set using the ML estimate of , and to use our estimation error bound to probabilistically characterize that set.
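The Gaussian approximation can be evaluated numerically once the rate-distortion function and the dispersion are known. The sketch below assumes the expansion has the common form: rate-distortion function plus the square root of dispersion over blocklength, times the inverse Q-function of the excess-distortion probability, with the remainder term dropped. The function names are hypothetical, and rates are in nats.

```python
import math
from statistics import NormalDist


def q_inverse(eps):
    """Inverse of the Gaussian tail function Q(x) = P[N(0,1) > x], for 0 < eps < 1."""
    return NormalDist().inv_cdf(1.0 - eps)


def gaussian_approx_rate(rate_distortion, dispersion, n, eps):
    """Second-order approximation R(d) + sqrt(V(d)/n) * Q^{-1}(eps), remainder ignored."""
    return rate_distortion + math.sqrt(dispersion / n) * q_inverse(eps)


# Example with illustrative numbers: R(d) = 1 nat, V(d) = 0.5, eps = 0.1.
for n in (100, 1000, 10000):
    print(n, gaussian_approx_rate(1.0, 0.5, n, 0.1))
```

For an excess-distortion probability below one half the correction is positive and shrinks like the inverse square root of the blocklength.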
III Parameter Estimation
III-A Nonasymptotic Lower Bounds
We first present our nonasymptotic bounds on and , defined in (7) and (8) above, respectively. We define two sequences and as follows. Let and be fixed constants. For and a parameter , let be the following sequence
[TABLE]
Similarly, let be the following sequence
[TABLE]
Note the subtle difference between (19) and (21): there is a negative sign in the numerator in (21). Both sequences depend on and . We derive closed-form expressions and analyze the convergence properties of and in Appendices A-B and A-C below. For and , we define the following sets
[TABLE]
Theorem 1.
For any constant , the estimator (2) satisfies for any ,
[TABLE]
where and are defined in (19) and (21), respectively, and and are defined in (22) and (23), respectively.
Theorem 1 is a useful result for numerically computing lower bounds on and . In Fig. 1, we plot our lower bounds in Theorem 1, previous results in (10) by Rantzer and (11) by Bercu and Touati, and a simulation result. As one can see, our bound in Theorem 1 is much tighter than previous results.
The proof of Theorem 1, presented in Appendix A-A below, is a detailed analysis of the Chernoff bound using the tower property of conditional expectations. The proof is motivated by [10, Lem. 5], but our analysis is more accurate and the result is significantly tighter, see Fig. 1 and Fig. 3 for comparisons. One recovers Rantzer’s lower bound (10) by setting and bounding as (due to the monotonicity of shown in Appendix A-B below) in Theorem 1. We explicitly state where we diverge from [10, Lem. 5] in the proof in Appendix A-A below.
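The first steps of such a Chernoff argument can be written out as follows. This is only a sketch under the assumption that estimator (2) is the least-squares ratio $\sum_{i=1}^{n-1} U_i U_{i+1} / \sum_{i=1}^{n-1} U_i^2$; the Chernoff parameter $s$ is notation introduced here for illustration, and the actual proof in Appendix A-A may organize the steps differently.

```latex
% Estimation error, using U_{i+1} = a U_i + Z_{i+1}:
\hat{a}_{\mathrm{ML}}(U) - a
  = \frac{\sum_{i=1}^{n-1} U_i Z_{i+1}}{\sum_{i=1}^{n-1} U_i^2}.

% Chernoff bound, valid for any s > 0:
\mathbb{P}\left[\hat{a}_{\mathrm{ML}}(U) - a > \eta\right]
  = \mathbb{P}\left[\sum_{i=1}^{n-1}\left(U_i Z_{i+1} - \eta U_i^2\right) > 0\right]
  \le \mathbb{E}\left[\exp\left(s\sum_{i=1}^{n-1}\left(U_i Z_{i+1} - \eta U_i^2\right)\right)\right].

% Tower property: condition on U_1, ..., U_{n-1}; Z_n is independent N(0, \sigma^2):
\mathbb{E}\left[\exp\left(s U_{n-1} Z_n - s\eta U_{n-1}^2\right) \,\middle|\, U_1, \ldots, U_{n-1}\right]
  = \exp\left(\left(\frac{\sigma^2 s^2}{2} - s\eta\right) U_{n-1}^2\right).
```

After this step the remaining expectation involves a quadratic function of $U_{n-1}$, which is conditionally Gaussian given $U_{n-2}$, so the same conditioning can be repeated backward in time, producing a recursively defined sequence of coefficients.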
Remark 1.
In view of the Gärtner-Ellis theorem [58, Th. 2.3.6], we conjecture that the bounds (24) and (25) can be reversed in the limit of large :
[TABLE]
and similarly for (25).
III-B Asymptotic Lower Bounds
We next present our bounds on the error exponents, that is, the limits of , and as tends to infinity. To take limits using (24) and (25), we need to understand the two sequences of sets and . Define the limits of the sets as
[TABLE]
We have the following properties.
Lemma 1.
Fix any constant .
- (Monotone decreasing sets) For any , we have
[TABLE]
- (Limits of the sets) It holds that
[TABLE]
[TABLE]
The proof of Lemma 1 is presented in Appendix A-D below. The exact characterization of and for each using is involved. One can see from the definitions (22) and (23) that
[TABLE]
To obtain the set from , we need to solve , which is equivalent to solving an additional inequality involving a polynomial of degree in (using the closed-form expression for in (128) in Appendix A-B below). Fig. 2 presents a plot of for . Despite the complexity of the sets and , Lemma 1 shows their monotonicity property and limits.
Combining Theorem 1 and Lemma 1, we obtain the following lower bounds on the error exponents. The proof is given in Appendix A-E below.
Theorem 2.
Fix any constant . For the ML estimator (2), the following three inequalities hold:
[TABLE]
where
[TABLE]
with the thresholds and given by
[TABLE]
Remark 2.
The results in (30)-(31) and (33)-(34) indicate the asymmetry between and : the set has a larger range than , and , which suggests that the maximum likelihood estimator is more likely to underestimate than to overestimate it.
Fig. 3 presents a comparison of (35), Rantzer’s bound (10) and Bercu and Touati (11). Our bound (35) is tighter than both of them for any .
III-C Decreasing Error Thresholds
When the number of samples increases, it is natural to have error threshold decrease. In this section, we consider the regime where the error threshold is a sequence decreasing to 0. In this setting, Theorem 1 still holds and the proof stays the same, except that we replace and , by the length- sequences and for , respectively, where and now depend on instead of a constant :
[TABLE]
The sequence is defined in a similar way. For Theorem 2 to remain valid, we require no smaller than to ensure that the right sides of (24)-(25) still converge to the right sides of (33)-(34), respectively. Let be a positive sequence such that
[TABLE]
Theorem 3.
For any and , let be a positive sequence satisfying (41). Then, Theorem 1 holds with replaced by , and by , and Theorem 2 holds with (33) and (34) replaced, respectively, by
[TABLE]
The proof of Theorem 3 is presented in Appendix A-F below. Theorem 3 is a quite strong result as it states that even if the error threshold is a sequence decreasing to zero, as long as (41) is satisfied, the probability of estimation error exceeding such decreasing thresholds is still exponentially small, with exponent being at least .
Corollary 1.
For any and any , there exists a constant such that for all large enough,
[TABLE]
Corollary 1 is used in Section IV-E below to derive the dispersion of nonstationary Gauss-Markov sources. The proof of Corollary 1 is by applying Theorem 3 with chosen as
[TABLE]
III-D Generalization to sub-Gaussian $Z_i$’s
In this section, we generalize the above results to the case where $Z_i$’s in (1) are zero-mean sub-Gaussian random variables. This general result is of independent interest and will not be used in the rest of the paper.
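Definition 1 (sub-Gaussian random variable, e.g. [61, Def. 2.7]).
Fix $\sigma > 0$. A random variable $X$ with mean $\mu$ is said to be $\sigma$-sub-Gaussian with variance proxy $\sigma^2$ if its moment-generating function (MGF) satisfies
$$\mathbb{E}\left[\exp\left(s(X - \mu)\right)\right] \leq \exp\left(\frac{\sigma^2 s^2}{2}\right) \tag{46}$$
for all $s \in \mathbb{R}$.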
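A quick numerical illustration of the definition: a Rademacher random variable (equal to plus or minus one with equal probability) has mean zero and MGF equal to the hyperbolic cosine, which is dominated by the Gaussian MGF with unit variance proxy. The sketch below checks this on a grid, assuming the standard MGF condition stated above; the function names are hypothetical.

```python
import math


def mgf_rademacher(s):
    """MGF of X uniform on {-1, +1}: E[exp(sX)] = cosh(s)."""
    return math.cosh(s)


def sub_gaussian_bound(s, sigma):
    """Right side of the sub-Gaussian MGF condition: exp(sigma^2 * s^2 / 2)."""
    return math.exp(sigma ** 2 * s ** 2 / 2.0)


grid = [k / 10.0 for k in range(-50, 51)]
# Holds with variance proxy 1, but fails with a smaller proxy such as 0.25.
print(all(mgf_rademacher(s) <= sub_gaussian_bound(s, 1.0) for s in grid))
print(all(mgf_rademacher(s) <= sub_gaussian_bound(s, 0.5) for s in grid))
```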
One important property of -sub-Gaussian random variables is the following well-known bound on the MGF of quadratic functions of -sub-Gaussian random variables.
Lemma 2 ([10, Prop. 2]).
Let be a -sub-Gaussian random variable with mean . Then
[TABLE]
for any .
Equality holds in (46) and (47) when is Gaussian. In particular, the right side of (47) is the MGF of the noncentral -distributed random variable .
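The Gaussian case can be checked numerically. For a Gaussian random variable with mean `mu` and variance `sigma^2`, the MGF of its square is known in closed form: `exp(mu^2 * s / (1 - 2 * sigma^2 * s)) / sqrt(1 - 2 * sigma^2 * s)` for `s < 1 / (2 * sigma^2)`. The sketch below compares this expression with a direct numerical integration; it is assumed here that this is the expression appearing on the right side of (47), and the function names are hypothetical.

```python
import math


def mgf_square_closed_form(s, mu, sigma):
    """E[exp(s X^2)] for X ~ N(mu, sigma^2), valid when s < 1 / (2 sigma^2)."""
    t = 1.0 - 2.0 * sigma ** 2 * s
    if t <= 0.0:
        raise ValueError("requires s < 1 / (2 * sigma^2)")
    return math.exp(mu ** 2 * s / t) / math.sqrt(t)


def mgf_square_numeric(s, mu, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule integral of exp(s x^2) against the N(mu, sigma^2) density."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += math.exp(s * x * x - (x - mu) ** 2 / (2.0 * sigma ** 2))
    return total * h / (sigma * math.sqrt(2.0 * math.pi))


print(mgf_square_closed_form(0.2, 1.0, 1.0), mgf_square_numeric(0.2, 1.0, 1.0))
```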
Theorem 4 (Generalization to sub-Gaussian case).
Theorems 1–3 and Lemma 1 remain valid for the estimator (2) when $Z_i$’s in (1) are i.i.d. zero-mean $\sigma$-sub-Gaussian random variables.
The generalizations of Theorems 1–3 and Lemma 1 from Gaussian to sub-Gaussian $Z_i$’s only require minor changes in the corresponding proofs. See Appendix A-G for the details.
IV The Dispersion of a Nonstationary Gauss-Markov Source
IV-A Rate-distortion functions
For a generic random process , the -th order (informational) rate-distortion function is defined as
[TABLE]
where is the -dimensional random vector determined by the random process, is the mutual information between and , is a given distortion threshold, and is the distortion measure defined in (12) in Sec. II-B above. The rate-distortion function is defined as
[TABLE]
For a wide class of sources, has been shown to be equal to the minimum achievable source coding rate under the average distortion criterion, in the limit of , see [11] for discrete memoryless sources and [12] for general ergodic sources. In particular, Gray’s coding theorem [17, Th. 2] for the Gaussian autoregressive processes directly implies that for the Gauss-Markov source in (1) for any , its rate-distortion function equals the minimum achievable source coding rate under the average distortion criterion as tends to infinity. The -th order rate-distortion function of the Gauss-Markov source is given by the -th order reverse waterfilling, e.g. [17, Eq. (22)]:
[TABLE]
where is the -th order water level, and ’s for (sorted in nondecreasing order) are the eigenvalues of the matrix with being an lower triangular matrix defined as
[TABLE]
One can check that is the covariance matrix of . The way that one uses (50)-(51) is to first solve the -th order water level using (51) for a given distortion threshold , and then to plug that water level into (50) to obtain . The rate-distortion function of the Gauss-Markov source is given by the limiting reverse waterfilling:
[TABLE]
where is the limiting water level and is a function from to given by
[TABLE]
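The two-step use of the finite-blocklength reverse waterfilling (first solve for the water level, then evaluate the rate) can be sketched as follows. The sketch assumes the textbook form of reverse waterfilling for independent Gaussian components: the distortion is the average of `min(theta, variance_i)` and the rate is the average of `max(0, 0.5 * log(variance_i / theta))`, in nats. The input `variances` stands for the variances of the decorrelated coordinates; the function names are hypothetical.

```python
import math


def water_level(variances, d):
    """Solve (1/n) * sum_i min(theta, variances[i]) = d for theta by bisection."""
    n = len(variances)
    if not 0.0 < d <= sum(variances) / n:
        raise ValueError("d must lie in (0, average variance]")
    lo, hi = 0.0, max(variances)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sum(min(mid, v) for v in variances) / n < d:
            lo = mid
        else:
            hi = mid
    return hi


def rate_nats(variances, d):
    """Rate (1/n) * sum_i max(0, 0.5 * log(variances[i] / theta)) at distortion d."""
    theta = water_level(variances, d)
    return sum(max(0.0, 0.5 * math.log(v / theta)) for v in variances) / len(variances)


# Two components with variances 4 and 1 at distortion 1: only the stronger one is coded.
print(water_level([4.0, 1.0], 1.0), rate_nats([4.0, 1.0], 1.0))
```

Bisection is used because the average of `min(theta, variance_i)` is continuous and nondecreasing in `theta`, so the water level is well defined for any admissible distortion.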
The rate-distortion function of the Gaussian memoryless source (the special case when is set to 0 in the Gauss-Markov model) is [11]
[TABLE]
One can obtain (56) from (53)-(54) by noting that for , which further simplifies (54) to , and (53) to (56). See Fig. 4 for a plot of and .
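The limiting reverse waterfilling can be evaluated by numerical integration. The sketch below rests on assumptions that are not visible in the text above: it takes `g(w) = 1 + a^2 - 2 a cos(w)`, the distortion as the average over `w` of `min(theta, sigma^2 / g(w))`, and the rate as the average over `w` of `0.5 * log(max(g(w), sigma^2 / theta))`, in nats. This form reduces to the memoryless formula when `a = 0`, but the function names and the exact parametrization are illustrative only.

```python
import math


def g(w, a):
    """Assumed spectral factor g(w) = 1 + a^2 - 2 a cos(w)."""
    return 1.0 + a * a - 2.0 * a * math.cos(w)


def _average(f, steps=20000):
    """(1 / 2 pi) * integral of f over [-pi, pi], via the midpoint rule."""
    h = 2.0 * math.pi / steps
    return sum(f(-math.pi + (k + 0.5) * h) for k in range(steps)) / steps


def distortion(theta, a, sigma2=1.0):
    """Distortion at water level theta (assumed form)."""
    return _average(lambda w: min(theta, sigma2 / g(w, a)))


def rate(theta, a, sigma2=1.0):
    """Rate in nats at water level theta (assumed form)."""
    return _average(lambda w: 0.5 * math.log(max(g(w, a), sigma2 / theta)))


# Memoryless check (a = 0): theta = d and the rate is 0.5 * log(sigma^2 / d).
print(distortion(0.25, 0.0), rate(0.25, 0.0), 0.5 * math.log(4.0))
# Nonstationary example (a = 1.5) at the largest water level sigma^2 / (a - 1)^2.
print(distortion(4.0, 1.5), rate(4.0, 1.5), math.log(1.5))
```

Under these assumptions the rate at the largest water level equals `log(a)` rather than zero, which is the qualitative difference from the stationary case.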
IV-B Operational Dispersion
To characterize the convergence rate of the minimum achievable source coding rate (defined in (14) in Section II-B above) to the rate-distortion function, we define the operational dispersion for the Gauss-Markov source as
[TABLE]
where denotes the inverse Q-function. The main result in the second part of this paper gives for the nonstationary Gauss-Markov source.
IV-C Informational Dispersion
The -tilted information [26, Def. 6] is the key random variable in our nonasymptotic analysis of . Under other names, the -tilted information has also been studied by Blahut [62, Th. 4] and Kontoyiannis [37, Sec. III-A]. Using the definition in [26, Def. 6], the -tilted information in is
[TABLE]
where is the negative slope of at the distortion level and is the random variable that achieves the infimum in (48) for . In [48, Lem. 7, Eq. (228)], by a decorrelation argument, we obtained the following expression for the -tilted information for the Gauss-Markov source: for any and any ,
[TABLE]
where is given by (51), with being an orthonormal matrix that diagonalizes , and
[TABLE]
with ’s being the eigenvalues of the matrix . We refer to the random variable , defined by
[TABLE]
as the decorrelation of . Note that the decorrelation has independent coordinates and
[TABLE]
Using (50)-(51) and (62), one can show [48, Eq. (55) and (228)] that the -tilted information in for the Gauss-Markov source satisfies . The minimum achievable source coding rates (defined in (14)) for lossy compression of and are equal, as are their rate-distortion functions: , see [48, Sec. III.A] for the details. It is known [26, Property 1] that the -tilted information satisfies (by the Karush-Kuhn-Tucker conditions for the optimization problem (48))
[TABLE]
The informational dispersion is defined as the limit of the variance of the -tilted information normalized by :
[TABLE]
By decorrelating the Gauss-Markov source and analyzing the limiting behavior of the eigenvalues of the covariance matrix of , we obtain the following reverse waterfilling representation for the informational dispersion. The proof is given in Appendix B-A below.
Lemma 3.
The informational dispersion of the nonstationary Gauss-Markov source is given by
[TABLE]
where is given in (54), and is in (55).
Notice that the informational dispersion in the nonstationary case is given by the same expression as in the stationary case [48, Eq. (57)]. It is known, e.g. [26, Eq. (94)] and [25, Sec. IV], that the informational dispersion for the Gaussian memoryless source is
[TABLE]
See Fig. 5 for a plot of and .
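The dispersion integral can likewise be evaluated numerically. The sketch below assumes the reverse waterfilling representation has the form `(1 / 4 pi)` times the integral over `w` of `min(1, (sigma^2 / (theta * g(w)))^2)`, with `g(w) = 1 + a^2 - 2 a cos(w)`; this is an assumption about the formula in Lemma 3, and the function names are hypothetical.

```python
import math


def g(w, a):
    """Assumed spectral factor g(w) = 1 + a^2 - 2 a cos(w)."""
    return 1.0 + a * a - 2.0 * a * math.cos(w)


def dispersion(theta, a, sigma2=1.0, steps=20000):
    """(1 / 4 pi) * integral over [-pi, pi] of min(1, (sigma2 / (theta g(w)))^2)."""
    h = 2.0 * math.pi / steps
    total = 0.0
    for k in range(steps):
        w = -math.pi + (k + 0.5) * h
        total += min(1.0, (sigma2 / (theta * g(w, a))) ** 2)
    # total / steps is the average over w; the extra factor 1/2 gives 1 / (4 pi).
    return 0.5 * total / steps


# At low water levels the integrand is 1 everywhere, giving the memoryless value 1/2.
print(dispersion(0.1, 1.5), dispersion(4.0, 1.5))
```

For water levels at or below `sigma^2 / (1 + a)^2` the minimum is always attained by 1, so the value coincides with the dispersion of the Gaussian memoryless source.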
IV-D A Few Remarks
In view of (54), there are two special water levels and , defined as follows:
[TABLE]
and
[TABLE]
The critical distortion is defined as the distortion corresponding to the water level . By (54), we have
[TABLE]
The maximum distortion is defined as the distortion corresponding to the water level . By (54), we have
[TABLE]
Using similar techniques as in [48, Eq. (169)–(172)], one can compute the integral in (70) as
[TABLE]
In this paper, we always consider a fixed distortion threshold such that .
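The two special distortions can be computed in closed form and cross-checked numerically. The sketch below assumes `g(w) = 1 + a^2 - 2 a cos(w)`, so that the smallest and largest water levels are `sigma^2 / (1 + a)^2` and `sigma^2 / (a - 1)^2`, the critical distortion equals the smallest water level, and the maximum distortion is the average of `sigma^2 / g(w)`, which evaluates to `sigma^2 / (a^2 - 1)` for `a > 1`. The function names are hypothetical.

```python
import math


def critical_distortion(a, sigma2=1.0):
    """Assumed d_c = sigma^2 / (1 + a)^2, the distortion at the smallest water level."""
    return sigma2 / (1.0 + a) ** 2


def max_distortion(a, sigma2=1.0):
    """Assumed d_max = sigma^2 / (a^2 - 1) for the nonstationary case a > 1."""
    if a <= 1.0:
        raise ValueError("this closed form is for a > 1")
    return sigma2 / (a * a - 1.0)


def max_distortion_numeric(a, sigma2=1.0, steps=20000):
    """(1 / 2 pi) * integral of sigma^2 / g(w) over [-pi, pi], via the midpoint rule."""
    h = 2.0 * math.pi / steps
    total = 0.0
    for k in range(steps):
        w = -math.pi + (k + 0.5) * h
        total += sigma2 / (1.0 + a * a - 2.0 * a * math.cos(w))
    return total / steps


print(critical_distortion(1.5), max_distortion(1.5), max_distortion_numeric(1.5))
```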
Remark 3.
Gray [17, Eq. (24)] showed the following relation between the rate-distortion function of the Gauss-Markov source and of the Gaussian memoryless source:
[TABLE]
Using Lemma 3 above, one can easily show (in the same way as [48, Cor. 1]) that their dispersions are also comparable:
[TABLE]
The results in (72)-(73) imply that for low distortions , the minimum achievable source coding rate in compressing the Gauss-Markov source and the Gaussian memoryless source are the same up to second-order terms, a phenomenon we observed in the stationary case as well [48, Cor. 1]. See Fig. 4 and Fig. 5 for a visualization of (72) and (73), respectively.
Remark 4.
For the function , we show that
[TABLE]
This result has an interesting connection to the problem of control under communication constraints: in [63] [64, Th. 1] [65, Prop. 3.1], it was shown that the minimum rate to asymptotically stabilize a linear, discrete-time, scalar system is also . The result in (74) implies that stability cannot be attained with any rate lower than even if an infinite lookahead is allowed. The derivation of (74) is presented in Appendix B-C below.
Remark 5.
Let and be the two special points on the curve at distortions and , respectively. Then, the coordinates of and are given by
[TABLE]
The derivation for is the same as that in the stationary case [48, Eq. (61)] except that we need to compute the residue at instead of at since we now have , see [48, App. B-A] for details.
IV-E Second-order Coding Theorem
Our main result establishes the equality between the operational dispersion and the informational dispersion.
Theorem 5 (Gaussian approximation).
For the Gauss-Markov source (1) with , any fixed excess-distortion probability , and distortion threshold , it holds that
[TABLE]
Specifically, we have the following converse and achievability.
Theorem 6 (Converse).
For the Gauss-Markov source with , any fixed excess-distortion probability , and distortion threshold , the minimum achievable source coding rate satisfies
[TABLE]
where denotes the inverse Q-function, is the rate-distortion function given in (53), and is the informational dispersion given by Lemma 3 above.
The converse proof is similar to that in the asymptotically stationary case in [48, Th. 7]. See Appendix D for the details.
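As a rough numerical illustration of the Gaussian approximation in Theorem 5, the following minimal Python sketch evaluates the first two terms of a second-order expansion. It assumes the usual form, namely the rate-distortion function plus the square root of (dispersion / blocklength) times the inverse Q-function evaluated at the excess-distortion probability, with the remainder term dropped. The function names and the numerical values used for the rate-distortion function and the dispersion are hypothetical placeholders, not values taken from this paper.

```python
from math import sqrt
from statistics import NormalDist


def q_inverse(eps):
    """Inverse of Q(x) = P[N(0, 1) > x], for 0 < eps < 1."""
    return NormalDist().inv_cdf(1.0 - eps)


def gaussian_approximation(rate_distortion, dispersion, n, eps):
    """First two terms R + sqrt(V / n) * Qinv(eps); the remainder is ignored."""
    return rate_distortion + sqrt(dispersion / n) * q_inverse(eps)


if __name__ == "__main__":
    # Hypothetical values of R(d) and V(d), in nats; not taken from the paper.
    r_d, v_d = 0.8, 0.5
    for n in (100, 1000, 10000):
        print(n, gaussian_approximation(r_d, v_d, n, eps=0.1))
```

For an excess-distortion probability below 1/2 the second term is positive and shrinks like the inverse square root of the blocklength, which is the backoff from the rate-distortion function that the dispersion quantifies.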
Theorem 7 (Achievability).
In the setting of Theorem 6, the minimum achievable source coding rate satisfies
[TABLE]
Theorem 5 follows immediately from Theorems 6 and 7. Central to the achievability proof of Theorem 7 is the following random coding bound: there exists an code such that [26, Cor. 11]
[TABLE]
where the infimization is over all random variables defined on and denotes the distortion -ball around :
[TABLE]
To obtain the achievability in (78) from (79), we need to bound from below the probability that falls within the distortion -ball , where and are independent, in terms of the informational dispersion. This connection is made via the following second-order refinement of the “lossy AEP” (asymptotic equipartition property [11, Lem. 1] [38, Th. 1] [26, Lem. 2]) that applies to the nonstationary Gauss-Markov sources.
Lemma 4 (Second-order lossy AEP for the nonstationary Gauss-Markov sources).
For the Gauss-Markov source with , let be the random variable that attains the minimum in (48) with there replaced by . It holds that
[TABLE]
where
[TABLE]
and ’s, , are positive constants depending only on and .
The proof of Lemma 4 is presented in Appendix F-E below. The proof of Theorem 7, which uses the random coding bound (79) and Lemma 4, is presented in Appendix E below.
IV-F The Connection between Lossy AEP and Parameter Estimation
The proof of lossy AEP in the form of Lemma 4 is technical even for stationary memoryless sources [26, Lem. 2]. A lossy AEP for stationary -mixing processes was derived in [38, Cor. 17]. For stationary memoryless sources with single-letter distribution , the idea in [26, Lem. 2] is to form a typical set of source outcomes [26, Lem. 4] using the product of the empirical distributions [26, Eq. (270)]: , where is the empirical distribution of a given source sequence , and then to show that the inequality inside the bracket in (81) holds for and that the probability of the complement set is at most , where and [26, Lem. 2]. The Gauss-Markov source is not memoryless, and it is nonstationary for . To form a typical set of source outcomes, we define the following proxy random variables using the estimator in (2).
Definition 2 (Proxy random variables).
For each sequence of length generated by the Gauss-Markov source, define the proxy random variable as an -dimensional Gaussian random vector with independent coordinates, each of which follows the distribution with
[TABLE]
where is in (2) above.
Remark 6.
The proxy random variable in Definition 2 differs from that in [48, Eq. (119)] for the stationary case in the behavior of the largest variance . For each realization , we construct the Gaussian random vector according to (84)-(85), which is a proxy to the decorrelation in (61) above. The variances of and are very close due to the closeness of to (Corollary 1).
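The closeness of the estimate to the true parameter, on which the proxy construction relies, can be observed in a small simulation. The sketch below is only illustrative and rests on two assumptions that are not restated in this section: that the source (1) is the recursion U_i = a U_{i-1} + Z_i started from U_0 = 0, and that the estimator (2) is the usual least-squares ratio of the sum of U_{i-1} U_i to the sum of U_{i-1}^2. The function names are hypothetical.

```python
import numpy as np


def simulate_gauss_markov(a, n, sigma=1.0, seed=0):
    """Draw U_1..U_n from U_i = a * U_{i-1} + Z_i with U_0 = 0 (assumed form of (1))."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=n)
    u = np.empty(n)
    prev = 0.0
    for i in range(n):
        prev = a * prev + z[i]
        u[i] = prev
    return u


def ml_estimate(u):
    """Least-squares ratio sum(U_{i-1} U_i) / sum(U_{i-1}^2) (assumed form of (2))."""
    past, present = u[:-1], u[1:]
    return float(np.dot(past, present) / np.dot(past, past))


if __name__ == "__main__":
    a = 1.2
    for n in (25, 50, 100, 200):
        err = abs(ml_estimate(simulate_gauss_markov(a, n)) - a)
        print(f"n={n:4d}  |a_hat - a| = {err:.3e}")
```

In the nonstationary regime the printed error typically shrinks very quickly with the sample size, in line with the exponentially decaying tail bound described in the abstract.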
Remark 7.
Since the proxy random variable depends on the realization of , Definition 2 defines the joint distribution of , where is the decorrelation of in (61) above.
The following convex optimization problem will be instrumental: for two generic random vectors and with distributions and , respectively, define
[TABLE]
where is the conditional relative entropy. See Appendix F-B for detailed discussions on this optimization problem.
For each realization (equivalently, each with the matrix defined in the text above (60)), we define random variables as follows.
- Let be the decorrelation of in (61) above. Let be the random variable that attains the infimum in .
- For each , choose in (86) to be the proxy random variable , and choose to be . Let be the random variable that attains the infimum in .
Then, for each , define
[TABLE]
Denote
[TABLE]
The typical set for the Gauss-Markov source is then defined as follows.
Definition 3 (Typical set).
For any , and a constant , define to be the set of vectors that satisfy the following conditions:
[TABLE]
where is the decorrelation (61) and ’s are defined in (60) above.
The typical set in Definition 3 is in the same form as that in the stationary case [48, Def. 2], but the definitions of proxy random variables and the analyses are different.
Theorem 8.
For any , there exists a constant such that the probability that the Gauss-Markov source produces a typical sequence satisfies
[TABLE]
Corollary 1 is essential to the proof of Theorem 8. See the details in Appendix F-C.
Let denote the event inside the square bracket in (81). To prove Lemma 4, we intersect with the typical set and the complement , respectively, and then we bound the probability of the two intersections separately. See Appendix F-E for the details.
V Discussion
V-A Stationary and Nonstationary Gauss-Markov Processes
It took several decades [13, 15, 17, 22, 19] to completely understand the difference in rate-distortion functions between stationary and nonstationary Gaussian autoregressive sources. We briefly summarize this subtle difference here to make the point that generalizing results from the stationary case to the nonstationary one is natural but nontrivial.
Since , the eigenvalues ’s of satisfy
[TABLE]
Using (93), we can equivalently rewrite (50) as
[TABLE]
where is in (51) and ’s are in (60). Both (50) and (94) are valid expressions for the -th order rate-distortion function , regardless of whether the source is stationary or nonstationary. The classical Kolmogorov reverse waterfilling result [13, Eq. (18)], obtained by taking the limit in (94), implies that the rate-distortion function of the stationary Gauss-Markov source () is given by (the subscript K stands for Kolmogorov)
[TABLE]
where is given in (54) and is given in (55). While (53) and (54) are valid for both stationary and nonstationary cases, Hashimoto and Arimoto [22] noticed in 1980 that (95) is incorrect for the nonstationary Gaussian autoregressive source. The reason is the different asymptotic behaviors of the eigenvalues ’s of (52) in the stationary and nonstationary cases: while in the stationary case, the spectrum is bounded away from zero, in the nonstationary case, the smallest eigenvalue approaches 0, causing a discontinuity. By treating that smallest eigenvalue in a special way, Hashimoto and Arimoto [22, Th. 2] showed that
[TABLE]
is the correct rate-distortion function for both stationary and nonstationary Gauss-Markov sources, where the subscript HA stands for the authors of [22]. For the general higher-order Gaussian autoregressive source, the correction term needed in (96) depends on the unstable roots of the characteristic polynomial of the source, see [22, Th. 2] for the details. In 2008, Gray and Hashimoto [19] showed the equivalence between in (96), obtained by taking a limit in (94), and Gray’s result in (53), obtained by taking a limit in (50).
The tool that allows one to take limits in (94) and (50) is the following theorem on the asymptotic eigenvalue distribution of the almost Toeplitz matrix , which is the (rescaled) inverse of the covariance matrix of . Denote
[TABLE]
and
[TABLE]
Gray [66, Th. 2.4] generalized the result of Grenander and Szegö [67, Th. in Sec. 5.2] on the asymptotic eigenvalue distribution of Toeplitz forms to that of matrices that are asymptotically equivalent to Toeplitz forms, see [66, Chap. 2.3] for the details. Define
[TABLE]
Theorem 9 (Gray [17, Eq. (19)], Hashimoto and Arimoto [22, Th. 1]).
For any continuous function over the interval
[TABLE]
the eigenvalues ’s of with in (52) satisfy
[TABLE]
where is defined in (55).
The eigenvalues ’s behave quite differently in the following three cases, leading to the subtle difference in the corresponding rate-distortion functions.
1. For the stationary case , it can be easily shown [48, Eq. (71)] that and all eigenvalues ’s lie in between and . Kolmogorov’s formula (95) is obtained by applying Theorem 9 to (94) using the function
[TABLE]
where is given by (54).
2. For the Wiener process (), closed-form expressions of ’s are given by Berger [15, Eq. (2)]. Those results imply that the smallest eigenvalue is of order , and thus . Using the same function as in (102), Berger obtained the rate-distortion functions for the Wiener process [15, Eq. 4]. (Footnote 1: To be precise, although the rate-distortion function for the Wiener process is correct in [15, Eq. 4], the proof there is not rigorous since in this case but is not continuous at as pointed out in [19, Eq. (23)]. Therefore, the limit leading to [15, Eq. 4] needs extra justifications.)
3. For the nonstationary case , we have , the smallest eigenvalue is of order and the other eigenvalues lie in between and . This behavior of eigenvalues was shown by Hashimoto and Arimoto [22, Lemma] for higher-order Gaussian autoregressive sources, and we will show a refined version for the Gauss-Markov source in Lemma 5 below. As pointed out in [22, Th. 1], an application of Theorem 9 using the function (102) fails to yield the correct rate-distortion function for nonstationary sources due to the discontinuity of at 0. Gray [17, Eq. (22)] and Hashimoto and Arimoto [22] circumvent this difficulty in two different ways, which lead to (53) and (96), respectively. Gray [17] applied Theorem 9 on (50) using the function
[TABLE]
which is indeed continuous at 0, while Hashimoto and Arimoto [22, Th. 2] still use the function but consider and separately:
[TABLE]
which in the limit yields (96) by plugging into (102).
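The eigenvalue behavior described for the nonstationary case can be checked numerically. The following sketch assumes that the matrix in (52) is the Gram matrix F^T F of the bidiagonal matrix F with ones on the diagonal and -a on the subdiagonal, so that F maps the source vector to the innovations; this form is an assumption, since (52) is not restated here, and the function names are hypothetical. The sketch prints the sum of the log-eigenvalues (zero when the eigenvalues multiply to one), the normalized decay rate of the smallest eigenvalue next to 2 log a, and the range of the remaining eigenvalues next to (a-1)^2 and (a+1)^2.

```python
import numpy as np


def gram_matrix(a, n):
    """F^T F with F = I - a * (subdiagonal shift); assumed form of the matrix in (52)."""
    f = np.eye(n) - a * np.eye(n, k=-1)
    return f.T @ f


def sorted_eigenvalues(a, n):
    """Eigenvalues of the Gram matrix in ascending order."""
    return np.linalg.eigvalsh(gram_matrix(a, n))


if __name__ == "__main__":
    a, n = 1.5, 20
    mu = sorted_eigenvalues(a, n)
    print("sum of log-eigenvalues:", np.sum(np.log(mu)))
    print("-(1/n) log(smallest):", -np.log(mu[0]) / n, " vs 2 log a:", 2 * np.log(a))
    print("other eigenvalues in [", mu[1], ",", mu[-1], "]")
    print("(a-1)^2 =", (a - 1) ** 2, " (a+1)^2 =", (a + 1) ** 2)
```

Moderate values of n are used on purpose: the smallest eigenvalue decays geometrically and quickly falls below what a dense double-precision eigensolver can resolve.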
V-B New Results on the Spectrum of the Covariance Matrix
The following result on the scaling of the eigenvalues ’s refines [22, Lemma]. Its proof is presented in Appendix B-D.
Lemma 5.
Fix . For any , the eigenvalues of (52) are bounded as
[TABLE]
where
[TABLE]
The smallest eigenvalue is bounded as
[TABLE]
where and are constants given by
[TABLE]
Remark 8.
The constant in (108) is positive, while in (109) can be positive, zero or negative, depending on the value of . Lemma 5 indicates that is a good approximation to . Using (105)–(106), we deduce that for ,
[TABLE]
Based on Lemma 5, we obtain a nonasymptotic version of Theorem 9, which is useful in the analysis of the dispersion, in particular, in deriving Proposition 1 in Appendix C-A below.
Theorem 10.
Fix any . For any bounded, -Lipschitz and nondecreasing function (or nonincreasing function) over the interval (100) and any , the eigenvalues ’s of (52) satisfy
[TABLE]
where is defined in (55) and is a constant that depends on and the maximum absolute value of .
The proof of Theorem 10 is in Appendix B-E.
VI Conclusion
In this paper, we obtain nonasymptotic (Theorem 1) and asymptotic (Theorem 2) bounds on the estimation error of the maximum likelihood estimator of the parameter of the nonstationary scalar Gauss-Markov process. Numerical simulations in Fig. 1 confirm the tightness of our estimation error bounds compared to previous works. As an application of the estimation error bound (Corollary 1), we find the dispersion for lossy compression of the nonstationary Gauss-Markov sources (Theorems 6 and 7). Future research directions include generalizing the error exponent bounds in this paper, applicable to identification of scalar dynamical systems, to vector systems, and finding the dispersion of the Wiener process.
Appendix A
A-A Proof of Theorem 1
Proof.
We present the proof of (24). The proof of (25) is similar and is omitted. For any , denote by the -algebra generated by . For any , , and , we define the following random variable
[TABLE]
By the Chernoff bound, we have
[TABLE]
To compute , we first consider the conditional expectation . Since is the only term in that does not belong to , we have
[TABLE]
where is the deterministic function of and defined in (18), and (115) follows from the moment generating function of . To obtain a recursion, we then consider the conditional expectation . Since and are the only two terms in that do not belong to , we use the relation and we complete squares in to obtain
[TABLE]
Furthermore, using the formula for the moment generating function of the noncentral chi-squared-distributed random variable
[TABLE]
with 1 degree of freedom, we obtain
[TABLE]
This is where our method diverges from Rantzer [10, Lem. 5], who chooses and bounds (due to Property A4 in Appendix A-B below) in (118). Instead, by conditioning on in (118) and repeating the above recursion for another times, we compute exactly using the sequence :
[TABLE]
If , then by the definition of the set we have . Therefore,
[TABLE]
∎
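The recursion in the proof above rests on the moment generating function of a noncentral chi-squared random variable with 1 degree of freedom. As a sanity check, the sketch below compares the standard closed form, E[exp(t (Z + m)^2)] = exp(m^2 t / (1 - 2t)) / sqrt(1 - 2t) for a standard Gaussian Z and t < 1/2, against a direct numerical integration. The function names and the particular values of t and m are hypothetical and chosen only for illustration.

```python
import numpy as np


def mgf_closed_form(t, m):
    """E[exp(t * (Z + m)^2)] for Z ~ N(0, 1); requires t < 1/2."""
    return float(np.exp(m * m * t / (1.0 - 2.0 * t)) / np.sqrt(1.0 - 2.0 * t))


def mgf_numeric(t, m, half_width=40.0, grid=400_001):
    """Riemann-sum approximation of the same expectation on [-half_width, half_width].

    The window is wide enough for the values used below; t close to 1/2
    would need a wider window because the integrand decays more slowly.
    """
    x = np.linspace(-half_width, half_width, grid)
    dx = x[1] - x[0]
    # Combine the exponents before exponentiating to avoid overflow.
    log_integrand = t * (x + m) ** 2 - 0.5 * x * x - 0.5 * np.log(2.0 * np.pi)
    return float(np.sum(np.exp(log_integrand)) * dx)


if __name__ == "__main__":
    for t, m in ((0.2, 0.7), (-0.5, 1.3), (0.4, -2.0)):
        print(t, m, mgf_closed_form(t, m), mgf_numeric(t, m))
```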
A-B Properties of the Sequence
We derive several important elementary properties of the sequences and . First, we consider . We find the two fixed points of the recursive relation (19) by solving the following quadratic equation in :
[TABLE]
Property A1
For any and , (121) has two roots , and . The two roots and are given by
[TABLE]
where denotes the discriminant of (121):
[TABLE]
Proof.
Note that the discriminant satisfies
[TABLE]
where we used . Then, (122) implies . ∎
Property A2
For and , the sequence is a geometric sequence with common ratio
[TABLE]
Furthermore,
[TABLE]
and it follows immediately that
[TABLE]
Proof.
Using the recursion (19) and the fact that and are the fixed points of (19), one can verify that is a geometric sequence with common ratio given by (126). The relation (127) is verified by direct computations using (122) and (123). ∎
Property A3
For any and , we have
[TABLE]
For , we have .
Proof.
The limit (130) follows from (127) and (128). Plugging into (18) yields , which implies by (19) that for . ∎
Property A4
For any , we have and decreases to geometrically. For , (130) still holds, but the convergence is not monotone: there exists an such that and increases to for ; and and increases to for .
Proof.
Due to (129), the monotonicity of depends on the signs of and . Note that by Property A1. Plugging into (121), we have
[TABLE]
Since for , we have by (18); we must also have by (131). Due to (128) and (129), this immediately implies that decreases to . Therefore, . For any , we have and . In fact, since , we have , which implies . Therefore, the conclusion follows from (129). ∎
Property A5
For any , the root in (122) is a decreasing function in .
Proof.
Direct computations using (122), (124) and the assumption that . ∎
A-C Properties of the Sequence
The sequence is analyzed similarly, although it is slightly more involved than . We only consider in the rest of this section. We find the two fixed points of the recursive relation (21) by solving the following quadratic equation in :
[TABLE]
Property B1
For , we have . For any and , (132) has two distinct roots , given by
[TABLE]
where the discriminant of (132) is
[TABLE]
Proof.
We verify that for any and . The reason that is not as obvious as (125) is due to the subtle difference between (124) and (135) in the negative sign of . Note that in (135) is a quadratic equation in and the discriminant of is given by
[TABLE]
Hence, in general, (135) has two roots (distinct when ) and could be positive or negative. However, an analysis of two cases and reveals that for any and . Therefore, (132) has two distinct roots given in (133) and (134) above. From (132), we have , which is negative for . Therefore, we have . ∎
Property B2
For any and , the sequence is a geometric sequence with common ratio
[TABLE]
In addition, for any and , we also have
[TABLE]
It follows immediately that
[TABLE]
Proof.
Similar to that of Property A2 above for . ∎
Property B3
For any and , we have , and decreases to geometrically:
[TABLE]
Proof.
This can be verified using (139) and (140) by noticing that and that for ,
[TABLE]
∎
Property B4
For any constant , the two thresholds and , defined in (37) and (38), respectively, satisfy the following Then,
1. When , the root in (133) is an increasing function in .
2. When , is a decreasing function in .
3. When , is a decreasing function in and an increasing function in , where is the unique solution in the interval to
[TABLE]
and is given by
[TABLE]
Proof.
Using (133) and (135), we compute the derivatives of as follows:
[TABLE]
To simplify notations, denote by the first derivative:
[TABLE]
From (145), we have
[TABLE]
and
[TABLE]
where is given by
[TABLE]
Since is an increasing function in due to (146), to determine the monotonicity of , we only need to consider the following three cases.
a) When , or equivalently, , we have for any . Hence, is an increasing function in .
b) When , we have for any . Hence, is a decreasing function in . We now show that is equivalent to . When , we have by (149) and . When , it is easy to see from (149) that is equivalent to . Hence, the equivalent condition for is .
c) When and , or equivalently, , solving (143) using (145) yields (144). Since is monotonically increasing due to (146), we know that given by (144) is the unique solution to (143) in , and for and for . ∎
A-D Proof of Lemma 1
Proof.
We first show the monotone decreasing property. The set contains all such that are all less than , while the set contains all such that are all less than , hence . The same argument yields the conclusion for .
We then prove that . Property A4 above in Appendix A-B implies that for any , we have . Hence for any . To show the other direction, it suffices to show that for any , there exists such that . Let be the integer defined in Property A4 above. Then, satisfies the following two conditions
[TABLE]
We show that , which would complete the proof. Due to , using (129) and (152), we have
[TABLE]
where (155) is obtained by plugging (122), (123) and (126) into (154). (Footnote 2: It is pretty amazing that (155) is in fact an equality.)
Finally, to show (31), for any , we have , hence . The other direction cannot hold since there are many counterexamples, e.g., , , and , where the sequence increases monotonically to . Hence, in this case, but . ∎
A-E Proof of Theorem 2
Proof.
Theorem 1 and Lemma 1 imply that for any ,
[TABLE]
Recall that depends on . By (130), the continuity of the function and the Cesàro mean convergence, we have
[TABLE]
where depends on via (122). Since (157) holds for any , using Property A5 in Appendix A-B above and supremizing (157) over , we obtain (33). Specifically, the supremum of (157) over is achieved in the limit of going to the right end point . Plugging into (122), we obtain the corresponding value for :
[TABLE]
which is further substituted into (157) to yield (33).
Similarly, to show (34), using Property B3 in Appendix A-C above, we have
[TABLE]
Then, by Property B4 in Appendix A-C above, the supremizer in (159) is given by
[TABLE]
where is given by (144). Plugging (160) into (159) yields (34).
Finally, the bound (35) follows from (33) and (34), since
[TABLE]
and
[TABLE]
∎
A-F Proof of Theorem 3
Proof.
For any sequence , the proof of Theorem 1 in Appendix A-A above remains valid with replaced by defined in (40) in Section III-C above. We present the proof of (42), and omit that of (43), which is similar. In this regime, for each , the proof of Lemma 1 implies that
[TABLE]
Then, in (24), we choose
[TABLE]
First, using (122)-(123), (126) and the choice (165), we can determine the asymptotic behavior of quantities involved in determining in (128) and (129) (with replaced by and replaced by ), summarized in TABLE I.
We make two remarks before proceeding further. It can be easily verified from (126) that the common ratio is a constant belonging to and
[TABLE]
Hence, for all large , is bounded by positive constants between 0 and 1. Besides, from (122), we have
[TABLE]
Second, from (128), (24) and the choice (165), we have
[TABLE]
where and in this regime depend on with order dependence given in TABLE I above. Using the inequality , we have
[TABLE]
Since due to (123), we can further bound as
[TABLE]
where in the last step we used the results in TABLE I. Due to the assumption (41) on and (167), we obtain (42). ∎
A-G Proof of Theorem 4
Proof.
We point out the proof changes in generalizing our results to the sub-Gaussian case. There are two changes to be made in the proof of Theorem 1 in Appendix A-A above: the equality from (114) to (115) is replaced by since is -sub-Gaussian; the equality in (118) is replaced by due to Lemma 2. The rest of the proof for Theorem 1 remains the same for the sub-Gaussian case. Since Lemma 1 and Theorems 2, 3 depend only on the properties of the sequences and , and (24)-(25) continue to hold for sub-Gaussian ’s, the proofs of Lemma 1 and Theorems 2, 3 remain exactly the same for the sub-Gaussian case. ∎
Appendix B
B-A Proof of Lemma 3
Proof.
In view of (62), we take the variances of both sides of (59) to obtain
[TABLE]
Note that , where is the water level given by (54). Applying Theorem 9 in Section V-A to (173) with the function
[TABLE]
which is continuous at , we obtain (65). ∎
B-B An Integral
We present the computation of an interesting integral that is useful in obtaining the value of .
Lemma 6.
For any constant , it holds that
[TABLE]
Proof.
Denote
[TABLE]
By Leibniz’s rule for differentiation under the integral sign, we have
[TABLE]
With the change of variable and partial-fraction decomposition, we obtain the closed-form solution to the integral in (178):
[TABLE]
It can be easily verified by directly taking derivatives that the right-hand side of (175) is indeed an antiderivative of (179). ∎
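Log-integrals of this kind are easy to check numerically. The exact parametrization of (175) is not reproduced here, so the sketch below verifies a closely related classical identity instead, stated as an assumption about the kind of result Lemma 6 provides: for a positive constant r different from 1, the average of log(1 - 2r cos w + r^2) over one period equals 2 log max(1, r). The function names are hypothetical.

```python
import numpy as np


def log_integral(r, grid=1 << 14):
    """Midpoint-rule value of (1/2pi) * int_{-pi}^{pi} log(1 - 2 r cos w + r^2) dw."""
    w = (np.arange(grid) + 0.5) * (2 * np.pi / grid) - np.pi
    return float(np.mean(np.log(1.0 - 2.0 * r * np.cos(w) + r * r)))


def closed_form(r):
    """Classical value 2 * log(max(1, r)) for r > 0 (a consequence of Jensen's formula)."""
    return float(2.0 * np.log(max(1.0, r)))


if __name__ == "__main__":
    for r in (0.5, 1.2, 2.0):
        print(r, log_integral(r), closed_form(r))
```

With r replaced by the parameter of a nonstationary source, half of this average equals the logarithm of the parameter, which is the quantity discussed in Remark 4.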
B-C Derivation of in (74)
We present two ways to obtain (74). The first one is to directly use (96) in Section V-A. For , we have in (95), then (74) immediately follows from (96). The second method relies on (53). For , observe from (53) that
[TABLE]
Then, computing the integral (180) using Lemma 6 in Appendix B-B yields (74).
B-D Proof of Lemma 5
Proof.
The bound (105) is obtained by partitioning (52) into its leading principal submatrix of order and then applying the Cauchy interlacing theorem to that partition, see [48, Lem. 1] for details. To obtain (107), observe from (93)
[TABLE]
Combining (181) and (105) yields
[TABLE]
where
[TABLE]
Plugging (106) into (183) and then taking the limit, we obtain
[TABLE]
where the last equality is due to Lemma 6 in Appendix B-B above. In the rest of the proof, we obtain the following refinement of (185): for any ,
[TABLE]
where and are the constants given by (108) and (109) in Lemma 5, respectively. Then, (107) will follow directly from (182), (186) and (187).
The proofs of the refinements (186) and (187) are similar, and both are based on the elementary relations between Riemann sums and their corresponding integrals. We present the proof of (186), and omit that of (187). Note that the function is an increasing function in , and its derivative is bounded above by for any fixed . Therefore, from (106) and (183), we have
[TABLE]
and (186) follows immediately. ∎
B-E Proof of Theorem 10
Proof.
From Lemma 5, we know that (recall (97) and (99)). Since is an even function, we have
[TABLE]
Denote the maximum absolute value of over the interval (100) by . It is easy to check that the function is -Lipschitz since is -Lipschitz and the derivative of is bounded by . For the following Riemann sum
[TABLE]
the Lipschitz property implies that
[TABLE]
For , rewrite (106) and (105) as
[TABLE]
Denote the sum in (111) as
[TABLE]
Then, separating from and applying (193), we have
[TABLE]
Therefore, there is a constant depending on and such that (111) holds. ∎
Appendix C
We gather the frequently used notations in this section as follows. For any given distortion threshold ,
- let be the water level corresponding to in the limiting reverse waterfilling (54);
- for each , let be the water level corresponding to in the -th order reverse waterfilling (51);
- let be the distortion associated to the water level in the -th order reverse waterfilling (51).
For clarity, we explicitly write down the relations between and , and between and :
[TABLE]
where ’s are given in (60). Note that and are constants independent of , while and are functions of , and there is no direct reverse waterfilling relation between and . Applying Theorem 9 in Section V-A above to the function , we have
[TABLE]
and
[TABLE]
Theorem 10 in Section V-B then implies that the speed of convergence in (199) and (200) is in the order of .
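The correspondence between a distortion and its water level in the -th order reverse waterfilling can be computed by bisection. The sketch below assumes the standard reverse waterfilling relations for independent Gaussian components [68, Th. 10.3.3]: the distortion is the average of min(water level, variance), and the rate is the average of max(0, 0.5 log(variance / water level)), measured here in nats. The function names and the example variances are hypothetical.

```python
import numpy as np


def water_level(variances, d, tol=1e-12):
    """Bisection for theta such that mean(min(theta, variances)) equals d."""
    variances = np.asarray(variances, dtype=float)
    if not 0.0 < d <= variances.mean():
        raise ValueError("d must lie in (0, mean(variances)]")
    lo, hi = 0.0, float(variances.max())
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        # The achieved distortion is nondecreasing in the water level.
        if np.minimum(mid, variances).mean() < d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rate(variances, theta):
    """Average of max(0, 0.5 * log(variance / theta)), in nats per sample."""
    variances = np.asarray(variances, dtype=float)
    return float(np.mean(np.maximum(0.0, 0.5 * np.log(variances / theta))))


if __name__ == "__main__":
    variances = [4.0, 1.0, 0.25]  # hypothetical component variances
    theta = water_level(variances, d=0.5)
    print("water level:", theta, " rate (nats):", rate(variances, theta))
```

Applying the same routine with the distortion fixed and the variances taken from the -th order decorrelation gives the -dependent water level, while the limiting water level in (54) requires the integral form instead.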
C-A Expectation and Variance of the -tilted Information
Proposition 1.
For any and , let be defined in (198) above. Then, the expectation and variance of the -tilted information at distortion level satisfy
[TABLE]
where is the rate-distortion function given in (53), is the informational dispersion given in (65) and , are positive constants.
Proof.
Using the same derivation as that of (59), one can obtain the following representation of the -tilted information at distortion level :
[TABLE]
where is the decorrelation of defined in (61). Note that the difference between (59) and (203) is that is replaced by . Using (62) and taking expectations and variances of both sides of (203), we arrive at
[TABLE]
Applying Theorem 10 in Section V-B to (204) with the function defined in (103) yields (201). Similarly, applying Theorem 10 to (205) with the function (174) yields (202). ∎
Proposition 1 is one of the key lemmas that will be used in both converse and achievability proofs. Proposition 1 and its proof are similar to those of [48, Eq. (95)–(96)]. The difference is that we apply Theorem 10, which is the nonstationary version of [48, Th. 4], to a different function in (204).
C-B Approximation of the -tilted Information
The following proposition gives a probabilistic characterization of the accuracy of approximating the -tilted information at distortion level using the -tilted information at distortion level .
Proposition 2.
For any , there exists a constant (depending on only) such that for all large enough
[TABLE]
where is defined in (198).
Proof.
The proof in [48, App. D-B] works for the nonstationary case as well, since it relies only on the convergences in (199) and (200) both being in the order of , which continues to hold for the nonstationary case. ∎
Remark 9.
The following high probability set is used in our converse and achievability proofs:
[TABLE]
Proposition 2 implies that for all large enough.
Appendix D Converse Proof
Proof of Theorem 6.
Using the general converse by Kostina and Verdú [26, Th. 7] and our established Propositions 1 and 2 in Appendix C, the proof is the same as the converse proof in the asymptotically stationary case [48, Th. 7, Eq. (97)–(109)]. For completeness, we give a proof sketch. Choosing and setting to be in [26, Th. 7], we know that any code for the Gauss-Markov source must satisfy
[TABLE]
By conditioning on the high probability set defined in Remark 9 above, we can further bound from below by
[TABLE]
From (203), we know that is a sum of independent random variables, whose mean and variance are bounded (within the order of due to Proposition 1) by the rate-distortion function and the informational dispersion . Choosing as in [48, Eq. (103)] and applying the Berry-Esseen theorem to , we obtain the converse in Theorem 6. ∎
Appendix E Achievability Proof
Proof of Theorem 7.
With our lossy AEP for the nonstationary Gauss-Markov source and Propositions 1 and 2, the proof is similar to the one for the stationary Gauss-Markov source in [48, Sec. V-C]. Here, we streamline the proof. As elucidated in Section IV-E above, the standard random coding argument [26, Cor. 11] implies that for any , there exists an code such that
[TABLE]
Choosing to be (the random variable that attains the minimum in (48) with there replaced by ), the bound (210) can be relaxed to
[TABLE]
To simplify notations, in the following, we denote by a constant that might be different from line to line. Given any constant , define as
[TABLE]
where is defined in (83) above. Note that for all large enough, we have . We choose as
[TABLE]
where is defined in (82) and is from Proposition 2 above. We also define the random variable as
[TABLE]
where is defined in (198) above. Note that all the randomness in is from , hence we will also use the notation to indicate one realization of the random variable . By bounding the deterministic part, that is, , of using Proposition 1, we know that with probability 1,
[TABLE]
where we use and to denote the expectation and variance of the -tilted information at distortion level . Define the set as
[TABLE]
Then, in view of (203), the -tilted information is a sum of independent random variables with bounded moments, and we apply the Berry-Esseen theorem to obtain
[TABLE]
We define one more set as
[TABLE]
Then, by the lossy AEP in Lemma 4 in Section IV-E above and Proposition 2, we have
[TABLE]
Finally, for any constant and large enough, we define as in (212) above and set as in (213). Then, there exists code such that
[TABLE]
where the last inequality is due to the definition of and (219). By further conditioning on , we conclude that there exists code such that
[TABLE]
Therefore, by the choice of in (213), the minimum achievable source coding rate must satisfy
[TABLE]
for all large enough, where is a universal constant and is a constant depending on . Here we change from to using a Taylor expansion. Therefore, Theorem 7 follows immediately from (224) with the choices of and given by (82) and (83), respectively, in the lossy AEP in Lemma 4 in Section IV-E above. We have in (78) since could be positive or negative. ∎
Appendix F Proof of Lossy AEP
F-A Notations
For the optimization problem in (86), the generalized tilted information defined in [26, Eq. (28)] in (a realization of ) is given by
[TABLE]
where and . For properties of the generalized tilted information, see [26, App. D]. For clarity, we list the notations used throughout this section:
1. denotes the decorrelation of defined in (61);
2. is the proxy random variable of defined in Definition 2 in Section IV-F above;
3. For that achieves in (48), is the random vector that achieves ;
4. We denote by the negative slope of (the same notation used in (58)):
[TABLE]
It is shown in [48, Lem. 5] that is related to the -th order water level in (51) by
[TABLE]
Given any source outcome , let be the decorrelation of . Define as the negative slope of w.r.t. :
[TABLE]
5. Comparing the definitions of -tilted information and the generalized tilted information, one can see that [48, Eq. (18)]
[TABLE]
6. Recalling (62) and applying the reverse waterfilling result [68, Th. 10.3.3], we know that the coordinates of are independent and satisfy
[TABLE]
where
[TABLE]
with given in (197).
F-B Parametric Representation of the Gaussian Conditional Relative Entropy Minimization
Various aspects of the optimization problem (86) have been discussed in [48, Sec. II-B]. In particular, let be the optimizer of , then we have
[TABLE]
where is in (48). Another useful result on the optimization problem (86) is the following: when and are independent Gaussian random vectors, the next theorem gives parametric characterizations for the optimizer and optimal value of (86).
Theorem 11.
Let be independent random variables with
[TABLE]
and be independent random variables with
[TABLE]
For any such that
[TABLE]
we have the following parametric representation for :
[TABLE]
[TABLE]
where is the parameter. Furthermore, equals the negative slope of w.r.t. :
[TABLE]
Similar results to Theorem 11 have appeared previously in the literature [43, 24, 38]. See [38, Example 1 and Th. 2] for the case of . For completeness, we present a proof.
Proof.
Fix any that satisfies (235), and let be such that (237) is satisfied. Note from (237) that is a strictly decreasing function in (unless for all ), hence such is unique. The upper bound on in (235) guarantees that . We first show the direction in (236). For , define the conditional distribution as
[TABLE]
We then define the joint distribution as
[TABLE]
Using (237), we can check that with such a choice of , the expected distortion between and equals . The details follow.
[TABLE]
where (243) is from the relation and (244) is due to (237). Therefore, the choice of in (239) and (240) is feasible for the optimization problem in defining . Hence,
[TABLE]
It is straightforward to verify that the Kullback-Leibler divergence between two Gaussian distributions and is given by
[TABLE]
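As a sanity check on the closed form referred to in (247), the following minimal Python sketch compares the textbook expression for the Kullback-Leibler divergence between two univariate Gaussians with a direct numerical integration. The function names, the integration grid, and the use of nats are assumptions made for illustration; the normalization in (247) may differ by a constant factor such as log e.

```python
import math

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form D(N(mu1, var1) || N(mu2, var2)) in nats."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def kl_gauss_numeric(mu1, var1, mu2, var2, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    def log_pdf(x, mu, var):
        return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

    sd1 = math.sqrt(var1)
    lo = mu1 - half_width * sd1          # integrate over mu1 +/- 12 standard deviations
    h = 2.0 * half_width * sd1 / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        lp = log_pdf(x, mu1, var1)
        total += math.exp(lp) * (lp - log_pdf(x, mu2, var2)) * h
    return total

closed = kl_gauss(1.0, 0.5, 0.0, 2.0)
numeric = kl_gauss_numeric(1.0, 0.5, 0.0, 2.0)
print(closed, numeric)
```

The two values should agree to many digits, and the divergence vanishes when both Gaussians coincide.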
Using (247) and (239), we see that (246) equals the right-hand side of (236). To prove the other direction, we use the Donsker-Varadhan representation of the Kullback-Leibler divergence [69, Th. 3.5]:
[TABLE]
where the supremum is over all functions from the sample space to such that both expectations in (248) are finite. Fix any such that . For any , in (248), we choose to be , to be and to be for any , then we have
[TABLE]
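The Donsker-Varadhan representation used in this step can be illustrated on a finite alphabet, where every expectation is a finite sum. The sketch below is only an illustration under that assumption (the proof applies the representation to Gaussian measures); the distributions `p`, `q` and the test function `g_other` are hypothetical. It shows that any function gives a lower bound on the divergence and that the log-likelihood ratio attains it.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_objective(g, p, q):
    """Donsker-Varadhan objective E_p[g] - log E_q[exp(g)]."""
    mean_p = sum(pi * gi for pi, gi in zip(p, g))
    log_mgf_q = math.log(sum(qi * math.exp(gi) for qi, gi in zip(q, g)))
    return mean_p - log_mgf_q

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

g_opt = [math.log(pi / qi) for pi, qi in zip(p, q)]  # log-likelihood ratio attains the supremum
g_other = [1.0, -0.5, 0.25]                          # an arbitrary function gives a lower bound

print(kl(p, q), dv_objective(g_opt, p, q), dv_objective(g_other, p, q))
```

In the proof, only the lower-bound property is needed: a single well-chosen function already yields the desired inequality.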
Taking expectations on both sides of (249) with respect to and then normalizing by , we have
[TABLE]
Using the formula for the moment generating function of noncentral chi-squared distributions, we can compute
[TABLE]
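For a single squared Gaussian with nonzero mean, that is, a noncentral chi-squared variable with one degree of freedom, the moment generating function has a simple closed form. The sketch below checks that form against numerical integration. The names `mgf_closed` and `mgf_numeric` and the parameter values are assumptions for illustration, and the exact argument at which the proof evaluates the function is not reproduced here.

```python
import math

def mgf_closed(t, mu):
    """E[exp(t * (Z + mu)**2)] for Z ~ N(0, 1); finite only for t < 1/2."""
    if t >= 0.5:
        raise ValueError("the moment generating function is finite only for t < 1/2")
    return math.exp(mu * mu * t / (1.0 - 2.0 * t)) / math.sqrt(1.0 - 2.0 * t)

def mgf_numeric(t, mu, half_width=12.0, steps=40_000):
    """Midpoint-rule integration of exp(t * (z + mu)**2) against the N(0, 1) density."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * h
        total += math.exp(t * (z + mu) ** 2 - 0.5 * z * z) * h
    return total / math.sqrt(2.0 * math.pi)

for t, mu in [(-0.7, 1.3), (0.2, 0.5)]:
    print(t, mu, mgf_closed(t, mu), mgf_numeric(t, mu))
```

A negative argument `t` corresponds to an exponentially decaying function of the squared error, which is the regime relevant to a tilting parameter that is positive.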
Plugging (251) into (250) and using , we conclude that is greater than or equal to the right-hand side of (236). Finally, (238) is obtained by taking the derivative of (236) w.r.t. , where we need to use the chain rule for derivatives since is a function of given by (237). ∎
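The parametric structure of Theorem 11 can be explored numerically. The following Python sketch is a hedged illustration, not a transcription of (236)–(238): it assumes squared-error distortion averaged over coordinates, zero-mean Gaussian coordinates with variances `alpha2[i]` (first sequence) and `beta2[i]` (second sequence, all strictly positive), and the standard exponentially tilted Gaussian kernel; all names are hypothetical and the rate is measured in nats. Under these assumptions it solves the distortion equation for the parameter by bisection and checks by finite differences that the parameter equals the negative slope of the rate with respect to the distortion.

```python
import math

def distortion(lam, alpha2, beta2):
    """Average expected squared error of the tilted Gaussian kernel (assumed form)."""
    terms = (a / (1 + 2 * lam * b) ** 2 + b / (1 + 2 * lam * b)
             for a, b in zip(alpha2, beta2))
    return sum(terms) / len(alpha2)

def rate(lam, alpha2, beta2):
    """Average conditional relative entropy of the same kernel, in nats."""
    total = 0.0
    for a, b in zip(alpha2, beta2):
        u = 1 + 2 * lam * b
        total += 0.5 * math.log(u) - lam * b / u + 2 * lam * lam * a * b / (u * u)
    return total / len(alpha2)

def solve_lambda(d, alpha2, beta2, iters=200):
    """Bisection for the unique lam > 0 with distortion(lam) == d."""
    if min(beta2) <= 0 or not 0 < d < distortion(0.0, alpha2, beta2):
        raise ValueError("need beta2 > 0 and 0 < d < distortion at lam = 0")
    lo, hi = 0.0, 1.0
    while distortion(hi, alpha2, beta2) > d:   # distortion is strictly decreasing in lam
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if distortion(mid, alpha2, beta2) > d:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha2 = [1.0, 2.0, 0.5]   # hypothetical variances of the first sequence
beta2 = [0.8, 1.5, 0.3]    # hypothetical variances of the second sequence
d, h = 0.6, 1e-5

lam = solve_lambda(d, alpha2, beta2)
rate_up = rate(solve_lambda(d + h, alpha2, beta2), alpha2, beta2)
rate_down = rate(solve_lambda(d - h, alpha2, beta2), alpha2, beta2)
slope = (rate_up - rate_down) / (2 * h)    # central finite difference of rate vs. distortion
print(lam, slope)
```

The bisection relies on the strict monotonicity of the distortion in the parameter, which is the same property the proof uses to argue uniqueness.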
Our next result states that for fixed ’s satisfying certain mild conditions, if we change the variances from ’s to ’s, then the perturbation on the corresponding ’s is controlled by the perturbation on ’s.
Theorem 12 (Variance perturbation).
Let ’s and ’s be in (233) and (234) above, respectively. For a fixed satisfying (235), let be given by (237). Suppose that ’s and ’s are such that both
[TABLE]
and
[TABLE]
are bounded above by positive constants. Let be independent random variables with
[TABLE]
Let be such that
[TABLE]
Then, there is a constant such that
[TABLE]
Proof.
We can view (237) as an equation of the form . Then, by the implicit function theorem, we know that there exists a unique continuously differentiable function such that
[TABLE]
and
[TABLE]
Hence,
[TABLE]
By the assumptions (252) and (253), we know that there exists a constant such that
[TABLE]
Hence, we have
[TABLE]
∎
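The implicit-function argument in this proof can be sketched numerically. The Python example below is an illustration under stated assumptions rather than the paper's equations: it assumes the distortion equation has the tilted-Gaussian form used in `distortion`, and that the perturbed variances are those of the first sequence (`alpha2`, a hypothetical name). It computes the gradient of the parameter with respect to those variances from the ratio of partial derivatives and compares the first-order prediction with the parameter obtained by re-solving the equation.

```python
def distortion(lam, alpha2, beta2):
    """Average expected squared error of the tilted Gaussian kernel (assumed form)."""
    return sum(a / (1 + 2 * lam * b) ** 2 + b / (1 + 2 * lam * b)
               for a, b in zip(alpha2, beta2)) / len(alpha2)

def solve_lambda(d, alpha2, beta2, iters=200):
    """Bisection for the unique lam > 0 with distortion(lam) == d (requires beta2 > 0)."""
    if min(beta2) <= 0 or not 0 < d < distortion(0.0, alpha2, beta2):
        raise ValueError("need beta2 > 0 and 0 < d < distortion at lam = 0")
    lo, hi = 0.0, 1.0
    while distortion(hi, alpha2, beta2) > d:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if distortion(mid, alpha2, beta2) > d else (lo, mid)
    return 0.5 * (lo + hi)

def dlam_dalpha2(lam, alpha2, beta2):
    """Implicit function theorem: d lam / d alpha2[j] = -(dF/d alpha2[j]) / (dF/d lam)."""
    n = len(alpha2)
    dF_dlam = sum(-4 * a * b / (1 + 2 * lam * b) ** 3 - 2 * b * b / (1 + 2 * lam * b) ** 2
                  for a, b in zip(alpha2, beta2)) / n
    return [-(1.0 / (n * (1 + 2 * lam * b) ** 2)) / dF_dlam for b in beta2]

alpha2 = [1.0, 2.0, 0.5]       # hypothetical original variances
beta2 = [0.8, 1.5, 0.3]        # hypothetical variances of the second sequence
delta = [1e-3, 2e-3, 1.5e-3]   # hypothetical small perturbation of alpha2
d = 0.6

lam = solve_lambda(d, alpha2, beta2)
grad = dlam_dalpha2(lam, alpha2, beta2)
lam_pred = lam + sum(g * e for g, e in zip(grad, delta))   # first-order prediction
lam_hat = solve_lambda(d, [a + e for a, e in zip(alpha2, delta)], beta2)
print(lam, lam_hat, lam_pred)
```

Because the gradient stays bounded whenever the partial derivative with respect to the parameter is bounded away from zero, the change in the parameter is at most a constant times the size of the variance perturbation, which is the type of statement the theorem makes.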
F-C Proof of Theorem 8
The proof is similar to [48, Th. 12]. We streamline the proof and point out the differences. We use the notation defined in Appendix F-A above.
Our Corollary 1 implies that for all large enough the condition (89) is violated with probability at most for a constant . This is much stronger than the bound in the stationary case [48, Th. 6].
In view of (62), the random variables for , are i.i.d. standard normal, and their -th moments are equal to . The Berry-Esseen theorem implies that the condition (90) is violated with probability at most . This is the same as in the stationary case [48, Eq. (279)–(280)].
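To illustrate the kind of guarantee the Berry-Esseen theorem provides, the sketch below computes the exact Kolmogorov distance between a standardized sum of i.i.d. summands and the standard Gaussian, and compares it with the Berry-Esseen bound. Bernoulli summands are an assumption chosen so that the distribution of the sum is available in closed form; they are not the squared Gaussian variables of the proof. The constant 0.56 is a published upper bound on the Berry-Esseen constant and is likewise an assumption of this example.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance_binomial(n, p):
    """sup_x |P(standardized Binomial(n, p) <= x) - Phi(x)|, attained at a jump point."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        phi = std_normal_cdf((k - mean) / sd)
        worst = max(worst, abs(cdf - phi))            # left limit at the jump
        cdf += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        worst = max(worst, abs(cdf - phi))            # value at the jump
    return worst

def berry_esseen_bound(n, p, c=0.56):
    """c * E|X - p|^3 / (sigma^3 * sqrt(n)) for Bernoulli(p) summands."""
    var = p * (1 - p)
    rho = p * (1 - p) * (p * p + (1 - p) ** 2)        # third absolute central moment
    return c * rho / (var ** 1.5 * math.sqrt(n))

for n in (50, 200):
    print(n, kolmogorov_distance_binomial(n, 0.3), berry_esseen_bound(n, 0.3))
```

Both quantities decay at the rate of one over the square root of the number of summands, which is the source of the polynomially small violation probabilities in the argument.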
We use the following procedure to show that the condition (91) is violated with probability at most :
- We approximate by another random variable that is easier to analyze.
- We show that (91) with replaced by holds with probability at least .
- We then control the difference between and .
To carry out the above program, we first give an expression for by applying [48, Lem. 4] (see also the proof of Theorem 11) on . Note that and are Gaussian random vectors with independent coordinates with variances given by (85) and (230), respectively. Then, [48, Lem. 4] implies that the optimizer for satisfies
[TABLE]
where the conditional distributions are Gaussian:
[TABLE]
where ’s are defined in (231), and is defined in (228). Then, using the definition of in (87) and (264), we obtain
[TABLE]
where . The random variable in the form of (265) is hard to analyze since we do not have a simple expression for . By replacing with , we define another random variable that turns out to be easier to analyze:
[TABLE]
Plugging (227) and (231) into (266), we obtain
[TABLE]
where is the -th order water level in (51) and . The random variable is much easier to analyze since ’s are i.i.d. standard normal random variables. Moreover, in view of (51), their expectations satisfy
[TABLE]
Since has bounded moments, from the Berry-Esseen theorem, we know that there exists a constant such that for all large enough
[TABLE]
where is in (88) above, and are positive constants. In the last step of the program, we control the difference between and . From (265)–(266), we have
[TABLE]
For , we have , and . This implies that the summands in (270) for are both of the order for any . For , the condition (89) and the variance perturbation result in Theorem 12 imply that every summand in (270) for is of the order of . Hence, (270) is of the order of . Finally, combining (269) and (270), we conclude that, conditioned on (89) and (90), the condition (91) is violated with probability at most . ∎
F-D Auxiliary Lemmas
Lemma 7 (Lower bound on the probability of distortion balls).
Fix . For any large enough and any defined in Definition 3 in Section IV-F above, and defined by
[TABLE]
for a constant specified in (299) below, it holds that
[TABLE]
where is a constant and is in Appendix F-A above.
The proof is in Appendix F-F.
Lemma 8.
Fix and . There exist constants and such that for all large enough,
[TABLE]
where and are defined in (226) and (228), respectively.
Proof.
The proof of Lemma 8 is the same as [48, Eq. (314)–(333)] except that we strengthen the right side of [48, Eq. (322)] to be for a constant due to Corollary 1. ∎
F-E Proof of Lemma 4
Using Lemmas 7 and 8 in Appendix F-D above, the proof of Lemma 4 is almost the same as that in the stationary case [48, Eq. (270)–(278)]. For completeness, we sketch the proof here. We weaken the bound [26, Lem. 1] by setting as and as to obtain that for any ,
[TABLE]
where in (228) depends on . Let denote the event inside the square brackets in (81). Then,
[TABLE]
where
- (276) is due to (274) and Lemma 7;
- From (276) to (277), we used the fact that for , can be bounded by
[TABLE]
where is a constant and is given by (54). The bound (279) is obtained by the same argument as that in the stationary case [48, Eq. (273)]; is chosen in (271) above; the constants ’s, in (82) are chosen as
[TABLE]
where is given in (299) below, and and are the constants in Lemmas 7 and 8, respectively.
- (278) is due to Lemma 8 and Theorem 8.
∎
F-F Proof of Lemma 7
Proof.
The proof is similar to the stationary case [48, Lem. 10]. We streamline the proof and point out the differences. Conditioned on , the random variable
[TABLE]
follows a noncentral chi-squared distribution with (at most) degrees of freedom, since it is shown in [48, Eq. (282) and Lem. 4] that conditioned on , the distribution of the random variable is given by
[TABLE]
where ’s are given in (231). Then, the conditional expectation is given by
[TABLE]
where is defined in (87) in Section IV-E above. In view of (284), (286) and (91), we expect that concentrates around conditioned on for . Note that the part of the proof of Theorem 8 related to (91) is different from the one in the stationary case; see Appendix F-C above for the details. To simplify notation, we denote the variances as
[TABLE]
Due to (285) and (91), we see that ’s have finite second- and third-order absolute moments. That is, we have
[TABLE]
for . Therefore, we can apply the Berry-Esseen theorem. Hence,
[TABLE]
where
- (291) follows from the Berry-Esseen theorem; is a constant, and
[TABLE]
is the cumulative distribution function of the standard Gaussian distribution;
- (292) is due to the mean value theorem and
[TABLE]
- In (292), satisfies
[TABLE]
By (91) and (289), we see that there is a constant such that
[TABLE]
Hence, as long as in (295) satisfies
[TABLE]
where is defined in (88), there exists a constant such that
[TABLE]
Let be a constant such that
[TABLE]
and choose as in (271), which satisfies (297). Then, plugging the bounds (289), (298), (299) and (271) into (292), we conclude that there exists a constant such that (292) is further bounded from below by . ∎
- [1] P. Tian and V. Kostina, “From parameter estimation to dispersion of nonstationary Gauss-Markov processes,” in Proceedings of 2019 IEEE International Symposium on Information Theory, Paris, France, July 2019, pp. 2044–2048.
- [2] H. B. Mann and A. Wald, “On the statistical treatment of linear stochastic difference equations,” Econometrica, Journal of the Econometric Society, vol. 11, no. 3, pp. 173–220, July 1943.
- [3] H. Rubin, “Consistency of maximum likelihood estimates in the explosive case,” Statistical Inference in Dynamic Economic Models, pp. 356–364, Jan. 1950.
- [4] J. S. White, “The limiting distribution of the serial correlation coefficient in the explosive case,” The Annals of Mathematical Statistics, pp. 1188–1197, Dec. 1958.
- [5] T. W. Anderson, “On asymptotic distributions of estimates of parameters of stochastic difference equations,” The Annals of Mathematical Statistics, pp. 676–687, Sep. 1959.
- [6] J. Rissanen and P. Caines, “The strong consistency of maximum likelihood estimators for ARMA processes,” The Annals of Statistics, pp. 297–315, Mar. 1979.
- [7] N. H. Chan and C.-Z. Wei, “Asymptotic inference for nearly nonstationary AR(1) processes,” The Annals of Statistics, pp. 1050–1063, Sep. 1987.
- [8] B. Bercu, F. Gamboa, and A. Rouault, “Large deviations for quadratic forms of stationary Gaussian processes,” Stochastic Processes and their Applications, vol. 71, no. 1, pp. 75–90, Oct. 1997.
