On the Information Dimension of Stochastic Processes
Bernhard C. Geiger, Tobias Koch

TL;DR
This paper extends the concept of information dimension to stochastic processes, links it to rate-distortion theory and spectral properties, and shows that, among stationary processes with a given spectral distribution function, the Gaussian process has the largest information dimension rate.
Contribution
It introduces the information dimension rate for stochastic processes, establishes its equivalence with the rate-distortion dimension, and characterizes it for Gaussian processes based on spectral properties.
Findings
Information dimension rate equals the rate-distortion dimension.
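Among all multivariate stationary processes with a given spectral distribution function (SDF), the Gaussian process has the largest information dimension rate.
The information dimension rate of a multivariate stationary Gaussian process is given by the average rank of the derivative of its SDF.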
The work of Bernhard C. Geiger has partly been funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund and by the German Ministry of Education and Research in the framework of an Alexander von Humboldt Professorship. The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Digital and Economic Affairs, and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG. The work of Tobias Koch has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 714161), from the 7th European Union Framework Programme under Grant 333680, from the Ministerio de Economía y Competitividad of Spain under Grants TEC2013-41718-R, RYC-2014-16332, and TEC2016-78434-C3-3-R (AEI/FEDER, EU), and from the Comunidad de Madrid under Grant S2103/ICE-2845. This work has been presented in part at the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, June 2017, and at the 2018 International Zurich Seminar on Information and Communication, Zurich, Switzerland, February 2018.
Bernhard C. Geiger is with Know-Center GmbH, 8010, Graz, Austria (e-mail: [email protected]).
Tobias Koch is with the Signal Theory and Communications Department, Universidad Carlos III de Madrid, 28911, Leganés, Spain and also with the Gregorio Marañón Health Research Institute, 28007, Madrid, Spain (e-mail: [email protected]).
Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
Abstract
In 1959, Rényi proposed the information dimension and the $d$-dimensional entropy to measure the information content of general random variables. This paper proposes a generalization of information dimension to stochastic processes by defining the information dimension rate as the entropy rate of the uniformly-quantized stochastic process divided by minus the logarithm of the quantizer step size $1/m$ in the limit as $m\to\infty$. It is demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the rate-distortion function $R(D)$ of the stochastic process divided by $-\log D$ in the limit as $D\downarrow 0$. It is further shown that, among all multivariate stationary processes with a given (matrix-valued) spectral distribution function (SDF), the Gaussian process has the largest information dimension rate, and that the information dimension rate of multivariate stationary Gaussian processes is given by the average rank of the derivative of the SDF. The presented results reveal that the fundamental limits of almost zero-distortion recovery via compressible signal pursuit and almost lossless analog compression are different in general.
Index Terms:
Entropy, Gaussian process, information dimension, rate-distortion dimension
I Introduction
In 1959, Rényi [1] proposed the information dimension and the $d$-dimensional entropy to measure the information content of general random variables (RVs). His idea was to quantize the RV $X$ by a uniform quantizer of step size $1/m$, and to then analyze the entropy of the quantized RV in the limit as $m$ tends to infinity. Assuming that the entropy $H(\lfloor X\rfloor)$ exists and the asymptotic expansion
$$H\left(\frac{\lfloor mX\rfloor}{m}\right) = d\log m + h + o(1) \tag{1}$$
holds for $m\to\infty$ (where $o(1)$ refers to remainder terms that vanish as $m\to\infty$), Rényi referred to $d$ as the information dimension and to $h$ as the $d$-dimensional entropy.
In recent years, it was shown that the information dimension is of relevance in various areas of information theory, including rate-distortion theory, almost lossless analog compression, and the analysis of interference channels. For example, Kawabata and Dembo [2] showed that the information dimension of an RV is equal to its rate-distortion dimension, defined as twice the rate-distortion function $R(D)$ divided by $-\log D$ in the limit as $D\downarrow 0$. Koch [3] demonstrated that the rate-distortion function of a source with infinite information dimension is infinite, and that for any source with finite information dimension and finite differential entropy the Shannon lower bound on the rate-distortion function is asymptotically tight. Wu and Verdú [4] analyzed linear encoding and Lipschitz decoding of discrete-time, independent and identically distributed (i.i.d.), stochastic processes and showed that the information dimension plays a fundamental role in achievability and converse results. Wu et al. [5] showed that the degrees of freedom of the $K$-user Gaussian interference channel can be characterized through the sum of information dimensions. Stotz and Bölcskei [6] generalized this result to vector interference channels.
Jalali and Poor [7] proposed a generalization of information dimension to stationary, discrete-time, stochastic processes by defining the information dimension of the stochastic process $\{X_t\}$ as the information dimension of $X_1^k$ divided by $k$ in the limit as $k\to\infty$. (Footnote 1: More precisely, Jalali and Poor define the information dimension of a stochastic process via a conditional entropy of the uniformly-quantized process. For stationary processes, their definition coincides with the above-mentioned definition [7, Lemma 3].) They showed that, for $\psi^*$-mixing processes, this information dimension is an achievable rate for universal compressed sensing with linear encoding and decoding via Lagrangian minimum entropy pursuit [7, Th. 8]. Rezagah et al. [8] showed that it coincides, under certain conditions, with the rate-distortion dimension, thus generalizing the result by Kawabata and Dembo [2] to stochastic processes. Other notions of information dimensions for stochastic processes are discussed in [9].
In this paper, we propose a different definition for the information dimension of stationary, discrete-time, stochastic processes. Specifically, let $\{[X_t]_m\}$ denote the stochastic process $\{X_t\}$ uniformly quantized with step size $1/m$. We define the information dimension rate $d(\{X_t\})$ of $\{X_t\}$ as the entropy rate of $\{[X_t]_m\}$ divided by $\log m$ in the limit as $m\to\infty$. For i.i.d. processes, our definition coincides with that of Jalali and Poor (and, in fact, evaluates to Rényi’s information dimension of the marginal RV $X_1$). More generally, we show that these definitions are equivalent for $\psi^*$-mixing processes. Nevertheless, there are stochastic processes for which the two definitions disagree. In particular, we derive a closed-form expression for the information dimension rate of stationary, multivariate, Gaussian processes with a given power spectral density (PSD), which specialized to the univariate case with PSD $S_X$ yields that $d(\{X_t\})$ is equal to the Lebesgue measure of the set of harmonics on $[-1/2,1/2]$ where $S_X$ is positive. For Gaussian processes with a bandlimited PSD, this implies that the information dimension rate is equal to twice the PSD’s bandwidth. This is consistent with the intuition that for such processes not all samples contain information. For example, if the bandwidth of the PSD is $1/4$, then we expect that half of the samples in $\{X_t\}$ can be expressed as linear combinations of the other samples and, hence, do not contain information. In contrast, we show that the information dimension of Jalali and Poor is one if $S_X$ is positive on any set with positive Lebesgue measure. In other words, it does not capture the dependence of the information dimension on the support size of $S_X$.
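By emulating the proof of [2, Lemma 3.2], we further show that, for any stochastic process $\{X_t\}$, the information dimension rate coincides with the rate-distortion dimension. This implies that the information dimension of Jalali and Poor coincides with the rate-distortion dimension only for those stochastic processes for which it also coincides with the information dimension rate.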
The rest of this paper is organized as follows. In Section II, we introduce the notation used in this paper. In Section III, we present preliminary results on the Rényi information dimension of RVs and random vectors. In Section IV, we present our definition of the information dimension rate of a stochastic process, discuss its connection to the rate-distortion dimension, and compute the information dimension rate of stationary Gaussian processes. In Section V, we review the information dimension proposed by Jalali and Poor and discuss its relation to the information dimension rate. In Section VI, we briefly discuss the operational meanings of information dimension in compressed sensing and zero-distortion recovery. Section VII concludes the paper with a discussion of the obtained results. Some of the proofs are deferred to the appendices.
II Notation and Preliminaries
We denote by , , and the set of real numbers, the set of complex numbers, and the set of integers, respectively. We further denote by and the set of nonnegative real numbers and the set of positive integers, respectively. We use a calligraphic font, such as , to denote other sets, and we denote complements as . The set difference between two sets and is written as .
The real and imaginary parts of a complex number are denoted as and , respectively, i.e., where . The complex conjugate of is denoted as .
We use uppercase letters to denote deterministic matrices and boldface lowercase letters to denote deterministic vectors. The transpose of a vector or matrix is denoted by , the Hermitian transpose by . The determinant and rank of a matrix are and , respectively. We denote by the identity matrix.
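We denote RVs by uppercase letters, e.g., $X$. For a finite or countably infinite collection of RVs we abbreviate $X_\ell^k \triangleq (X_\ell,\ldots,X_{k-1},X_k)$, $X_\ell^\infty \triangleq (X_\ell,X_{\ell+1},\ldots)$, and $X_{-\infty}^k \triangleq (\ldots,X_{k-1},X_k)$. (Footnote 2: If $\ell>k$, then $X_\ell^k$ is the empty set.) Random vectors are denoted by boldface uppercase letters, e.g., $\mathbf{X}$. Univariate discrete-time stochastic processes are denoted as $\{X_t, t\in\mathbb{Z}\}$ or, in short, as $\{X_t\}$. For $L$-variate stochastic processes we use the same notation but with $X_t$ replaced by $\mathbf{X}_t$. We call $\{X_{i,t}, t\in\mathbb{Z}\}$ a component process.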
We denote the probability measure of the RV by . If is absolutely continuous with respect to (w.r.t.) the Lebesgue measure, then we denote its probability density function (PDF) as . We denote by a Gaussian RV with the same mean and variance as , and we denote the corresponding Gaussian density as .
We define the quantization of a real-valued RV $X$ with precision $m$ as
$$[X]_m \triangleq \frac{\lfloor mX\rfloor}{m} \tag{2}$$
where $\lfloor a\rfloor$ is the largest integer less than or equal to $a$. Likewise, $\lceil a\rceil$ denotes the smallest integer greater than or equal to $a$. We denote by $[X_1^k]_m$ the component-wise quantization of $X_1^k$ (and similarly for other finite or countably infinite collections of RVs and random vectors). For complex RVs $Z$ with real part $R$ and imaginary part $I$, the quantization $[Z]_m$ is equal to $[R]_m + \imath [I]_m$. We define $\mathcal{C}(z_1^k,a)$ as the $k$-dimensional hypercube in $\mathbb{R}^k$, with its bottom-left corner at $z_1^k$ and with sidelength $a$. For example, we have that $[X_1^k]_m = z_1^k$ if $X_1^k \in \mathcal{C}(z_1^k, 1/m)$.
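The quantizer in (2) is simple enough to write out directly. The following is a minimal sketch, assuming the reading "floor of $mX$ divided by $m$" that the surrounding text describes; the function names `quantize`, `quantize_complex`, and `cell_corner` are illustrative and do not appear in the paper.

```python
import math


def quantize(x: float, m: int) -> float:
    """Uniform quantization with step size 1/m: floor(m * x) / m."""
    return math.floor(m * x) / m


def quantize_complex(z: complex, m: int) -> complex:
    """Quantize the real and imaginary parts separately."""
    return complex(quantize(z.real, m), quantize(z.imag, m))


def cell_corner(xs, m: int):
    """Bottom-left corner of the side-1/m hypercube that contains the point xs.

    By construction this is the component-wise quantization of xs.
    """
    return tuple(quantize(x, m) for x in xs)


if __name__ == "__main__":
    print(quantize(0.8, 4), quantize(-0.1, 4))
    print(quantize_complex(0.8 - 0.1j, 4))
    print(cell_corner((0.8, 0.3), 4))
```

Note that the floor operation rounds negative inputs away from zero, so every quantization cell is a half-open interval whose left endpoint is the quantized value.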
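Let $H(\cdot)$, $h(\cdot)$, and $D(\cdot\|\cdot)$ denote entropy, differential entropy, and relative entropy, respectively, and let $I(\cdot;\cdot)$ denote mutual information [10]. We take logarithms to base $e$, so mutual informations and entropies have dimension nats. The entropy rate of a discrete-valued, stationary, $L$-variate process $\{\mathbf{X}_t\}$ is [10, Sec. 4.2]
$$H'(\{\mathbf{X}_t\}) \triangleq \lim_{k\to\infty} \frac{H(\mathbf{X}_1^k)}{k}. \tag{3}$$
Note that the stationarity of $\{\mathbf{X}_t\}$ guarantees that the limit in (3) exists and is equal to [10, Th. 4.2.1]
$$H'(\{\mathbf{X}_t\}) = \lim_{k\to\infty} H(\mathbf{X}_k \mid \mathbf{X}_1^{k-1}). \tag{4}$$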
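We say that a stationary process $\{X_t\}$ is $\psi^*$-mixing if
$$\sup \frac{P(\mathcal{A}\cap\mathcal{B})}{P(\mathcal{A})P(\mathcal{B})} \to 1 \quad \text{as } g\to\infty \tag{5}$$
where the supremum is over all $\mathcal{A}\in\mathcal{F}_{-\infty}^{j}$ and $\mathcal{B}\in\mathcal{F}_{j+g}^{\infty}$ satisfying $P(\mathcal{A})P(\mathcal{B})>0$, and where $\mathcal{F}_{-\infty}^{j}$ and $\mathcal{F}_{j+g}^{\infty}$ are the $\sigma$-fields generated by $X_{-\infty}^{j}$ and $X_{j+g}^{\infty}$, respectively. The $\psi^*$-mixing property implies that $\{X_t\}$ is information regular, i.e., [11, pp. 111-112]
$$\lim_{n\to\infty} I(X_{-\infty}^{0}; X_{n}^{\infty}) = 0. \tag{6}$$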
III Rényi Information Dimension
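The Rényi information dimension of a collection of RVs $X_1^\ell$ is defined as [1]
$$d(X_1^\ell) \triangleq \lim_{m\to\infty} \frac{H([X_1^\ell]_m)}{\log m}. \tag{7}$$
When the limit does not exist, we say that the information dimension does not exist. In this case, one may replace the limit either by the limit superior or by the limit inferior (denoted as $\overline{\lim}$ and $\underline{\lim}$, respectively)
$$\overline{d}(X_1^\ell) \triangleq \overline{\lim_{m\to\infty}}\, \frac{H([X_1^\ell]_m)}{\log m}, \qquad \underline{d}(X_1^\ell) \triangleq \underline{\lim_{m\to\infty}}\, \frac{H([X_1^\ell]_m)}{\log m} \tag{8}$$
and call $\overline{d}(X_1^\ell)$ and $\underline{d}(X_1^\ell)$ the upper and lower information dimension of $X_1^\ell$, respectively. Clearly,
$$\overline{d}(X_1^\ell) = \underline{d}(X_1^\ell) = d(X_1^\ell) \tag{9}$$
if the limit in (7) exists.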
We shall follow this notation throughout the paper. Specifically, when reporting results in connection with limits, an overline indicates that the quantity in the brackets has been computed using the limit superior, and an underline indicates that it has been computed using the limit inferior. Both an overline and an underline together indicate that a result holds irrespective of whether the limit superior or the limit inferior is taken. We write no lines if the limit exists.
Definition 1
For two RVs $X$ and $Y$ with joint probability measure $P_{X,Y}$, the conditional information dimension is defined as
$$d(X|Y) \triangleq \lim_{m\to\infty} \frac{H([X]_m \mid Y)}{\log m} \tag{10}$$
provided the limit exists. If the limit does not exist, then we define the upper and lower conditional information dimension $\overline{d}(X|Y)$ and $\underline{d}(X|Y)$ by replacing the limit with the limit superior and the limit inferior, respectively.
III-A Properties of Information Dimension
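The information dimension of a collection is bounded by the number of RVs in the collection, provided that the integer part of this collection has finite entropy.
Lemma 1 ([1, eq. (7)], [4, Prop. 1])
Let $X_1^\ell$ be a collection of real-valued RVs. If $H(\lfloor X_1^\ell\rfloor)<\infty$, then
$$0 \le \underline{d}(X_1^\ell) \le \overline{d}(X_1^\ell) \le \ell. \tag{11}$$
If $H(\lfloor X_1^\ell\rfloor)=\infty$, then $d(X_1^\ell)=\infty$.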
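Trivially, if $X_1^\ell$ is a collection of discrete RVs satisfying $H(X_1^\ell)<\infty$, then $d(X_1^\ell)=0$. Moreover, if the joint distribution of $X_1^\ell$ is absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^\ell$ and if $H(\lfloor X_1^\ell\rfloor)<\infty$, then $d(X_1^\ell)=\ell$ [1, Th. 4]. More generally, Rényi claims that the information dimension of $X_1^\ell$ equals $k$ if the joint distribution of $X_1^\ell$ is absolutely continuous on some sufficiently smooth $k$-dimensional manifold in $\mathbb{R}^\ell$ [1, p. 209]. Furthermore, if $X$ is a real-valued RV satisfying $H(\lfloor X\rfloor)<\infty$ and with probability measure
$$P_X = (1-\rho) P_X^{(\mathrm{d})} + \rho P_X^{(\mathrm{ac})} \tag{12}$$
where $P_X^{(\mathrm{d})}$ is a discrete measure, $P_X^{(\mathrm{ac})}$ is an absolutely-continuous measure, and $0\le\rho\le 1$, then [1, Th. 3]
$$d(X) = \rho. \tag{13}$$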
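Rényi's mixture result [1, Th. 3] states that the information dimension of a discrete-continuous mixture equals the weight of its absolutely-continuous part. It can be checked numerically on a toy case. The sketch below assumes a specific, hypothetical mixture (an atom at 0 with probability $1-\rho$ and a uniform distribution on $[0,1)$ with probability $\rho$), for which the entropy of the quantized RV is available in closed form. Because the ratio $H([X]_m)/\log m$ converges only at rate $1/\log m$, the sketch estimates the dimension from the slope of the entropy between two quantizer resolutions; the function names are illustrative.

```python
import math


def quantized_entropy(rho: float, m: int) -> float:
    """Exact entropy (nats) of [X]_m for X = 0 w.p. 1-rho, Uniform[0,1) w.p. rho."""
    # Cell [0, 1/m) holds the atom plus its share of the uniform part.
    p_first = (1.0 - rho) + rho / m
    # Each of the remaining m-1 cells only receives mass from the uniform part.
    p_other = rho / m
    h = 0.0
    if p_first > 0.0:
        h -= p_first * math.log(p_first)
    if p_other > 0.0:
        h -= (m - 1) * p_other * math.log(p_other)
    return h


def dimension_slope(rho: float, m_lo: int = 2**10, m_hi: int = 2**20) -> float:
    """Slope of H([X]_m) versus log m, a finite-m proxy for the information dimension."""
    dh = quantized_entropy(rho, m_hi) - quantized_entropy(rho, m_lo)
    return dh / (math.log(m_hi) - math.log(m_lo))


if __name__ == "__main__":
    for rho in (0.0, 0.25, 0.5, 1.0):
        print(rho, dimension_slope(rho))
```

The slope removes the constant (the $d$-dimensional entropy term of the expansion), which is why it approaches the mixing weight much faster than the raw ratio does.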
Two well-known properties of entropy are that it is reduced by conditioning [10, Th. 2.6.5] and that it obeys a chain rule. Furthermore, the conditional entropy of $X$ given $Y$ can be computed by first calculating the entropy conditioned on the event that $Y=y$, and by then averaging over $y$. The corresponding results for information dimension are presented in the following three lemmas.
Lemma 2
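Suppose that $H(\lfloor X\rfloor)<\infty$. Then, we have for any two RVs $X$ and $Y$
$$\int \underline{d}(X|Y=y)\,\mathrm{d}P_Y(y) \le \underline{d}(X|Y) \le \overline{d}(X|Y) \le \int \overline{d}(X|Y=y)\,\mathrm{d}P_Y(y). \tag{14}$$
Consequently, if $d(X|Y=y)$ exists $P_Y$-almost surely, then the limit in (10) exists and
$$d(X|Y) = \int d(X|Y=y)\,\mathrm{d}P_Y(y). \tag{15}$$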
Proof:
See Appendix A-A. ∎
Lemma 3
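For any two RVs $X$ and $Y$, we have
$$\overline{\underline{d}}(X|Y) \le \overline{\underline{d}}(X) \tag{16}$$
with equality if $X$ and $Y$ are independent.
Proof:
Since conditioning reduces entropy, we have $H([X]_m \mid Y) \le H([X]_m)$, with equality if $X$ and $Y$ are independent. The lemma follows by dividing both sides of the inequality by $\log m$ and taking limits as $m\to\infty$. ∎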
Lemma 4
For the collection of RVs , we have
[TABLE]
Proof:
See Appendix A-B. ∎
The left-most inequality in (17) holds with equality if all information dimensions exist and the RVs are independent. There are examples where the right-most inequality is strict.
Example 1
Let be uniformly distributed on and let , where is bijective. Such a function can be constructed (see also the discussion in [4, Section IV.B]). Since is bijective, we have . Moreover, since is uniformly distributed on , we have . Finally, we have by Lemma 1. From Lemma 4, we get
[TABLE]
However, we also have
[TABLE]
It follows that
[TABLE]
so the chain rule holds with strict inequality.
The above example not only demonstrates that the chain rule for information dimension may hold with strict inequality, it also shows that the order in which the chain rule is expanded can be crucial.
III-B Information Dimension of Finite-Variance RVs
For RVs that have a finite variance, the upper bound on presented in Lemma 1 can be tightened. To this end, we introduce further notation. We denote the covariance matrix of the vector by . Furthermore, the cross-covariance matrix between and is denoted by , and the covariance matrix of the vector is denoted by . Clearly,
[TABLE]
One can show that the information dimension of a collection of real-valued RVs cannot exceed the rank of its covariance matrix, i.e.,
[TABLE]
This agrees with the intuition that linearly-dependent components of do not contribute to the information dimension. One can further show that collections of Gaussian RVs achieve this upper bound with equality. Thus, among all RVs with a given covariance structure, the Gaussian RV maximizes information dimension. These results follow directly from the more general results for stochastic processes (Theorem 10) in Section IV.
The next theorem evaluates the conditional information dimension of given for jointly Gaussian RVs .
Theorem 5
Let be a collection of real-valued, jointly Gaussian RVs. The conditional information dimension of given is equal to
[TABLE]
where is the generalized Schur complement of in .
Proof:
See Appendix A-C. ∎
Theorem 5 implies that the chain rule in Lemma 4 holds with equality for Gaussian RVs. Indeed, if is a collection of real-valued, jointly Gaussian RVs, then we have and . Moreover, by Theorem 5, equals the rank of the generalized Schur complement of in , denoted by . Since the rank of can be written as the sum of the ranks of and [12, 7.1.P28], the claim follows.
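The rank argument behind this remark can be checked numerically. The sketch below assumes a small hand-built covariance matrix with a singular $Y$-block, so that the generalized Schur complement really needs a pseudoinverse. The block ordering ($X$ first, $Y$ second), the function name `schur_rank_split`, and the tolerance are conventions of this sketch, not of the paper.

```python
import numpy as np


def schur_rank_split(cov, n_x: int, tol: float = 1e-9):
    """Return (rank of full covariance, rank of Y-block, rank of generalized Schur complement).

    The first n_x coordinates are treated as X, the rest as Y. The generalized
    Schur complement uses the Moore-Penrose pseudoinverse of the Y-block.
    """
    cov = np.asarray(cov, dtype=float)
    c_x = cov[:n_x, :n_x]
    c_xy = cov[:n_x, n_x:]
    c_y = cov[n_x:, n_x:]
    schur = c_x - c_xy @ np.linalg.pinv(c_y) @ c_xy.T

    def rank(a):
        return int(np.linalg.matrix_rank(a, tol=tol))

    return rank(cov), rank(c_y), rank(schur)


if __name__ == "__main__":
    # (X1, X2, Y1, Y2) = (W1, W2, W1, W1) for independent standard Gaussians W1, W2.
    mixing = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
    cov = mixing @ mixing.T
    total, rank_y, rank_schur = schur_rank_split(cov, n_x=2)
    # rank_schur plays the role of the conditional information dimension of X given Y.
    print(total, rank_y, rank_schur)
```

In this toy case $Y$ reveals $W_1$ only, so one degree of freedom of $X$ remains after conditioning, and the ranks of the $Y$-block and of the Schur complement add up to the rank of the full covariance matrix.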
IV The Information Dimension Rate
We next propose the information dimension rate as a generalization of information dimension to stochastic processes. We define the information dimension rate for general (possibly non-stationary) processes. However, for the sake of simplicity, most of our results will then be presented for stationary processes.
Definition 2
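The information dimension rate of the $L$-variate stochastic process $\{\mathbf{X}_t\}$ is defined as
$$d(\{\mathbf{X}_t\}) \triangleq \lim_{m\to\infty} \frac{H'(\{[\mathbf{X}_t]_m\})}{\log m} = \lim_{m\to\infty}\lim_{k\to\infty} \frac{H([\mathbf{X}_1^k]_m)}{k\log m} \tag{24}$$
provided the limits exist. If the limits do not exist, then we define the upper and lower information dimension rate $\overline{d}(\{\mathbf{X}_t\})$ and $\underline{d}(\{\mathbf{X}_t\})$ by replacing the limits with the limits superior and limits inferior, respectively.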
IV-A Properties of the Information Dimension Rate
The information dimension rate satisfies properties similar to those presented in Lemma 1 for the information dimension. We summarize them in the following lemma.
Lemma 6
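Let $\{\mathbf{X}_t\}$ be a stationary, $L$-variate, real-valued process. If $H(\lfloor\mathbf{X}_1\rfloor)<\infty$, then
$$0 \le \overline{\underline{d}}(\{\mathbf{X}_t\}) \le \overline{\underline{d}}(\mathbf{X}_1) \le L. \tag{25}$$
If $H(\lfloor\mathbf{X}_1\rfloor)=\infty$, then $d(\{\mathbf{X}_t\})=\infty$.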
Proof:
Suppose first that . Then, the rightmost inequality in (25) follows from (11). The left-most inequality follows from the nonnegativity of entropy. Finally, the center inequality follows since conditioning reduces entropy, hence .
Now suppose that . Since is a function of for every , we have
[TABLE]
This implies that , and the claim that follows from Definition 2. ∎
The next result discusses how Lipschitz transformations affect the information dimension rate.
Lemma 7
Let be a stationary, -variate, real-valued process, and let be a sequence of Lipschitz functions from to with Lipschitz constants satisfying
[TABLE]
Then,
[TABLE]
Proof:
See Appendix B. ∎
If is a sequence of bi-Lipschitz functions with uniformly-bounded Lipschitz constants, then Lemma 7 implies that . As a corollary, we thus obtain that the information dimension rate is invariant under scaling and translation. More generally, it follows that, if and are sequences of -variate vectors and -dimensional matrices, the latter satisfying and for some induced matrix norm , then
[TABLE]
Since the information dimension rate of an i.i.d. process equals the information dimension of its marginal RVs, we further recover the well-known result that the information dimension of collections of RVs is invariant under scaling and translation [13, Lemma 3].
The next lemma shows that the information dimension rate of a collection of stochastic processes is unaffected by those that have zero information dimension rate.
Lemma 8
Let and be two jointly stationary, -variate, real-valued processes, and assume that . Then,
[TABLE]
Moreover, if is discrete with , then we further have
[TABLE]
Proof:
See Appendix C. ∎
Inter alia, Lemma 8 can be used to compute the information dimension rate of a countable mixture of stochastic processes. For example, specialized to i.i.d. processes, (32) together with Lemma 2 recovers (13) by choosing , , and .
IV-B Information Dimension Rate vs. Rate-Distortion Dimension
Let denote the rate-distortion function of the source , i.e.,
[TABLE]
where the infimum is over all conditional distributions of given such that
[TABLE]
and where denotes the Euclidean norm. We have the following definition.
Definition 3
The rate-distortion dimension of the -variate stochastic process is defined as
[TABLE]
provided the limits over and exist. (When the process is stationary, the limit over always exists [14, Th. 9.8.1].) If the limits do not exist, then we define the upper and lower rate-distortion dimension and by replacing the limits with the limits superior and limits inferior, respectively.
Intuitively, the rate-distortion function
[TABLE]
corresponds to the minimum number of nats per source symbol required to compress a stationary and ergodic source $\{\mathbf{X}_t\}$ with a vector quantizer of average per-symbol distortion not exceeding $D$ [14, Sec. 9.8]. The rate-distortion dimension characterizes the growth of $R(D)$ as $D$ vanishes. For example, for an i.i.d. Gaussian source with variance $\sigma^2$, we have [10, Th. 13.3.2]
$$R(D) = \frac{1}{2}\log\left(\frac{\sigma^2}{D}\right) \mathbb{1}\{D<\sigma^2\} \tag{37}$$
where $\mathbb{1}\{\cdot\}$ denotes the indicator function. Observe that in this case $R(D)$ grows like $-\frac{1}{2}\log D$ as $D\downarrow 0$. The rate-distortion dimension corresponds to twice the pre-log factor of the rate-distortion function $R(D)$, which in this case is $1$.
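The pre-log behaviour of the i.i.d. Gaussian example can be made concrete with a few lines of code. This is a minimal sketch that evaluates the standard Gaussian rate-distortion function (half the logarithm of variance over distortion, and zero once the distortion reaches the variance) and the finite-distortion ratio whose limit is the rate-distortion dimension. The function names and the chosen variance are illustrative.

```python
import math


def gaussian_rate_distortion(variance: float, distortion: float) -> float:
    """Rate-distortion function (nats) of an i.i.d. Gaussian source, squared-error distortion."""
    if distortion >= variance:
        return 0.0
    return 0.5 * math.log(variance / distortion)


def rd_dimension_ratio(variance: float, distortion: float) -> float:
    """Finite-distortion proxy 2 R(D) / (-log D) for the rate-distortion dimension."""
    return 2.0 * gaussian_rate_distortion(variance, distortion) / (-math.log(distortion))


if __name__ == "__main__":
    for distortion in (1e-2, 1e-20, 1e-200):
        print(distortion, rd_dimension_ratio(4.0, distortion))
```

The ratio equals one plus the logarithm of the variance divided by $-\log D$, so it approaches 1 only logarithmically in the distortion. The variance affects the speed of convergence but not the limit.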
In contrast, the information dimension rate characterizes the growth of the entropy rate of the quantized process $\{[\mathbf{X}_t]_m\}$ as $m$ increases. This entropy rate, in turn, corresponds essentially to the number of nats per source symbol required to compress each symbol of a stationary and ergodic source with a uniform quantizer of step size $1/m$. Since a symbol-wise, uniform quantizer cannot outperform the best vector quantizer, it follows that the information dimension rate is lower-bounded by the rate-distortion dimension.
For RVs, Kawabata and Dembo showed that the rate-distortion dimension is actually equal to its information dimension [2, Prop. 3.3]. Thus, a symbol-wise, uniform quantizer achieves the same information dimension as the best vector quantizer. The following theorem generalizes this result to stochastic processes.
Theorem 9
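For any $L$-variate, real-valued process $\{\mathbf{X}_t\}$,
$$\overline{\underline{\dim}}_R(\{\mathbf{X}_t\}) = \overline{\underline{d}}(\{\mathbf{X}_t\}). \tag{38}$$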
Proof:
See Appendix D. ∎
Note that Theorem 9 also holds for non-stationary processes.
IV-C Information Dimension Rate of Finite-Variance Processes
Let be a stationary, -variate, real-valued process with mean vector and (matrix-valued) spectral distribution function (SDF) . Thus, is a bounded, non-decreasing, and right-continuous function on such that the autocovariance function
[TABLE]
is given by the Lebesgue-Stieltjes integral [15, (7.3), p. 141]
[TABLE]
It follows that the -th element of is the cross SDF of the component processes and , i.e.,
[TABLE]
where
[TABLE]
denotes the cross-covariance function. It further follows that the diagonal elements of are real and non-decreasing, and they satisfy , where denotes the standard deviation of . It can be shown that has a derivative almost everywhere, which has positive semi-definite, Hermitian values [15, (7.4), p. 141]. We shall denote the derivative of by . When is absolutely continuous w.r.t. the Lebesgue measure, its derivative coincides with the PSD of .
The following theorem shows that, among all processes of a given SDF, the Gaussian process maximizes the information dimension rate. It further characterizes the information dimension rate of such processes in terms of the SDF.
Theorem 10
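Let $\{\mathbf{X}_t\}$ be a stationary, $L$-variate, real-valued process with SDF $F_{\mathbf{X}}$. Then,
$$\overline{d}(\{\mathbf{X}_t\}) \le \int_{-1/2}^{1/2} \mathrm{rank}\big(F'_{\mathbf{X}}(\theta)\big)\,\mathrm{d}\theta \tag{43}$$
with equality if $\{\mathbf{X}_t\}$ is Gaussian.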
Proof:
See Appendix E. ∎
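For a Gaussian process, Theorem 10 reduces the information dimension rate to the average rank of the derivative of the SDF. The sketch below evaluates that average numerically. It assumes that frequencies are normalized to $[-1/2, 1/2)$ and that the derivative of the SDF is available as a Python callable returning a matrix. The midpoint grid, the rank tolerance, and the two example spectra are assumptions of this sketch, not taken from the paper.

```python
import numpy as np


def average_rank(spectral_derivative, n_grid: int = 1000, tol: float = 1e-10) -> float:
    """Approximate the integral of rank(F'(theta)) over [-1/2, 1/2) on a midpoint grid."""
    thetas = -0.5 + (np.arange(n_grid) + 0.5) / n_grid
    ranks = [
        np.linalg.matrix_rank(np.atleast_2d(spectral_derivative(theta)), tol=tol)
        for theta in thetas
    ]
    # The interval has length one, so the grid average approximates the integral.
    return float(np.mean(ranks))


def bandlimited_psd(theta: float) -> np.ndarray:
    """Univariate PSD that is flat on |theta| < 1/4 and zero elsewhere."""
    return np.array([[1.0 if abs(theta) < 0.25 else 0.0]])


def two_variate_psd(theta: float) -> np.ndarray:
    """Two uncorrelated components: one bandlimited to |theta| < 1/4, one white."""
    return np.diag([1.0 if abs(theta) < 0.25 else 0.0, 1.0])


if __name__ == "__main__":
    print(average_rank(bandlimited_psd))
    print(average_rank(two_variate_psd))
```

The univariate example has bandwidth 1/4 and an average rank of one half, which matches the "twice the bandwidth" intuition from the introduction. The bivariate example adds a full-band component and therefore one further unit.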
In order to prove Theorem 10, we invoke Bussgang’s theorem to obtain an expression for the SDF of a quantized Gaussian process as a function of the SDF of the original process . Since we believe that this result is interesting on its own, we present it below.
Lemma 11
Let be a stationary, -variate, real-valued, Gaussian process with mean vector and SDF . Then, the -th entry of the SDF of satisfies
[TABLE]
where and
[TABLE]
(In (45), and denote the mean and standard deviation of .) For every , we have
[TABLE]
and
[TABLE]
Moreover, if all component processes have zero mean and unit variance, then and
[TABLE]
Proof:
See Appendix F. ∎
As a corollary to Theorem 10, we obtain that for univariate, stationary, Gaussian processes with PSD $S_X$, the information dimension rate is equal to the Lebesgue measure of the set of harmonics on $[-1/2,1/2]$ where $S_X$ is positive, i.e.,
$$d(\{X_t\}) = \lambda\big(\{\theta\colon S_X(\theta)>0\}\big) \tag{49}$$
where $\lambda(\cdot)$ denotes the Lebesgue measure. As pointed out by one of the reviewers, (49) can also be obtained directly by using the equivalence of information dimension rate and rate-distortion dimension (Theorem 9) together with the parametric representation of the rate-distortion function [14, eqs. (9.7.42) & (9.7.43)]
[TABLE]
for , where . Indeed, when is zero, we have since in this case the process has zero variance and, hence, the entropy rate of the quantized process is zero, too. When is strictly positive, the distortion can be bounded as
[TABLE]
It follows by the continuity of the Lebesgue measure that as . Consequently, if, and only if, and the rate-distortion dimension can be written as
[TABLE]
By the continuity of the Lebesgue measure, for every there exists a such that . Since , it follows that
[TABLE]
Thus, for every ,
[TABLE]
Dividing both sides of (55) by , and letting first and then tend to zero, we obtain that the second term on the RHS of (53) is nonnegative. However, by assumption the process has finite variance, so its PSD is integrable over . Consequently, using the inequality and the nonnegativity of , we obtain that
[TABLE]
Dividing both sides of (56) by , and letting tend to zero, we obtain that the second term on the RHS of (53) is also nonpositive. We conclude that this term is zero, so (49) follows from (53) and Theorem 9.
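The argument above can be illustrated numerically. The sketch below assumes the standard reverse water-filling form of the parametric representation for a Gaussian process with PSD $S_X$ on $[-1/2,1/2)$: the distortion is the integral of $\min\{\zeta, S_X(\theta)\}$ and the rate is the integral of $\max\{0, \tfrac12\log(S_X(\theta)/\zeta)\}$, where $\zeta$ is the water level. The symbol $\zeta$, the grid, and the function names are choices of this sketch.

```python
import numpy as np


def reverse_water_filling(psd_values, water_level: float):
    """Distortion and rate (nats) at a given water level for a PSD sampled on a uniform grid."""
    s = np.asarray(psd_values, dtype=float)
    # Grid averages approximate integrals over an interval of length one.
    distortion = np.mean(np.minimum(water_level, s))
    # max(s, level) / level equals 1 wherever s <= level, so those points contribute zero rate.
    rate = np.mean(0.5 * np.log(np.maximum(s, water_level) / water_level))
    return float(distortion), float(rate)


def rd_dimension_estimate(psd_values, water_level: float) -> float:
    """Finite-distortion proxy 2 R(D) / (-log D) evaluated at the given water level."""
    distortion, rate = reverse_water_filling(psd_values, water_level)
    return 2.0 * rate / (-np.log(distortion))


if __name__ == "__main__":
    thetas = -0.5 + (np.arange(1000) + 0.5) / 1000
    psd = np.where(np.abs(thetas) < 0.25, 1.0, 0.0)  # support has Lebesgue measure 1/2
    for level in (1e-2, 1e-20, 1e-200):
        print(level, rd_dimension_estimate(psd, level))
```

As the water level decreases, the estimate approaches the Lebesgue measure of the support of the PSD, in line with (49). As in the i.i.d. case, the convergence is only logarithmic in the distortion.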
We observe from Theorem 10 that the information dimension rate of a Gaussian process depends only on the derivative of its SDF , which coincides almost everywhere with the derivative of the absolutely-continuous part of . Indeed, any SDF can be decomposed as [15, (4.3), p. 124]
[TABLE]
where is absolutely continuous w.r.t. the Lebesgue measure, is discrete, and is singular. Furthermore, almost everywhere [15, Sec. 4]. Consequently, the information dimension rate of a Gaussian process depends only on the absolutely-continuous part of its SDF. By combining (57) with Theorem 10 and Lemma 8, we can show that the same is true for non-Gaussian processes.
Corollary 12
Let be a stationary, -variate, real-valued process with SDF , and let be a stationary, -variate, real-valued process with SDF , where is the absolutely-continuous part of , cf. (57). Then
[TABLE]
Proof:
Combining the decomposition (57) with the spectral representation of stationary processes [16, Sec. 4.11], it can be shown that every stationary process can be written as
[TABLE]
where , , and are stationary, mutually uncorrelated, stochastic processes with the respective SDFs , , and ; see [16, p. 758] and references therein. Since and are zero almost everywhere [15, Sec. 4], we obtain from Theorem 10 and the nonnegativity of the information dimension rate (Lemma 6) that
[TABLE]
Corollary 12 follows by applying Lemma 8 first together with (60) to show that
[TABLE]
and then together with (61) to show that
[TABLE]
∎
IV-D Information Dimension Rate of Complex-Valued Processes
So far, we have considered real-valued stochastic processes. However, every complex-valued RV can be written as a two-dimensional, real-valued, random vector, so the previous results directly generalize to the complex case. In particular, one can define the information dimension rate of the -variate, complex-valued process as the information dimension rate of the -variate, real-valued process that follows by stacking the real part of on top of the imaginary part of .
Let be a stationary, -variate, complex-valued process with mean vector and matrix-valued SDF , i.e.,
[TABLE]
where
[TABLE]
is the autocovariance function. We say that a stationary, -variate, complex-valued process is proper if it has finite variance, its mean vector is the zero vector, and its pseudo-autocovariance function satisfies
[TABLE]
The following result generalizes Theorem 10 to complex-valued stochastic processes.
Theorem 13
Let be a stationary, -variate, complex-valued process with matrix-valued SDF . Then,
[TABLE]
with equality if is Gaussian and proper.
Proof:
See Appendix G. ∎
Note that neither Gaussianity nor properness is sufficient for equality in Theorem 13. Conversely, Gaussianity and properness are not necessary for equality. For example, any univariate stationary Gaussian process achieves (66) with equality if its real and imaginary components are independent and if the derivatives of their SDFs have matching support.
V Another Definition of Information Dimension
Jalali and Poor [7] proposed a different definition for the information dimension of a univariate stochastic process. We shall refer to this information dimension as the block-average information dimension and denote it by . In this section, we discuss scenarios in which the information dimension rate (Definition 2) coincides with and differs from the block-average information dimension. For ease of exposition, in this section we follow [7] and restrict our attention to univariate real-valued processes.
The following definition for the information dimension of stochastic processes was proposed in [7].
Definition 4
The block-average information dimension of the stochastic process is defined as
[TABLE]
provided the limits exist. If the limits do not exist, then one can define the upper and lower block-average information dimension and by replacing the limits by limits superior and limits inferior, respectively.
In the following, we restrict ourselves to stationary processes, in which case the limit over in (67) is guaranteed to exist. We refer to as the block-average information dimension because it was shown in [7, Lemma 3] that, if is stationary and the information dimension exists for every , then
[TABLE]
If does not exist, then the proof of [7, Lemma 3] reveals that
[TABLE]
Since conditioning reduces entropy, it follows immediately that
[TABLE]
Thus, like the information dimension rate, the block-average information dimension of the stochastic process cannot exceed the information dimension of the marginal RV .
While the entropy rate of a stationary process can alternatively be written as the conditional entropy of given , cf. (4), the block-average information dimension does, in general, not permit a similar expression. In fact, let
[TABLE]
provided the limit over exists. (Since conditioning reduces entropy, the limit over always exists.) The upper and lower information dimensions and are defined analogously by replacing the limit over by the limit superior and limit inferior, respectively. Then, we have that
[TABLE]
where the inequality can be strict; see Theorem 14 and Example 4 below.
V-A Block-Average Information Dimension vs. Information Dimension Rate
We next demonstrate that, for -mixing processes, the information dimension rate coincides with the block-average information dimension . However, in general the two definitions do not coincide, but there exists an ordering between them.
Theorem 14
Let be a stationary process. Then,
[TABLE]
Moreover,
[TABLE]
where the limits over exist because, by the stationarity of , the mutual information is monotonically decreasing in .
Proof:
See Appendix H-A. ∎
The inequalities in (74) imply that, if the limits over exist, then
[TABLE]
is a necessary and sufficient condition for the equality of and . Note that, for every , we have [17, eq. (8.9)]
[TABLE]
Thus, (75) is satisfied for processes that allow us to change the order of taking limits as and tend to infinity. However, in general (75) is difficult to check. We next present a sufficient condition that is easier to verify.
Corollary 15
Let be a stationary process. Assume that there exists a nonnegative integer such that
[TABLE]
Then, .
Proof:
See Appendix H-B. ∎
Condition (77) holds for -mixing processes. Indeed, since every -mixing process satisfies (6), it follows that one can find an such that . The condition (77) holds then by the data processing inequality.
If (77) holds for , then we even have that
[TABLE]
Thus, in this case all presented generalizations of information dimension to stochastic processes coincide with the information dimension of the marginal RV. To prove (78), we note that (77) with gives
[TABLE]
It then follows by the data processing inequality that
[TABLE]
Consequently,
[TABLE]
if the limit exists. In general, we have . The claim (78) follows then by (73) and because, by (70), .
Condition (77) with is satisfied, for example, if is a sequence of i.i.d. RVs, if it is a discrete-valued stochastic process with finite marginal entropy, or if it is a continuous-valued stochastic process with finite marginal differential entropy and finite differential entropy rate.
In the following, we present two examples of processes for which . As we shall argue, neither of these examples satisfies (77), hence (77) is sufficient but not necessary.
Example 2
Let be a sequence of i.i.d. Bernoulli- RVs, i.e., , and let be a sequence of i.i.d. RVs with PDF supported on and finite differential entropy. By (13), we thus have that for every . We define the stochastic process as
[TABLE]
and assume that has the same marginal distribution as . Note that is first-order Markov, so
[TABLE]
Furthermore, [7, Th. 3] demonstrates that . Thus, together with (73), this yields that
[TABLE]
The stochastic process , as defined by (82), satisfies (75) but not (77). Indeed, for every nonnegative integer , we have , since has finite differential entropy and the event has positive probability. It follows that for every and , so (77) is violated. In contrast, we have
[TABLE]
since conditioning on the binary random variable changes mutual information by at most one bit. If , then ; if , then , which is independent of . In both cases, the conditional mutual information between and given is zero, so (75) is satisfied.
Example 3
Let the process be periodic with period and have finite marginal differential entropy. Further let be uniformly distributed on . Then, the shifted process , defined by
[TABLE]
is stationary [18, Th. 10-5] and has finite marginal differential entropy. For every and , we have that and , hence
[TABLE]
As in the previous example, the stochastic process satisfies (75) but not (77). Indeed, for every nonnegative integer , we have since has finite differential entropy and the process is periodic. In contrast, , so the conditional mutual information between and given is zero when
In many cases, the inequalities in Theorem 14 can be strict. The following example shows such a strict inequality for the class of stationary Gaussian processes with PSD supported on a set of positive Lebesgue measure. (Footnote 3: The assumption that has a PSD is made for notational convenience and is not essential. All steps in Example 4 continue to hold if we replace by the derivative of the SDF .)
Example 4
Let be a stationary Gaussian process with zero mean, variance , and PSD having support . It follows from Theorem 10 that
[TABLE]
We next argue that if then and . Consequently,
[TABLE]
To show that , we note that
[TABLE]
where the inequality follows by the stationarity of ; because conditioning reduces entropy; and because, conditioned on , is independent of . Since is Gaussian, it follows that, conditioned on , the RV is Gaussian with mean and variance , which is independent of . It can be further shown that if , then for every finite (see Lemma 16 below). It follows that, conditioned on any , the RV has a PDF, so by (13)
[TABLE]
Together with Fatou’s lemma, this shows that the RHS of (90) is , hence .
To demonstrate that , we note that implies that
[TABLE]
This is a necessary and sufficient condition for as ; see, e.g., [19, Sec. 10.6]. Intuitively, the fact that implies that the conditional distribution of given is almost surely degenerate, hence . To prove this rigorously, we apply [13, Lemma 30] together with the fact that conditioning reduces entropy to upper-bound
[TABLE]
Expressing as , where is zero-mean, unit-variance Gaussian, the RHS of (93) can be written as . Since as , we obtain from [3, Lemma 1] that
[TABLE]
Consequently, the claim follows from the definition of .
Lemma 16
Let be a stationary, univariate, real-valued, Gaussian process with zero mean, variance , and SDF . Suppose that for some finite . Then,
[TABLE]
Proof:
See Appendix H-C. ∎
V-B Block-Average Information Dimension vs. Rate-Distortion Dimension
The connection between the block-average information dimension and the rate-distortion dimension of a stochastic process was studied in [8]. The equivalence between the rate-distortion dimension and the information dimension [2, Prop. 3.3] directly implies that
[TABLE]
Rezagah et al. [8] demonstrated that the order of the limits on the RHS of (96) can be exchanged. More precisely, [8, Th. 2] states that if exists for all , then
[TABLE]
This may appear as a contradiction to our results, since we demonstrate in Theorem 9 that , and Example 4 demonstrates that there are stochastic processes for which . However, the proof of (97) relies on the fact that [8, Sec. VI-E]
[TABLE]
and that the RHS of (98) vanishes as . If (79) holds, then this is indeed the case; see [17, eqs. (8.6)–(8.10)]. As shown in Corollary 15, in this case we also have that . In fact, as discussed after Corollary 15, in this case all presented generalizations of information dimension to stochastic processes coincide with the information dimension of the marginal RV. In contrast, if (79) does not hold then, by the data processing inequality, the RHS of (98) is infinite. This is, for example, the case if is a stationary process with positive variance and a PSD that is zero on a set of positive Lebesgue measure, since for such processes the differential entropy is . Our proof of Theorem 9 does not rely on (98). We thus conclude that for every stochastic process , but that only for those processes for which .
VI Operational Characterizations
Information dimension was recently given an operational characterization in almost lossless data compression [4]. Specifically, Wu and Verdú defined the minimum -achievable rate to be the minimum of such that there exists a sequence of encoders and decoders satisfying [4, Def. 4]
[TABLE]
for all sufficiently large. As argued in [4, Sec. IV-B], if we impose no restrictions on and , then zero rate is achievable even for , since the cardinality of is the same for any . However, if we restrict ourselves either to encoders that are linear or to decoders that are Lipschitz continuous, then the minimum -achievable rate for collections of i.i.d. RVs with a discrete-continuous mixed distribution, i.e., a distribution of the form (12), is given by
[TABLE]
Thus, for such RVs, information dimension has an operational characterization.
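As a numerical illustration of Rényi's definition for such mixed distributions, the following sketch evaluates the entropy of the uniformly quantized RV in closed form and divides it by the logarithm of the number of quantization cells per unit interval. It assumes a toy mixture that is not taken from the text: an atom at zero with weight 1 - rho and a uniform density on [0, 1) with weight rho. The function names and the quantizer floor(m X)/m are likewise our assumptions. For this mixture the ratio approaches rho, the weight of the continuous component, but only at a speed of order 1/log m.

```python
import math


def quantized_entropy_bits(rho: float, m: int) -> float:
    """Entropy in bits of floor(m * X) / m for the assumed toy mixture:
    X = 0 with probability 1 - rho, X ~ Uniform[0, 1) with probability rho."""
    # The first cell collects the atom at 0 plus the uniform mass on [0, 1/m).
    p_first = (1.0 - rho) + rho / m
    # Each of the remaining m - 1 cells carries uniform mass rho / m.
    p_other = rho / m
    entropy = -p_first * math.log2(p_first) if p_first > 0 else 0.0
    if m > 1 and p_other > 0:
        entropy -= (m - 1) * p_other * math.log2(p_other)
    return entropy


def dimension_estimate(rho: float, m: int) -> float:
    """Finite-m proxy for the information dimension: H([X]_m) / log m."""
    return quantized_entropy_bits(rho, m) / math.log2(m)


if __name__ == "__main__":
    for exponent in (4, 8, 16, 32):
        print(exponent, dimension_estimate(0.5, 2 ** exponent))
```

The slow convergence visible in the printed values reflects the constant entropy term contributed by the choice between the atom and the continuous part, which is only suppressed by the 1/log m normalization.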
For stochastic processes , Wu and Verdú further demonstrated that the minimum -achievable rate, achievable with Lipschitz-continuous decoders , can be lower-bounded as [20, Remark 4]
[TABLE]
To the best of our knowledge, for non-i.i.d. processes , no matching achievability result exists for almost lossless data compression.
In contrast, for universal compressed sensing with linear encoding and decoding via Lagrangian minimum entropy pursuit, it was shown by Jalali and Poor that is an achievable rate when is -mixing:
Theorem 17 ([7, Th. 8])
Consider a -mixing stationary process taking values in with upper block-average information dimension . For each , let the entries of the measurement matrix be drawn i.i.d. according to a zero-mean, unit-variance, Gaussian distribution. Given generated by and , let
[TABLE]
where , is the conditional empirical entropy [7, Def. 1], (for ), , and . If the number of measurements satisfies
[TABLE]
then
[TABLE]
as .
In words, Theorem 17 states that if the rate of random linear measurements of is slightly larger than the block-average information dimension, then the Lagrangian relaxation of minimum entropy pursuit provides an asymptotically distortion-free estimate of in terms of the Euclidean norm. Thus, for -mixing processes, the block-average information dimension is an achievable rate for almost zero-distortion recovery.
We next discuss an operational characterization of the rate-distortion dimension. By Theorem 9, this is also an operational characterization of the information dimension rate. In [8], Rezagah et al. considered the almost zero-distortion recovery of stationary processes when the decoder employs compressible signal pursuit (CSP) optimization:
Theorem 18 ([8, Cor. 2])
Consider a stationary, real-valued process and a system of random linear observations with measurement matrix composed of i.i.d. zero-mean, unit-variance, Gaussian RVs. If the number of measurements satisfies
[TABLE]
then there exists a family of compression codes such that
[TABLE]
as , where is the solution of the CSP optimization
[TABLE]
and denotes the codebook of the compression code.
In words, if the rate of random linear measurements of is slightly larger than the rate-distortion dimension, then there exists a family of compression codes for which CSP optimization yields an asymptotically distortion-free estimate of in terms of the Euclidean norm. Thus, the rate-distortion dimension is an achievable rate for almost zero-distortion recovery.
To summarize, (101) demonstrates that yields a lower bound on the sampling rate required for almost lossless recovery with Lipschitz-continuous decoders. In contrast, Theorem 18 demonstrates that (and hence also ) is an achievable rate for almost zero-distortion recovery. Furthermore, as illustrated by Example 4, there are processes for which
[TABLE]
Our results thus demonstrate that there exist stationary processes for which the sampling rate required for almost zero-distortion recovery is strictly smaller than the sampling rate required for almost lossless recovery with Lipschitz-continuous decoders. In other words, the fundamental limits of almost zero-distortion recovery and almost lossless recovery are different in general.
Comparing the lower bound (101) for almost lossless recovery with Theorem 18 for almost zero-distortion recovery, we observe that there are two main differences in the setup:
- i) (101) is obtained for a Lipschitz-continuous decoder , whereas Theorem 18 is based on CSP optimization;
- ii) for almost lossless recovery, is required to be exactly equal to with high probability (cf. (99)), whereas for almost zero-distortion recovery it suffices that be small.
The following example presents a class of stationary processes for which almost zero-distortion recovery at rate may also be achieved with linear encoders and decoders. This suggests that the second difference has greater impact.
Example 5
Let be a stationary, univariate, real-valued, Gaussian process possessing a PSD with support . By Theorem 10, we have that . We next invoke the sampling theorem to demonstrate that there exist linear encoders and decoders such that
[TABLE]
and
[TABLE]
as , where .
To describe and , we divide the indices into three groups:
[TABLE]
where is an arbitrary sequence of even integers that tends to infinity sublinearly in . The encoder only reproduces the values of with indices , i.e., . Consequently,
[TABLE]
and the rate converges to as .
We next show that we can find a decoder for which (110) holds. Clearly, the values are directly observed. It therefore remains to estimate the missing values of , which is done via the interpolation formula
[TABLE]
It follows that
[TABLE]
where the last step is due to stationarity. By the sampling theorem for stochastic processes, the expected value on the RHS of (116) vanishes as [21, Th. 1]. Thus, dividing both sides of (116) by and letting gives
[TABLE]
which together with Chebyshev’s inequality [22, Th. 4.10.7] yields (110).
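The interpolation idea behind Example 5 can be sketched numerically in a simplified setting. The sketch below is not the construction of the example: it assumes a spectral support contained in (-1/4, 1/4), replaces the Gaussian process by a deterministic sum of three sinusoids as a stand-in for one sample path, lets the hypothetical encoder keep only the even-indexed samples (rate 1/2, the Lebesgue measure of the assumed band [-1/4, 1/4]), and lets the decoder apply a truncated sinc interpolation. All names and parameter values are our assumptions.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def signal(t: int) -> float:
    """Deterministic stand-in for one sample path; all frequencies lie in (-1/4, 1/4)."""
    return (math.cos(2 * math.pi * 0.05 * t)
            + 0.5 * math.sin(2 * math.pi * 0.12 * t + 0.3)
            + 0.25 * math.cos(2 * math.pi * 0.21 * t + 1.1))


def interpolate(t: int, half_window: int) -> float:
    """Linear decoder: estimate x[t] from the even-indexed samples x[2n], |n| <= half_window."""
    total = 0.0
    for n in range(-half_window, half_window + 1):
        # Kernel of the sampling theorem for a sampling period of 2.
        total += signal(2 * n) * sinc((t - 2 * n) / 2.0)
    return total


if __name__ == "__main__":
    for t in (1, 3, 7):
        print(t, signal(t), interpolate(t, 2000))
```

The truncation error of the sinc series decays only slowly in the window length, which is in line with the example letting the number of retained samples grow before the per-sample distortion vanishes.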
VII Conclusions
Rényi [1] proposed the information dimension and the -dimensional entropy to measure the information content of general RVs. His idea was to quantize the real-valued RV by a uniform quantizer of step size , and to then analyze the entropy of the quantized RV in the limit as tends to infinity. His results demonstrate that any RV with positive information dimension has infinite information content. This is, e.g., the case for RVs whose probability measure has an absolutely-continuous part. The problem becomes even more interesting for stochastic processes , since their information content is not only determined by the distribution of the marginals , but also by their temporal dependence. For example, consider a stationary Gaussian process with bandlimited PSD. On the one hand, Gaussian processes have absolutely-continuous marginals, so one would expect that their information content is infinite. On the other hand, for processes with a bandlimited PSD, the present sample can be perfectly predicted from its infinite past (see Example 4), which suggests that the information content of is zero.
To shed some light on such questions, we proposed a generalization of information dimension to stochastic processes by defining the information dimension rate as the entropy rate divided by in the limit as . We demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the pre-log factor of the rate-distortion function . We further showed that among all stationary processes with PSD , the Gaussian process has the largest information dimension rate. This is consistent with the observation that Gaussian processes are the hardest to predict, hence they are expected to have the largest information content. We then showed that the information dimension rate of stationary Gaussian processes is given by the average rank of , i.e.,
[TABLE]
Specialized to the univariate case, this yields that the information dimension rate is given by the Lebesgue measure of the support of , i.e.,
[TABLE]
This agrees with the intuition that if the PSD of is zero on a set of positive Lebesgue measure, then some samples can be expressed in terms of the remaining samples and therefore carry no information. It further answers the question raised above: stationary Gaussian processes with a bandlimited PSD do have infinite information content, unless the PSD is zero almost everywhere.
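For a concrete feel of the univariate statement, the following sketch approximates the Lebesgue measure of the support of a PSD by a midpoint Riemann sum. It assumes that frequencies are normalized to the interval [-1/2, 1/2] and uses a hypothetical ideal low-pass PSD; the function names, the grid size, and the bandwidth 0.2 are illustrative choices and not taken from the text.

```python
def support_measure(psd, grid_points: int = 100_000, tol: float = 0.0) -> float:
    """Approximate the Lebesgue measure of {theta in [-1/2, 1/2] : psd(theta) > tol}
    by counting midpoints of a uniform grid (frequency normalization is assumed)."""
    step = 1.0 / grid_points
    hits = 0
    for i in range(grid_points):
        theta = -0.5 + (i + 0.5) * step
        if psd(theta) > tol:
            hits += 1
    return hits * step


def bandlimited_psd(theta: float, bandwidth: float = 0.2) -> float:
    """Hypothetical PSD: constant on [-bandwidth, bandwidth] and zero elsewhere,
    scaled so that it integrates to one (unit variance)."""
    return 1.0 / (2.0 * bandwidth) if abs(theta) <= bandwidth else 0.0


if __name__ == "__main__":
    print(support_measure(bandlimited_psd))        # low-pass example
    print(support_measure(lambda theta: 1.0))      # white spectrum
```

Under the stated assumptions, the first value approximates the measure 0.4 of the band [-0.2, 0.2], whereas a white spectrum fills the whole frequency interval.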
An alternative definition for the information dimension of a stochastic process was proposed by Jalali and Poor [7] as the information dimension of divided by in the limit as . We referred to this quantity as the block-average information dimension . While and coincide for -mixing processes, in general we have that , where the inequality can be strict. In particular, as illustrated by Example 4, if the support of of the Gaussian process has positive Lebesgue measure, then . Thus, in contrast to the information dimension rate, the block-average information dimension does not capture the dependence of the information dimension on the support size of .
The essential difference between the definitions of and is the order in which the limits over the quantization bin size and the block size are taken. Rezagah et al. [8] showed that these limits can be exchanged if the process satisfies
[TABLE]
in which case . However, in this case the information dimension of the stochastic process coincides with the information dimension of the marginal RV . In other words, for such processes a generalization of information dimension to stochastic processes is redundant. In contrast, we showed in Theorem 9 that, for any stochastic process , the information dimension rate coincides with the rate-distortion dimension . This implies that coincides with only for those stochastic processes for which .
The equivalence between the information dimension rate and the rate-distortion dimension implies that inherits the operational characterizations of . For example, it was demonstrated in [8] that is an achievable rate for almost zero-distortion recovery. In contrast, [20] shows that is a lower bound on the minimum -achievable rate, achievable with Lipschitz-continuous decoders. By demonstrating that there are processes for which
[TABLE]
our results show that the fundamental limits of almost zero-distortion recovery and almost lossless recovery are different in general. Jalali and Poor [7] further showed that is an achievable rate for universal lossless compressed sensing with linear encoding and decoding via Lagrangian minimum entropy pursuit when is -mixing. Since for -mixing processes we have , our definition also inherits this operational characterization.
Appendix A Appendix to Section III
A-A Proof of Lemma 2
The first inequality in (14), namely,
[TABLE]
follows directly from Fatou’s lemma [22, Th. 1.6.8, p. 50]. The second inequality follows because the limit inferior is upper-bounded by the limit superior. For the third inequality, note that for every and [1, eq. (11)]
[TABLE]
Furthermore, since conditioning reduces entropy, we have
[TABLE]
for every . Hence, the RHS of (123) is integrable, and the third inequality in (14) follows again by Fatou’s lemma.
A-B Proof of Lemma 4
If , then we have and the right-most inequality in (17) holds trivially. Moreover, in this case for at least one , so for this we also have . Thus, the left-most inequality also holds.
If , then we have
[TABLE]
hence the upper information dimensions are finite. It follows by the chain rule of entropy and because conditioning reduces entropy that
[TABLE]
Likewise, we have
[TABLE]
where the inequality follows because conditioning reduces entropy and because, conditioned on , is independent of .
A-C Proof of Theorem 5
To simplify notation, we shall write collections of RVs as vectors, namely, and . The proof of Theorem 5 is based on the following lemma.
Lemma 19
Let and be - and -dimensional, jointly Gaussian vectors with mean vectors and and joint covariance matrix . Then, there exists a matrix and a length- vector such that . Moreover, has zero mean, is uncorrelated with , and satisfies .
Proof:
If and are jointly Gaussian, then can be written as a linear transformation of and an uncorrelated error. This follows from the fact that, for jointly Gaussian and , the linear minimum mean-square error (LMMSE) estimator of given always exists and is given by . The result that has zero mean, is uncorrelated with , and satisfies follows by direct calculation. ∎
Since information dimension is translation invariant, it follows that
[TABLE]
Furthermore, since and are jointly Gaussian, so are and ; since they are uncorrelated, they are also independent. Thus,
[TABLE]
where is the covariance matrix of . The identities (128) and (129) hold for every , so it follows from Lemma 2 that . It remains to show that is the generalized Schur complement of in . Indeed, by [12, 7.1.P28] there exists a matrix such that . The generalized Schur complement of in is then given by
[TABLE]
Comparing (130) with the expression of given in Lemma 19, we observe that if the matrix in Lemma 19 satisfies . This is indeed the case: since , and since and are uncorrelated, we have that
[TABLE]
This proves Theorem 5.
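A small worked instance may help to see the decomposition of Lemma 19 and the resulting Schur complement at work. The instance and the notation below are our own and hypothetical: we take a scalar standard Gaussian $X$, an independent standard Gaussian $Z$, and the two-dimensional vector $Y = (X, X+Z)^{\mathsf{T}}$, and we write $C_X$, $C_Y$, $C_{YX}$, and $C_N$ for the covariance of $X$, the covariance of $Y$, the cross-covariance, and the covariance of the error vector $N$, respectively.

```latex
% Hypothetical instance (our notation): X ~ N(0,1), Z ~ N(0,1) independent of X,
% and Y = (X, X+Z)^T.
\begin{align*}
C_X &= 1, \qquad
C_{YX} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
C_Y = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}, \\
A &= C_{YX} C_X^{-1} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
N = Y - A X = \begin{pmatrix} 0 \\ Z \end{pmatrix}, \\
C_N &= C_Y - C_{YX} C_X^{-1} C_{YX}^{\mathsf{T}}
     = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}
     - \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
     = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix},
\qquad \operatorname{rank}(C_N) = 1 .
\end{align*}
```

In this instance the first coordinate of $Y$ is determined by $X$, so the error vector is degenerate in that coordinate and only the second coordinate contributes to the rank of the Schur complement.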
Appendix B Proof of Lemma 7
To prove Lemma 7, we shall need the following auxiliary result.
Lemma 20
Let be a collection of real-valued RVs, and let be Lipschitz continuous with Lipschitz constant . Then,
[TABLE]
Proof:
Note that if for some , then , a cube with diameter . The image of this cube under the Lipschitz function has a diameter not greater than . Computing induces a partition of into -dimensional cubes. Of this partition, at most elements have a nonempty intersection with the image of under . Therefore,
[TABLE]
for every , so Lemma 20 follows by averaging over . ∎
We next prove Lemma 7. Let . To prove the right-most relation in (28), we use that for every and
[TABLE]
The second summand can be further upper-bounded by
[TABLE]
Since every function is Lipschitz with a Lipschitz constant at most , we can use Lemma 20 to bound the RHS of (135) by . Since this term is independent of , the contribution of the second summand on the RHS of (134) vanishes as . We thus obtain by dividing both sides of (134) by and letting and tend to infinity.
To prove the left-most relation in (28), we use that for every and
[TABLE]
The claim follows then by dividing both sides of (136) by and letting and tend to infinity.
Appendix C Proof of Lemma 8
For every and , we have
[TABLE]
Dividing by and letting first and then tend to infinity yields (30).
To prove (31), we note that Lemma 7 and (30) yield . For the reverse inequality, we use [13, Lemma 30] and the fact that conditioning reduces entropy to obtain
[TABLE]
Dividing both sides of (138) by , and letting first and then tend to infinity, yields and proves (31).
Finally, if is discrete and , then , since
[TABLE]
where the second entropy is finite by assumption and the first entropy satisfies . Conversely, if is discrete and , then
[TABLE]
Dividing all terms by and letting tend to infinity thus yields
[TABLE]
Since , the second term on the RHS of (141) tends to zero as tends to infinity. Thus, dividing (141) by , and letting tend to infinity, yields (32).
Appendix D Proof of Theorem 9
The proof of Theorem 9 is essentially identical to the proof of [2, Lemma 3.2]. For the sake of completeness, we reproduce the full proof here. Indeed, choosing in (33)
[TABLE]
yields
[TABLE]
since for the choice (142) we have , hence it satisfies (34). Consequently, dividing by , and taking limits as and , we obtain
[TABLE]
if the limits exist. If the limits do not exist, then we obtain the same upper bound for the limits replaced by limits superior and limits inferior. (Footnote 4: Since , taking the limit as is tantamount to taking the limit as .)
We next derive a lower bound on the rate-distortion dimension. To simplify notation, we treat the collection of -variate random vectors as a collection of RVs. To show that the upper bound (144) holds with equality, we use the following lower bound on given in [23], [2, eq. (A.1)]:
[TABLE]
where is an arbitrary nonnegative measurable function satisfying
[TABLE]
Following the proof of [2, Lemma 3.2], we apply (145) with
[TABLE]
We first show that this choice of satisfies (146). Indeed,
[TABLE]
where the second step follows by substituting and . Since the sum over does not depend on , it follows that
[TABLE]
which can be upper-bounded as
[TABLE]
Hence,
[TABLE]
which, by (149), is equal to . It follows that and , as chosen in (147) and (148), satisfy (146).
We next evaluate (145) for this choice of and and for distortion . This yields
[TABLE]
For , this becomes
[TABLE]
We next replace again the collection of RVs by the equivalent collection of random vectors. Dividing both sides of (155) by , and taking the limits as and , yields
[TABLE]
if the limits over and exist. If the limits do not exist, then we obtain the same lower bound for the limits replaced by limits superior and limits inferior. Combining (156) with (144) proves Theorem 9.
Appendix E Proof of Theorem 10
The proof consists of two parts. In the first part, we show that of all processes with a given SDF , the Gaussian process has the largest information dimension rate (Section E-A). In the second part, we demonstrate that the information dimension rate of Gaussian processes is given by the average rank of the derivative of the SDF (Section E-B).
E-A Gaussian Processes Maximize the Information Dimension
By Theorem 9, the upper information dimension rate is given by
[TABLE]
The claim that the information dimension is maximized by a Gaussian process then follows by the well-known fact that of all random vectors with a given covariance matrix , the Gaussian random vector has the largest rate-distortion function .
To prove this claim for multivariate sources, we shall write the collection of -variate vectors as a collection of RVs , where . Since the information dimension rate is translation invariant (Lemma 7), we can assume without loss of generality that the RVs have zero mean. Furthermore, by the eigenvalue decomposition, there exists an orthogonal matrix such that the random variables given by are uncorrelated and their variances are the eigenvalues of , which we shall denote by . Since mutual information is invariant under bijections, and the Euclidean norm is invariant under multiplications by orthogonal matrices, it follows that .
For the case where are independent, zero-mean, Gaussian random variables with variances , the rate-distortion function is given by [10, Th. 13.3.3]
[TABLE]
where
[TABLE]
and is chosen so that .
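The reverse water-filling solution referred to above can be evaluated numerically. The sketch below is a minimal illustration under our own conventions, which may differ from the normalization used in (158) and (159): the distortion budget is the total expected squared error summed over all components, rates are in bits, and the water level is found by bisection. The function name and the tuple it returns are assumptions made for illustration.

```python
import math


def reverse_water_filling(variances, total_distortion):
    """Return (water_level, rate_in_bits) for independent zero-mean Gaussian
    components under a total squared-error budget `total_distortion`."""
    if total_distortion <= 0:
        raise ValueError("total_distortion must be positive")
    if total_distortion >= sum(variances):
        # The budget covers all the variance: nothing needs to be described.
        return max(variances), 0.0
    lo, hi = 0.0, max(variances)
    for _ in range(200):  # bisection on the water level
        mid = 0.5 * (lo + hi)
        # Component i is assigned distortion min(level, variance_i).
        spent = sum(min(mid, v) for v in variances)
        if spent < total_distortion:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    # Only components whose variance exceeds the water level receive rate.
    rate = sum(0.5 * math.log2(v / level) for v in variances if v > level)
    return level, rate


if __name__ == "__main__":
    print(reverse_water_filling([1.0, 1.0], 0.5))
    print(reverse_water_filling([1.0, 0.1], 0.3))
```

For variances (1, 1, 0) and a vanishing budget D, twice the rate divided by the negative logarithm of D approaches 2, the number of components with positive variance, which illustrates in this simple setting why a rate-distortion dimension counts non-degenerate directions.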
The RHS of (158) can also be achieved for non-Gaussian RVs by choosing the following (possibly suboptimal) distribution of the reconstruction values :
[TABLE]
where are independent, zero-mean, Gaussian RVs with variances , and is as in (159). Indeed, it is easy to check that (160) satisfies the distortion constraint (34). Furthermore, by using that conditioning reduces entropy and that Gaussian RVs maximize differential entropy, it can be shown that
[TABLE]
Comparing (161) with (158), we conclude that, for uncorrelated RVs ,
[TABLE]
where are jointly Gaussian with the same covariance matrix as . Since and $R\bigl((Y_{1}^{k'})_{G},kD\bigr)=R\bigl((X_{1}^{k'})_{G},kD\bigr)$, the same is also true for general RVs. Together with (157), this proves that of all processes with a given SDF , the Gaussian process has the largest information dimension rate.
E-B The Information Dimension of Gaussian Processes
We now assume that is Gaussian. For every , we define
[TABLE]
Furthermore, let be i.i.d. (over all and ) and uniformly distributed on , and let . We define , , and as the corresponding multivariate processes. Since is independent of for every , the (matrix-valued) SDFs of , , and satisfy
[TABLE]
Moreover, the (matrix-valued) PSD of exists and equals
[TABLE]
Since the information dimension rate is translation invariant (Lemma 7), and since the SDF does not depend on the mean vector , we can assume without loss of generality that has zero mean. We further show in Lemma 21 in Appendix E-C that we can assume without loss of generality that every component process of has unit variance. By (48) in Lemma 11, it thus follows that
[TABLE]
We continue by writing the entropy of in terms of a differential entropy, i.e.,
[TABLE]
Denoting by a Gaussian vector with the same mean and covariance matrix as , and denoting by and the PDFs of and , respectively, this can be expressed as
[TABLE]
Dividing by , and letting first and then tend to infinity, yields the information dimension rate . Lemma 22 in Appendix E-C shows that
[TABLE]
for some constant that is independent of . Moreover, the differential entropy rate of the stationary, -variate, Gaussian process is given by [15, Th. 7.10]
[TABLE]
It thus follows that the information dimension rate of equals
[TABLE]
It remains to show that the RHS of (171) is equal to the RHS of (43). To do so, we first show that the integral on the RHS of (171) can be restricted to a subset on which the entries of are bounded from above by for some . We then show that, on this set, can be bounded from above and from below by products of affine transforms of the eigenvalues of . These bounds are asymptotically tight, i.e., they are equal in the limit as tends to infinity. We complete the proof by showing that the order of limit and integration can be exchanged.
E-B1 Restriction on
Choose and let
[TABLE]
By (47) in Lemma 11, we have for every
[TABLE]
Since the set is the union of , , it then follows by the union bound that
[TABLE]
To prove (174), we note that, by the Lebesgue decomposition theorem [22, Th. 2.2.6] and the fact that the Radon-Nikodym derivative coincides with almost everywhere [22, Sec. 2.3],
[TABLE]
where the second inequality follows because is nonnegative, and the third inequality follows by definition of . By (47) in Lemma 11, the integral on the left-hand side (LHS) of (176) is upper-bounded by , hence (174) follows.
By (164) and (165), we have that
[TABLE]
Since derivatives of matrix-valued SDFs are positive semidefinite, it follows that
[TABLE]
Hence,
[TABLE]
where the last step follows from (175). Applying Hadamard’s inequality and Jensen’s inequality, we further get
[TABLE]
where the last step follows from (47), (164), (166), and the assumption that every component process of has zero mean and unit variance. Since, by (46) in Lemma 11, as , (180) yields
[TABLE]
Consequently,
[TABLE]
for every . It follows that this integral does not contribute to the information dimension rate if we let tend to infinity. In view of (171), we thus obtain the information dimension rate by evaluating
[TABLE]
in the limit as first and then tends to infinity.
E-B2 Bounding by the Eigenvalues of
[TABLE]
Let , , denote the eigenvalues of . Since is positive semidefinite, we obtain
[TABLE]
We next derive an upper bound on . Let denote the -matrix norm of . Since is positive semidefinite, the element with the maximum modulus is on the main diagonal; cf. [12, Problem 7.1.P1]. Furthermore, by assumption, on the diagonal elements of are bounded from above by . We hence obtain that
[TABLE]
It is known that all matrix norms bound the largest eigenvalue of the matrix from above [12, Th. 5.6.9]. (Footnote 5: This bound holds without a multiplicative constant, since the spectral radius of a matrix is the infimum of all matrix norms [12, Lemma 5.6.10].) Thus, the upper bound (186) is also an upper bound on the largest eigenvalue of . Let , , denote the eigenvalues of . Then, we have for (such that ) [12, Cor. 4.3.15]
[TABLE]
Combining (185) and (187) with (183), we obtain
[TABLE]
To compute the limit of (183) as , we thus need to evaluate
[TABLE]
where is either (left-most inequality in (188)) or (right-most inequality in (188)).
E-B3 Exchanging Limit and Integration
To evaluate (189), we continue along the lines of [24, Sec. VIII]. Specifically, for each , we split the integral on the RHS of (189) into three parts:
[TABLE]
where is arbitrary.
For the first part, we obtain
[TABLE]
which evaluates to in the limit as .
We next show that the integrals over and do not contribute to (189). To this end, it suffices to consider the integral of the function
[TABLE]
In the remainder of the proof, we shall assume without loss of generality that , in which case on . Clearly, whenever , the function in (194) converges to zero as . Moreover, for , this function is nonpositive.
For all we have , hence we can find a sufficiently large such that, by (46) in Lemma 11, we have , . Since by the same result we also have , , it follows that, for ,
[TABLE]
The LHS of (195) is nonpositive and monotonically increases to zero as . We can thus apply the monotone convergence theorem [22, Th. 1.6.7, p. 49] to get
[TABLE]
We next turn to the case . It was shown in [24, p. 443] that if , then the function in (194) is bounded from above by 1. Furthermore, if then it is nonnegative, and if then it is nonpositive and monotonically increasing in . Restricting ourselves to the case , we thus obtain for
[TABLE]
where we made use of the fact that , and, by (46) in Lemma 11, , . Hence, on the magnitude of the function in (194) is bounded by
[TABLE]
We can thus apply the dominated convergence theorem [22, Th. 1.6.9, p. 50] to get
[TABLE]
Combining (193), (196), and (199), we can evaluate (189) as
[TABLE]
E-B4 Wrapping Up
To compute the limit of (183) as first and then tends to infinity, it remains to let on the RHS of (200). By the continuity of the Lebesgue measure, this yields
[TABLE]
To summarize, combining (171), (182), and (200), we obtain that
[TABLE]
This proves Theorem 10.
E-C Auxiliary Results
Lemma 21
Suppose that is a stationary, -variate, real-valued, Gaussian process with mean vector and SDF . Suppose that the component processes are ordered by their variances, i.e.,
[TABLE]
Then,
[TABLE]
and, for almost every ,
[TABLE]
Proof:
Normalizing component processes with positive variance to unit variance does not affect the information dimension rate, as follows from Lemma 7. If , then the component process is almost surely constant. It follows that for every and every , so
[TABLE]
Dividing by , and letting and tend to infinity, shows that .
Let be an diagonal matrix with values on the main diagonal. For component processes with zero variance, the corresponding row and column of is zero almost everywhere. Hence, we have for almost every that
[TABLE]
where $0$ denotes an all-zero matrix of appropriate size. We thus have for almost every . ∎
Lemma 22
Let be an -variate, real-valued, Gaussian vector with mean vector and covariance matrix . Let , where is an -variate vector, independent of , with components independently and uniformly distributed on . Then,
[TABLE]
Proof:
By [25, Th. 23.6.14], can be written as
[TABLE]
where is an -dimensional, zero-mean, Gaussian vector () with independent components whose variances are the nonzero eigenvalues of and where is an matrix satisfying . We use the data processing inequality, the chain rule for relative entropy, and the fact that is Gaussian, to obtain
[TABLE]
where denotes the PDF of a Gaussian vector with the same mean vector and covariance matrix as , and
[TABLE]
To evaluate the relative entropy on the RHS of (210), we first note that, given , the random vector is uniformly distributed on an -dimensional cube of length . Since can be obtained from via (209), the conditional PDF of given is
[TABLE]
Consequently, denoting ,
[TABLE]
where and denote the conditional mean and the conditional covariance matrix of given . These can be computed as [25, Th. 23.7.4]
[TABLE]
where denotes the cross-covariance matrix of and , and and denote the covariance matrices of and , respectively.
Defining , we have . Since is independent of , the cross-covariance matrix of and is equal to the cross-covariance matrix of and . Bussgang’s theorem [26, eq. (20)] yields , where is defined in (45). Hence, if is a diagonal matrix with on the main diagonal, then . From (209) we get and , hence
[TABLE]
Together with (215) and (216), this yields
[TABLE]
Combining (218) with (209), and using the triangle inequality, we upper-bound each component of as
[TABLE]
The first and the third term on the RHS of (220) are both upper-bounded by , and the second term is upper-bounded by . From (46) in Lemma 11, we get that the term is upper-bounded by , where is the variance of . We thus obtain
[TABLE]
We next note that, since , and since is independent of and i.i.d. on ,
[TABLE]
It can be shown that is the conditional covariance matrix of given , hence it is positive semidefinite. (Indeed, we have and, by (209), . Replacing in (216) by , and repeating the steps leading to (219), we obtain the desired result.) It follows that the smallest eigenvalue of is lower-bounded by . Together with (221), this yields for the second term on the RHS of the relative-entropy expression preceding (215)
[TABLE]
To upper-bound the first term on the RHS of the relative-entropy expression preceding (215), we use that (222) combined with Lemma 11 implies that every diagonal element of is given by
[TABLE]
The first term on the RHS of (224) is negative, and the second term is upper-bounded by . Hence, every element on the main diagonal of is upper-bounded by . It thus follows from Hadamard’s inequality that
[TABLE]
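Hadamard's inequality states that the determinant of a positive semidefinite matrix is at most the product of its diagonal entries. The following minimal sketch checks this on a small example; the matrix and the helper `det3` are hypothetical and serve only as an illustration.

```python
def det3(a):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

# Hypothetical example: C = B B^T is symmetric positive semidefinite.
b = [[1.0, 0.5, 0.0], [0.2, 1.0, 0.3], [0.0, 0.4, 1.0]]
c = [[sum(b[i][k] * b[j][k] for k in range(3)) for j in range(3)] for i in range(3)]

det_c = det3(c)
diag_product = c[0][0] * c[1][1] * c[2][2]
```

In this example `det_c <= diag_product`, so a bound on every diagonal element translates into a bound on the determinant, which is how the inequality is used in the proof.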
Combining (223) and (225) with the relative-entropy expression preceding (215) and with (210) yields
[TABLE]
and completes the proof. ∎
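The construction of the vector in Lemma 22 can be simulated as a sketch. The code assumes that the quantizer is the uniform quantizer with step size 1/m, implemented as floor(m x)/m, and that the added noise is uniform on [0, 1/m). Both are assumptions made here, since the symbols are not reproduced in this text, and the function name is hypothetical. Under these assumptions, each component of the dithered quantized vector differs from the corresponding Gaussian component by at most the step size.

```python
import math
import random

def dithered_quantization(m=8, n=1000, seed=1):
    """Return the largest |w - x| over n scalar samples, where
    w = floor(m * x) / m + u and u is uniform on [0, 1/m) (assumed construction)."""
    rng = random.Random(seed)
    max_dev = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # Gaussian sample
        q = math.floor(m * x) / m      # uniform quantization with step size 1/m
        u = rng.random() / m           # independent uniform noise on [0, 1/m)
        w = q + u
        max_dev = max(max_dev, abs(w - x))
    return max_dev

max_dev = dithered_quantization()
```

The returned `max_dev` never exceeds 1/m, because both the quantization error and the noise lie in [0, 1/m).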
Appendix F Spectral Distribution Function of
Let be a stationary, -variate, Gaussian process with mean vector and SDF . Let and be defined as and , respectively. For every pair , we have
[TABLE]
Bussgang’s theorem [26, eq. (20)] further yields that , where is defined in (45). Consequently,
[TABLE]
Since the SDF is fully determined by the covariance structure of a process [27, Th. 1, p. 206], we obtain (44).
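Bussgang's theorem, as used above, says that the cross-covariance between a nonlinearly distorted Gaussian component and any jointly Gaussian component is a scaled version of the covariance of the undistorted pair. The following Monte Carlo sketch illustrates this for a scalar pair. The quantizer floor(m x)/m, the correlation value, and the definition of the gain as Cov(quantized, original)/Var(original) are assumptions made for illustration and need not coincide exactly with (45).

```python
import math
import random

def sample_cov(xs, ys):
    """Biased sample covariance of two equally long sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def bussgang_check(m=2, rho=0.6, n=50000, seed=0):
    """Compare Cov(Z1, X2) with gain * Cov(X1, X2) for Z1 = floor(m * X1) / m."""
    rng = random.Random(seed)
    x1, x2, z1 = [], [], []
    for _ in range(n):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rng.gauss(0.0, 1.0)
        x1.append(g1)
        x2.append(rho * g1 + math.sqrt(1.0 - rho * rho) * g2)  # correlation rho with x1
        z1.append(math.floor(m * g1) / m)                      # quantized first component
    gain = sample_cov(z1, x1) / sample_cov(x1, x1)             # assumed Bussgang gain
    lhs = sample_cov(z1, x2)
    rhs = gain * sample_cov(x1, x2)
    return lhs, rhs

lhs, rhs = bussgang_check()
```

The two returned values agree up to sampling error, which is the scalar analogue of the statement that the cross-covariance of the quantized process and the original process is a scaled covariance.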
To prove (47), namely,
[TABLE]
we note that
[TABLE]
Since and , the claim (47) follows.
It remains to prove (46), namely,
[TABLE]
Set , . We have
[TABLE]
Furthermore,
[TABLE]
It follows that
[TABLE]
Since for , this yields
[TABLE]
This proves (46) and concludes the proof of Lemma 11.
Appendix G Proof of Theorem 13
Let be a stationary, -variate, complex-valued process with matrix-valued SDF . Let the real composite process be defined as . That is, is obtained by stacking the real part of on top of the imaginary part of . Further let the augmented process be defined as . Clearly, and satisfy , where
[TABLE]
is unitary up to a factor of , i.e., . The matrix-valued autocovariance function of reads
[TABLE]
where denotes the pseudo-autocovariance function of . The corresponding matrix-valued SDF is given by
[TABLE]
where satisfies
[TABLE]
The autocovariance functions and SDFs of and are related via
[TABLE]
By definition, . It thus follows from Theorem 10 that
[TABLE]
Since left or right multiplication by a nonsingular matrix leaves the rank unchanged, we obtain from (241) that the rank of is equal to the rank of . Furthermore, by (238), the rank of is upper-bounded by the rank of plus the rank of [28, Th. 1]. Consequently,
[TABLE]
where the second step follows because complex conjugation does not affect the rank.
If is Gaussian, then (242) holds with equality by Theorem 10. If is, in addition, proper then , so the derivative of is zero almost everywhere. Hence, the derivative of becomes block diagonal almost everywhere and its rank equals the sum of the ranks of its diagonal elements. We conclude that, if is proper Gaussian, then (243) holds with equality. This proves Theorem 13.
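The relation between the real composite vector and the augmented vector can be illustrated in the scalar case. The sketch below assumes the common choice T = [[1, i], [1, -i]], for which T T^H = 2I, so that T is unitary up to scaling. This matrix and the helper names are assumptions for illustration, and the normalization used in the paper may differ.

```python
def matmul(a, b):
    """Product of two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def conj_transpose(a):
    """Conjugate (Hermitian) transpose of a matrix given as nested lists."""
    return [[complex(a[i][j]).conjugate() for i in range(len(a))]
            for j in range(len(a[0]))]

# Assumed transformation for a scalar complex variable.
T = [[1, 1j], [1, -1j]]
tth = matmul(T, conj_transpose(T))

z = complex(0.7, -1.2)                     # hypothetical sample value
real_composite = [z.real, z.imag]          # real part stacked on top of imaginary part
augmented = [sum(T[i][k] * real_composite[k] for k in range(2)) for i in range(2)]
```

Here `tth` equals twice the identity matrix and `augmented` equals `[z, z.conjugate()]`. Since T is nonsingular, multiplying by it leaves ranks unchanged, which is the property the proof relies on.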
Appendix H Appendix to Section V
H-A Proof of Theorem 14
For every and we have
[TABLE]
by stationarity; and because conditioning reduces entropy and, conditioned on , is independent of . Note that, by (4) and stationarity,
[TABLE]
Thus, dividing (244) by and taking first the limit over and then the limit over yields
[TABLE]
This proves (73).
We next bound the difference . By (245), we have
[TABLE]
Dividing (247) by and taking first the limit over and then the limit over yields
[TABLE]
This concludes the proof of Theorem 14.
H-B Proof of Corollary 15
Suppose there exists a nonnegative such that
[TABLE]
We first show that
[TABLE]
In a second step, we then show that (249) implies that
[TABLE]
which together with (250) and (74) demonstrates that , thus proving Corollary 15.
To prove (250), we use the chain rule, stationarity, and the fact that conditioning reduces entropy, to obtain
[TABLE]
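The two entropy facts invoked here, the chain rule and the fact that conditioning reduces entropy, can be verified on a small discrete example. The joint distribution below is hypothetical and is used only as an illustration of these two properties.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution of a binary pair (X, Y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = [sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)]
p_y = [sum(p for (_, y), p in joint.items() if y == v) for v in (0, 1)]

h_x = entropy(p_x)
h_y = entropy(p_y)
h_xy = entropy(joint.values())

# Conditional entropy H(X|Y), computed directly from the conditional distributions.
h_x_given_y = 0.0
for v, py in zip((0, 1), p_y):
    cond = [joint[(x, v)] / py for x in (0, 1)]
    h_x_given_y += py * entropy(cond)
```

In this example `h_xy` equals `h_y + h_x_given_y` (chain rule) and `h_x_given_y <= h_x` (conditioning reduces entropy).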
Having obtained (250), we next show that (249) implies (251). Indeed,
[TABLE]
where is a nonnegative integer satisfying (249). Here, the first inequality follows from the chain rule; the second inequality follows from the data processing inequality and by upper-bounding the second mutual information by .
The first limit on the RHS of (253) is zero because, by assumption, . The second limit on the RHS of (253) can be written as , which is zero because, by Lemma 1, is bounded in . This proves (251) and concludes the proof of Corollary 15.
H-C Proof of Lemma 16
Since is Gaussian, the conditional mean of given can be written as
[TABLE]
for some coefficients . (More precisely, the coefficients correspond to the LMMSE estimator for estimating from . The LMMSE estimator always exists, even though it is not necessarily unique.) The conditional variance is thus given by (see, e.g., [19, Sec. 10.6])
[TABLE]
The function
[TABLE]
is analytic on the closed interval , hence it is either constant or it has at most finitely many zeros in . Moreover, cannot be the all-zero function, as can be argued by contradiction. Indeed, suppose there exist such that for all . Then, by (255), we have irrespective of . In other words, we can find a linear estimator that perfectly predicts from irrespective of the SDF of . This is clearly a contradiction, since even the best predictor yields for an i.i.d., zero-mean, variance-, Gaussian process, i.e., when . Thus, the set is finite and therefore has Lebesgue measure zero.
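The contradiction argument can be illustrated numerically. The sketch assumes that the prediction error in (255) has the standard form of the integral of |1 - Σ_k a_k e^{-i2πkθ}|² against the SDF over the interval [-1/2, 1/2]. This form, the frequency normalization, and the function name are assumptions made here. For an i.i.d. process with constant spectral density, the error then equals the variance times (1 + Σ_k a_k²), so no choice of coefficients brings it below the variance.

```python
import cmath
import math

def prediction_error_white(coeffs, sigma2=1.0, grid=4096):
    """Average of |1 - sum_k a_k e^{-i 2 pi k theta}|^2 * sigma2 over theta in [-1/2, 1/2),
    i.e., the assumed linear prediction error for a white process."""
    total = 0.0
    for i in range(grid):
        theta = -0.5 + i / grid
        g = 1 - sum(a * cmath.exp(-2j * math.pi * (k + 1) * theta)
                    for k, a in enumerate(coeffs))
        total += abs(g) ** 2 * sigma2
    return total / grid

coeffs = [0.5, -0.25, 0.1]                 # hypothetical predictor coefficients
err = prediction_error_white(coeffs)
expected = 1.0 + sum(a * a for a in coeffs)
```

Here `err` matches `expected` up to floating-point error and is at least the variance, consistent with the statement that even the best predictor cannot predict an i.i.d. Gaussian process perfectly.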
Since for , we have
[TABLE]
Since furthermore for , we have only if
[TABLE]
This implies that for all . Hence, the set of harmonics for which is contained in . The proof is completed by the monotonicity of measures and the fact that has Lebesgue measure zero.
Acknowledgment
Fruitful discussions with Amos Lapidoth are gratefully acknowledged. The authors further wish to thank the Associate Editor Matthieu Bloch and the anonymous referees for their valuable comments.
References
- [1] A. Rényi, “On the dimension and entropy of probability distributions,” Acta Mathematica Hungarica, vol. 10, no. 1-2, pp. 193–215, Mar. 1959.
- [2] T. Kawabata and A. Dembo, “The rate-distortion dimension of sets and measures,” IEEE Trans. Inf. Theory, vol. 40, no. 5, pp. 1564–1572, Sep. 1994.
- [3] T. Koch, “The Shannon lower bound is asymptotically tight,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 6155–6161, Nov. 2016.
- [4] Y. Wu and S. Verdú, “Rényi information dimension: Fundamental limits of almost lossless analog compression,” IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3721–3748, Aug. 2010.
- [5] Y. Wu, S. Shamai (Shitz), and S. Verdú, “Information dimension and the degrees of freedom of the interference channel,” IEEE Trans. Inf. Theory, vol. 61, no. 1, pp. 256–279, Jan. 2015.
- [6] D. Stotz and H. Bölcskei, “Degrees of freedom in vector interference channels,” IEEE Trans. Inf. Theory, vol. 62, no. 7, pp. 4172–4197, Jul. 2016.
- [7] S. Jalali and H. V. Poor, “Universal compressed sensing for almost lossless recovery,” IEEE Trans. Inf. Theory, vol. 63, no. 5, pp. 2933–2953, May 2017.
- [8] F. E. Rezagah, S. Jalali, E. Erkip, and H. V. Poor, “Compression-based compressed sensing,” IEEE Trans. Inf. Theory, vol. 63, no. 10, pp. 6735–6752, Oct. 2017.
