On the Information Dimension of Stochastic Processes

Bernhard C. Geiger; Tobias Koch

arXiv:1702.00645·cs.IT·September 26, 2019

TL;DR

This paper extends the concept of information dimension to stochastic processes, links it to rate-distortion theory and spectral properties, and shows that, among all stationary processes with a given spectral distribution function, the Gaussian process has the largest information dimension rate.

Contribution

It introduces the information dimension rate for stochastic processes, establishes its equivalence with the rate-distortion dimension, and characterizes it for Gaussian processes based on spectral properties.

Findings

01. The information dimension rate equals the rate-distortion dimension.

02. Among all stationary processes with a given spectral distribution function, the Gaussian process has the largest information dimension rate.

03. The information dimension rate of a stationary Gaussian process is given by the average rank of the derivative of its spectral distribution function.

Equations

$$H([X]_{m}) = d(X)\log m + H_{d}(X) + o(1)$$

$$[X]_{m} \triangleq \frac{\lfloor mX\rfloor}{m}$$

$$H^{\prime}(\{\mathbf{X}_{t}\}) \triangleq \lim_{k\to\infty}\frac{H(\mathbf{X}_{1}^{k})}{k}.$$

$$\lim_{k\to\infty}\frac{H(\mathbf{X}_{1}^{k})}{k} = \lim_{k\to\infty} H(\mathbf{X}_{1}\mid\mathbf{X}_{-k}^{0}).$$

$$\lim_{k\to\infty}\sup_{A,B}\frac{P_{\mathbf{X}_{-\infty}^{0},\mathbf{X}_{k}^{\infty}}(A\cap B)}{P_{\mathbf{X}_{-\infty}^{0}}(A)\,P_{\mathbf{X}_{k}^{\infty}}(B)} = 1$$

$$\lim_{k\to\infty} I(\mathbf{X}_{k}^{\infty};\mathbf{X}_{-\infty}^{0}) = 0.$$

$$d(X_{1}^{k}) \triangleq \lim_{m\to\infty}\frac{H([X_{1}^{k}]_{m})}{\log m}, \quad \text{if the limit exists.}$$

$$\bar{d}(X_{1}^{k}) \triangleq \varlimsup_{m\to\infty}\frac{H([X_{1}^{k}]_{m})}{\log m}, \qquad \underline{d}(X_{1}^{k}) \triangleq \varliminf_{m\to\infty}\frac{H([X_{1}^{k}]_{m})}{\log m}$$

$$\bar{d}(X_{1}^{k}) = \underline{d}(X_{1}^{k}) = d(X_{1}^{k})$$

$$d(X\mid W) \triangleq \lim_{m\to\infty}\frac{H([X]_{m}\mid W)}{\log m}$$

$$0 \leq \underline{d}(X_{1}^{k}) \leq \bar{d}(X_{1}^{k}) \leq k.$$

$$P_{X} = (1-\rho)P_{d} + \rho P_{c}$$

$$d(X) = \rho.$$

$$\int \underline{d}(X\mid Y=y)\,\mathrm{d}P_{Y}(y) \leq \underline{d}(X\mid Y) \leq \bar{d}(X\mid Y) \leq \int \bar{d}(X\mid Y=y)\,\mathrm{d}P_{Y}(y).$$

$$d(X\mid Y) = \int d(X\mid Y=y)\,\mathrm{d}P_{Y}(y).$$

$$\overline{\underline{d}}(X\mid Y) \leq \overline{\underline{d}}(X)$$

$$\sum_{t=1}^{k}\bar{d}(X_{t}) \geq \bar{d}(X_{1}^{k}) \geq \underline{d}(X_{1}^{k}) \geq \sum_{t=1}^{k}\underline{d}(X_{t}\mid X_{1}^{t-1}).$$

$$d(X_{1},X_{2},Y) \geq d(X_{1},X_{2}) + d(Y\mid X_{1},X_{2}) = 2.$$

$$d(Y) + d(X_{1},X_{2}\mid Y) \leq 1.$$

$$d(X_{1},X_{2},Y) > d(Y) + d(X_{1},X_{2}\mid Y)$$

$$C_{X_{1}^{k},Y_{1}^{k}} = \begin{bmatrix} C_{X_{1}^{k}} & C_{X_{1}^{k}Y_{1}^{k}} \\ C_{X_{1}^{k}Y_{1}^{k}}^{\textnormal{{\tiny T}}} & C_{Y_{1}^{k}} \end{bmatrix}.$$

$$\overline{\underline{d}}(X_{1}^{k}) \leq \mathrm{rank}(C_{X_{1}^{k}}).$$

$$d(X_{1}^{k}\mid Y_{1}^{\ell}) = \mathrm{rank}(C_{X_{1}^{k}|Y_{1}^{\ell}})$$

$$d(\{\mathbf{X}_{t}\}) \triangleq \lim_{m\to\infty}\lim_{k\to\infty}\frac{H([\mathbf{X}_{1}^{k}]_{m})}{k\log m}$$

$$0 \leq \overline{\underline{d}}(\{\mathbf{X}_{t}\}) \leq \overline{\underline{d}}(\mathbf{X}_{1}) \leq L.$$

$$H([\mathbf{X}_{1}]_{1}) \leq H([\mathbf{X}_{1}^{k}]_{m}).$$

$$\sup_{t\in\mathbb{Z}}\mathsf{K}_{t} < \infty.$$

$$\overline{\underline{d}}(\{f_{t}(\mathbf{X}_{t})\}) \leq \overline{\underline{d}}(\{\mathbf{X}_{t}\}) = \overline{\underline{d}}(\{\mathbf{X}_{t},f_{t}(\mathbf{X}_{t})\}).$$

$$\overline{\underline{d}}(\{W_{t}\mathbf{X}_{t}+\mathbf{c}_{t}\}) = \overline{\underline{d}}(\{\mathbf{X}_{t}\}).$$

$$\overline{\underline{d}}(\{\mathbf{X}_{t}\},\{\mathbf{Z}_{t}\})$$

Full text

On the Information Dimension of Stochastic Processes

Bernhard C. Geiger and Tobias Koch

The work of Bernhard C. Geiger has partly been funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund and by the German Ministry of Education and Research in the framework of an Alexander von Humboldt Professorship. The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Digital and Economic Affairs, and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

The work of Tobias Koch has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 714161), from the 7th European Union Framework Programme under Grant 333680, from the Ministerio de Economía y Competitividad of Spain under Grants TEC2013-41718-R, RYC-2014-16332, and TEC2016-78434-C3-3-R (AEI/FEDER, EU), and from the Comunidad de Madrid under Grant S2103/ICE-2845.

This work has been presented in part at the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, June 2017, and at the 2018 International Zurich Seminar on Information and Communication, Zurich, Switzerland, February 2018.

Bernhard C. Geiger is with Know-Center GmbH, 8010, Graz, Austria (e-mail: [email protected]).

Tobias Koch is with the Signal Theory and Communications Department, Universidad Carlos III de Madrid, 28911, Leganés, Spain, and also with the Gregorio Marañón Health Research Institute, 28007, Madrid, Spain (e-mail: [email protected]).

Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Abstract

In 1959, Rényi proposed the information dimension and the $d$-dimensional entropy to measure the information content of general random variables. This paper proposes a generalization of information dimension to stochastic processes by defining the information dimension rate as the entropy rate of the uniformly-quantized stochastic process divided by minus the logarithm of the quantizer step size $1/m$ in the limit as $m\to\infty$. It is demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the rate-distortion function $R(D)$ of the stochastic process divided by $-\log(D)$ in the limit as $D\downarrow 0$. It is further shown that, among all multivariate stationary processes with a given (matrix-valued) spectral distribution function (SDF), the Gaussian process has the largest information dimension rate, and that the information dimension rate of multivariate stationary Gaussian processes is given by the average rank of the derivative of the SDF. The presented results reveal that the fundamental limits of almost zero-distortion recovery via compressible signal pursuit and almost lossless analog compression are different in general.

Index Terms:

Entropy, Gaussian process, information dimension, rate-distortion dimension

I Introduction

In 1959, Rényi [1] proposed the information dimension and the $d$-dimensional entropy to measure the information content of general random variables (RVs). His idea was to quantize the RV $X$ by a uniform quantizer of step size $1/m$, and to then analyze the entropy of the quantized RV $[X]_{m}$ in the limit as $m$ tends to infinity. Assuming that the entropy $H([X]_{m})$ exists and the asymptotic expansion

$$H([X]_{m}) = d(X)\log m + H_{d}(X) + o(1) \tag{1}$$

holds for $m\to\infty$ (where $o(1)$ refers to remainder terms that vanish as $m\to\infty$), Rényi referred to $d(X)$ as the information dimension and to $H_{d}(X)$ as the $d$-dimensional entropy.

In recent years, it was shown that the information dimension is of relevance in various areas of information theory, including rate-distortion theory, almost lossless analog compression, and the analysis of interference channels. For example, Kawabata and Dembo [2] showed that the information dimension of an RV is equal to its rate-distortion dimension, defined as twice the rate-distortion function $R(D)$ divided by $-\log(D)$ in the limit as $D\downarrow 0$. Koch [3] demonstrated that the rate-distortion function of a source with infinite information dimension is infinite, and that for any source with finite information dimension and finite differential entropy the Shannon lower bound on the rate-distortion function is asymptotically tight. Wu and Verdú [4] analyzed linear encoding and Lipschitz decoding of discrete-time, independent and identically distributed (i.i.d.), stochastic processes and showed that the information dimension plays a fundamental role in achievability and converse results. Wu et al. [5] showed that the degrees of freedom of the $K$-user Gaussian interference channel can be characterized through the sum of information dimensions. Stotz and Bölcskei [6] generalized this result to vector interference channels.

Jalali and Poor [7] proposed a generalization of information dimension to stationary, discrete-time, stochastic processes by defining the information dimension $d^{\prime}(\{X_{t}\})$ of the stochastic process $\{X_{t}\}$ as the information dimension of $(X_{1},\ldots,X_{k})$ divided by $k$ in the limit as $k\to\infty$. (Footnote 1: More precisely, Jalali and Poor define the information dimension of a stochastic process via a conditional entropy of the uniformly-quantized process. For stationary processes, their definition coincides with the above-mentioned definition [7, Lemma 3].) They showed that, for $\psi^{*}$-mixing processes, the information dimension is an achievable rate for universal compressed sensing with linear encoding and decoding via Lagrangian minimum entropy pursuit [7, Th. 8]. Rezagah et al. [8] showed that $d^{\prime}(\{X_{t}\})$ coincides, under certain conditions, with the rate-distortion dimension $\mathsf{dim}_{R}(\{X_{t}\})$, thus generalizing the result by Kawabata and Dembo [2] to stochastic processes. Other notions of information dimensions for stochastic processes are discussed in [9].

In this paper, we propose a different definition for the information dimension of stationary, discrete-time, stochastic processes. Specifically, let $\{[X_{t}]_{m}\}$ denote the stochastic process $\{X_{t}\}$ uniformly quantized with step size $1/m$. We define the information dimension rate $d(\{X_{t}\})$ of $\{X_{t}\}$ as the entropy rate of $\{[X_{t}]_{m}\}$ divided by $\log m$ in the limit as $m\to\infty$. For i.i.d. processes, our definition coincides with that of Jalali and Poor (and, in fact, evaluates to Rényi’s information dimension of the marginal RV $X_{t}$). More generally, we show that these definitions are equivalent for $\psi^{*}$-mixing processes. Nevertheless, there are stochastic processes for which the two definitions disagree. In particular, we derive a closed-form expression for the information dimension rate of stationary, multivariate, Gaussian processes with power spectral density (PSD) $\mathsf{S}_{X}$, which, specialized to the univariate case, yields that $d(\{X_{t}\})$ is equal to the Lebesgue measure of the set of harmonics on $[-1/2,1/2]$ where $\mathsf{S}_{X}$ is positive. For Gaussian processes with a bandlimited PSD, this implies that the information dimension rate $d(\{X_{t}\})$ is equal to twice the PSD’s bandwidth. This is consistent with the intuition that for such processes not all samples contain information. For example, if the bandwidth of the PSD is $1/4$, then we expect that half of the samples in $\{X_{t}\}$ can be expressed as linear combinations of the other samples and, hence, do not contain information. In contrast, we show that the information dimension $d^{\prime}(\{X_{t}\})$ is $1$ if $\mathsf{S}_{X}$ is positive on any set with positive Lebesgue measure. In other words, $d^{\prime}(\{X_{t}\})$ does not capture the dependence of the information dimension on the support size of $\mathsf{S}_{X}$.
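As a rough numerical illustration of the univariate Gaussian statement above, the following sketch approximates the Lebesgue measure of the set of harmonics on $[-1/2,1/2]$ where a PSD is positive. The flat bandlimited PSD, the function names, and the grid size are assumptions made for illustration only; the code evaluates the support measure and does not compute any quantized entropy.

```python
import numpy as np

def support_measure(psd, n_grid=10_000):
    """Approximate the Lebesgue measure of {theta in [-1/2, 1/2]: psd(theta) > 0}."""
    # Midpoint grid on [-1/2, 1/2]; every grid cell has length 1 / n_grid.
    theta = -0.5 + (np.arange(n_grid) + 0.5) / n_grid
    # The interval has length 1, so the fraction of positive cells is the measure.
    return float(np.mean(psd(theta) > 0))

def bandlimited_psd(theta, bandwidth=0.25):
    """Hypothetical flat PSD that is positive only for |theta| < bandwidth."""
    return np.where(np.abs(theta) < bandwidth, 1.0, 0.0)

measure = support_measure(bandlimited_psd)
print(measure)
```

For a bandwidth of $1/4$, the approximated measure is $1/2$, i.e., twice the bandwidth, in line with the bandlimited example in the text.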

By emulating the proof of [2, Lemma 3.2], we further show that, for any stochastic process $\{X_{t}\}$, the information dimension rate $d(\{X_{t}\})$ coincides with the rate-distortion dimension $\mathsf{dim}_{R}(\{X_{t}\})$. This implies that $d^{\prime}(\{X_{t}\})$ coincides with $\mathsf{dim}_{R}(\{X_{t}\})$ only for those stochastic processes for which $d^{\prime}(\{X_{t}\})=d(\{X_{t}\})$.

The rest of this paper is organized as follows. In Section II, we introduce the notation used in this paper. In Section III, we present preliminary results on the Rényi information dimension of RVs and random vectors. In Section IV, we present our definition of the information dimension rate of a stochastic process, discuss its connection to the rate-distortion dimension, and compute the information dimension rate of stationary Gaussian processes. In Section V, we review the information dimension proposed by Jalali and Poor and discuss its relation to $d(\{X_{t}\})$ . In Section VI, we briefly discuss the operational meanings of information dimension in compressed sensing and zero-distortion recovery. Section VII concludes the paper with a discussion of the obtained results. Some of the proofs are deferred to the appendices.

II Notation and Preliminaries

We denote by $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{Z}$ the set of real numbers, the set of complex numbers, and the set of integers, respectively. We further denote by $\mathbb{R}^{+}$ and $\mathbb{N}$ the set of nonnegative real numbers and the set of positive integers, respectively. We use a calligraphic font, such as $\mathcal{F}$, to denote other sets, and we denote complements as $\mathcal{F}^{\mathsf{c}}$. The set difference between two sets $\mathcal{F}$ and $\mathcal{G}$ is written as $\mathcal{F}\setminus\mathcal{G}$.

The real and imaginary parts of a complex number $z$ are denoted as $\mathfrak{Re}(z)$ and $\mathfrak{Im}(z)$, respectively, i.e., $z=\mathfrak{Re}(z)+\imath\mathfrak{Im}(z)$ where $\imath\triangleq\sqrt{-1}$. The complex conjugate of $z$ is denoted as $z^{*}$.

We use uppercase letters to denote deterministic matrices and boldface lowercase letters to denote deterministic vectors. The transpose of a vector or matrix is denoted by $(\cdot)^{\textnormal{{\tiny T}}}$, the Hermitian transpose by $(\cdot)^{\mathsf{H}}$. The determinant and rank of a matrix $A$ are $\det A$ and $\mathrm{rank}(A)$, respectively. We denote by $I_{L}$ the $L\times L$ identity matrix.

We denote RVs by uppercase letters, e.g., $X$. For a finite or countably infinite collection of RVs we abbreviate $X_{\ell}^{k}\triangleq(X_{\ell},\dots,X_{k-1},X_{k})$, $X_{\ell}^{\infty}\triangleq(X_{\ell},X_{\ell+1},\dots)$, and $X_{-\infty}^{k}\triangleq(\dots,X_{k-1},X_{k})$. (Footnote 2: If $k<\ell$, then $X_{\ell}^{k}$ is the empty set.) Random vectors are denoted by boldface uppercase letters, e.g., $\mathbf{X}\triangleq(X_{1},\dots,X_{L})^{\textnormal{{\tiny T}}}$. Univariate discrete-time stochastic processes are denoted as $\{X_{t},\,t\in\mathbb{Z}\}$ or, in short, as $\{X_{t}\}$. For $L$-variate stochastic processes we use the same notation but with $X_{t}$ replaced by $\mathbf{X}_{t}\triangleq(X_{1,t},\dots,X_{L,t})^{\textnormal{{\tiny T}}}$. We call $\{X_{i,t},\,t\in\mathbb{Z}\}$ a component process.

We denote the probability measure of the RV $X$ by $P_{X}$. If $P_{X}$ is absolutely continuous with respect to (w.r.t.) the Lebesgue measure, then we denote its probability density function (PDF) as $f_{X}$. We denote by $X_{G}$ a Gaussian RV with the same mean and variance as $X$, and we denote the corresponding Gaussian density as $g_{X}$.

We define the quantization of a real-valued RV $X$ with precision $m$ as

$$[X]_{m} \triangleq \frac{\lfloor mX\rfloor}{m} \tag{2}$$

where $\lfloor a\rfloor$ is the largest integer less than or equal to $a$. Likewise, $\lceil a\rceil$ denotes the smallest integer greater than or equal to $a$. We denote by $[X_{\ell}^{k}]_{m}=([X_{\ell}]_{m},\ldots,[X_{k}]_{m})$ the component-wise quantization of $X_{\ell}^{k}$ (and similarly for other finite or countably infinite collections of RVs and random vectors). For complex RVs $Z$ with real part $R$ and imaginary part $I$, the quantization $[Z]_{m}$ is equal to $[R]_{m}+\imath[I]_{m}$. We define $\mathcal{C}(z_{1}^{k},a)\triangleq[z_{1},z_{1}+a)\times\cdots\times[z_{k},z_{k}+a)$ as the $k$-dimensional hypercube in $\mathbb{R}^{k}$, with its bottom-left corner at $z_{1}^{k}$ and with sidelength $a$. For example, we have that $[X_{1}^{k}]_{m}=z_{1}^{k}$ if $X_{1}^{k}\in\mathcal{C}(z_{1}^{k},1/m)$.
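The uniform quantizer with step size $1/m$ and its component-wise extension can be sketched in a few lines. The function names, the evenly spaced sample set, and the plug-in entropy estimate are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter

def quantize(x, m):
    """Uniform quantizer with step size 1/m: [x]_m = floor(m * x) / m."""
    return math.floor(m * x) / m

def quantize_vector(xs, m):
    """Component-wise quantization of a finite collection of reals."""
    return tuple(quantize(x, m) for x in xs)

def empirical_entropy(symbols):
    """Plug-in (relative-frequency) entropy estimate in nats."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * math.log(c / n) for c in counts.values())

# 1000 evenly spaced points in [0, 1), standing in for a uniform distribution.
samples = [(i + 0.5) / 1000 for i in range(1000)]
for m in (2, 10, 100):
    h = empirical_entropy([quantize(x, m) for x in samples])
    print(m, h, math.log(m))
```

For these evenly spaced samples each of the $m$ cells receives the same number of points, so the estimated entropy equals $\log m$, which is the behavior expected of an absolutely continuous RV with information dimension one.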

Let $H(\cdot)$, $h(\cdot)$, and $D(\cdot\|\cdot)$ denote entropy, differential entropy, and relative entropy, respectively, and let $I(\cdot;\cdot)$ denote mutual information [10]. We take logarithms to base $e\approx 2.718$, so mutual informations and entropies are measured in nats. The entropy rate of a discrete-valued, stationary, $L$-variate process $\{\mathbf{X}_{t}\}$ is [10, Sec. 4.2]

$$H^{\prime}(\{\mathbf{X}_{t}\}) \triangleq \lim_{k\to\infty}\frac{H(\mathbf{X}_{1}^{k})}{k}. \tag{3}$$

Note that the stationarity of $\{\mathbf{X}_{t}\}$ guarantees that the limit in (3) exists and is equal to [10, Th. 4.2.1]

$$\lim_{k\to\infty}\frac{H(\mathbf{X}_{1}^{k})}{k} = \lim_{k\to\infty} H(\mathbf{X}_{1}\mid\mathbf{X}_{-k}^{0}). \tag{4}$$
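The two characterizations of the entropy rate of a stationary process (normalized block entropy and conditional entropy given the past) can be compared numerically on a small example. The sketch below uses a hypothetical two-state stationary Markov chain that is not taken from the paper; for such a chain the conditional entropy given the whole past reduces to the conditional entropy given the previous symbol.

```python
import itertools
import math

# Hypothetical two-state stationary Markov chain (illustration only).
P = [[0.9, 0.1],
     [0.2, 0.8]]       # P[i][j] = Pr(X_{t+1} = j | X_t = i)
PI = [2 / 3, 1 / 3]    # stationary distribution, satisfies PI = PI * P

def block_entropy(k):
    """H(X_1^k) in nats, by brute-force enumeration of all 2^k sequences."""
    h = 0.0
    for seq in itertools.product((0, 1), repeat=k):
        p = PI[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        h -= p * math.log(p)
    return h

def conditional_entropy():
    """H(X_2 | X_1) in nats; equals H(X_1 | X_{-k}^0) by the Markov property."""
    return -sum(PI[i] * P[i][j] * math.log(P[i][j])
                for i in range(2) for j in range(2))

for k in (1, 2, 4, 8, 12):
    print(k, block_entropy(k) / k)
print("conditional entropy:", conditional_entropy())
```

The normalized block entropy decreases with $k$ toward the conditional entropy, which illustrates why both limits describe the same entropy rate.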

We say that a stationary process $\{\mathbf{X}_{t}\}$ is $\psi^{*}$-mixing if

$$\lim_{k\to\infty}\sup_{A,B}\frac{P_{\mathbf{X}_{-\infty}^{0},\mathbf{X}_{k}^{\infty}}(A\cap B)}{P_{\mathbf{X}_{-\infty}^{0}}(A)\,P_{\mathbf{X}_{k}^{\infty}}(B)} = 1 \tag{5}$$

where the supremum is over all $A\in\mathcal{F}^{0}_{-\infty}$ and $B\in\mathcal{F}_{k}^{\infty}$ satisfying $P_{\mathbf{X}^{0}_{-\infty}}(A)P_{\mathbf{X}_{k}^{\infty}}(B)>0$, and where $\mathcal{F}^{0}_{-\infty}$ and $\mathcal{F}_{k}^{\infty}$ are the $\sigma$-fields generated by $\mathbf{X}^{0}_{-\infty}$ and $\mathbf{X}_{k}^{\infty}$, respectively. The $\psi^{*}$-mixing property implies that $\{\mathbf{X}_{t}\}$ is information regular, i.e., [11, pp. 111-112]

$$\lim_{k\to\infty} I(\mathbf{X}_{k}^{\infty};\mathbf{X}_{-\infty}^{0}) = 0. \tag{6}$$

III Rényi Information Dimension

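The Rényi information dimension of a collection of RVs $X_{1}^{k}$ is defined as [1]

$$d(X_{1}^{k}) \triangleq \lim_{m\to\infty}\frac{H([X_{1}^{k}]_{m})}{\log m}, \quad \text{if the limit exists.} \tag{7}$$

When the limit does not exist, we say that the information dimension does not exist. In this case, one may replace the limit either by the limit superior or by the limit inferior (denoted as $\varlimsup$ and $\varliminf$, respectively)

$$\bar{d}(X_{1}^{k}) \triangleq \varlimsup_{m\to\infty}\frac{H([X_{1}^{k}]_{m})}{\log m}, \qquad \underline{d}(X_{1}^{k}) \triangleq \varliminf_{m\to\infty}\frac{H([X_{1}^{k}]_{m})}{\log m} \tag{8}$$

and call $\bar{d}(X_{1}^{k})$ and $\underline{d}(X_{1}^{k})$ the upper and lower information dimension of $X_{1}^{k}$, respectively. Clearly,

$$\bar{d}(X_{1}^{k}) = \underline{d}(X_{1}^{k}) = d(X_{1}^{k}) \tag{9}$$

if the limit in (7) exists.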

We shall follow this notation throughout the document. Specifically, when reporting results in connection with limits, an overline $\overline{(\cdot)}$ indicates that the quantity in the brackets has been computed using the limit superior, an underline $\underline{(\cdot)}$ indicates that it has been computed using the limit inferior, and both an overline and an underline $\overline{\underline{(\cdot)}}$ indicate that a result holds irrespective of whether the limit superior or the limit inferior is taken. We write no lines if the limit exists.

Definition 1

For two RVs $X$ and $W$ with joint probability measure $P_{X,W}$, the conditional information dimension is defined as

$$d(X\mid W) \triangleq \lim_{m\to\infty}\frac{H([X]_{m}\mid W)}{\log m} \tag{10}$$

provided the limit exists. If the limit does not exist, then we define the upper and lower conditional information dimension $\bar{d}(X|W)$ and $\underline{d}(X|W)$ by replacing the limit with the limit superior and the limit inferior, respectively.

III-A Properties of Information Dimension

The information dimension of a collection $X_{1}^{k}$ is bounded by the number of RVs in the collection, provided that the integer part of the collection has finite entropy.

Lemma 1 ([1, eq. (7)], [4, Prop. 1])

Let $X_{1}^{k}$ be a collection of real-valued RVs. If $H([X_{1}^{k}]_{1})<\infty$, then

$$0 \leq \underline{d}(X_{1}^{k}) \leq \bar{d}(X_{1}^{k}) \leq k. \tag{11}$$

If $H([X_{1}^{k}]_{1})=\infty$, then $\overline{\underline{d}}(X_{1}^{k})=\infty$.

Trivially, if $X_{1}^{k}$ is a collection of discrete RVs satisfying $H([X_{1}^{k}]_{1})<\infty$, then $d(X_{1}^{k})=0$. Moreover, if the joint distribution of $X_{1}^{k}$ is absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^{k}$ and if $H([X_{1}^{k}]_{1})<\infty$, then $d(X_{1}^{k})=k$ [1, Th. 4]. More generally, Rényi claims that the information dimension of $X_{1}^{k}$ equals $n<k$ if the joint distribution of $X_{1}^{k}$ is absolutely continuous on some sufficiently smooth $n$-dimensional manifold in $\mathbb{R}^{k}$ [1, p. 209]. Furthermore, if $X$ is a real-valued RV satisfying $H([X]_{1})<\infty$ and with probability measure

$$P_{X} = (1-\rho)P_{d} + \rho P_{c} \tag{12}$$

where $P_{d}$ is a discrete measure, $P_{c}$ is an absolutely-continuous measure, and $0\leq\rho\leq 1$, then [1, Th. 3]

$$d(X) = \rho. \tag{13}$$
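The statement that a discrete-continuous mixture with weight $\rho$ on the continuous part has information dimension $\rho$ can be checked numerically on a simple case. The sketch below assumes a specific, hypothetical mixture (a point mass at $0$ with weight $1-\rho$ and a uniform density on $[0,1)$ with weight $\rho$), for which the entropy of the quantized RV has a closed form; the function names are illustrative.

```python
import math

def quantized_entropy(rho, m):
    """Exact H([X]_m) in nats for X ~ (1 - rho) * delta_0 + rho * Uniform[0, 1)."""
    cell = rho / m               # continuous mass in each of the m cells
    p0 = (1.0 - rho) + cell      # the cell [0, 1/m) also contains the atom at 0
    h = -p0 * math.log(p0)
    if cell > 0:
        h -= (m - 1) * cell * math.log(cell)   # the remaining m - 1 cells
    return h

def dimension_slope(rho, m1=2**10, m2=2**16):
    """Slope of H([X]_m) versus log m between two quantizer resolutions."""
    h1, h2 = quantized_entropy(rho, m1), quantized_entropy(rho, m2)
    return (h2 - h1) / (math.log(m2) - math.log(m1))

for rho in (0.0, 0.3, 1.0):
    m = 2**16
    print(rho, quantized_entropy(rho, m) / math.log(m), dimension_slope(rho))
```

The plain ratio $H([X]_{m})/\log m$ approaches $\rho$ only slowly because of the constant term in the expansion of $H([X]_{m})$, whereas the slope between two resolutions is already close to $\rho$ at moderate $m$.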

Two well-known properties of entropy are that it is reduced by conditioning [10, Th. 2.6.5] and that it obeys a chain rule. Furthermore, the conditional entropy of $X$ given $Y$ can be computed by first calculating the entropy conditioned on the event that $Y=y$ , and by then averaging over $Y$ . The corresponding results for information dimension are presented in the following three lemmas.

Lemma 2

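Suppose that $H([X]_{1})<\infty$. Then, we have for any two RVs $X$ and $Y$

$$\int \underline{d}(X\mid Y=y)\,\mathrm{d}P_{Y}(y) \leq \underline{d}(X\mid Y) \leq \bar{d}(X\mid Y) \leq \int \bar{d}(X\mid Y=y)\,\mathrm{d}P_{Y}(y). \tag{14}$$

Consequently, if $d(X|Y=y)$ exists $P_{Y}$-almost surely, then the limit in (10) exists and

$$d(X\mid Y) = \int d(X\mid Y=y)\,\mathrm{d}P_{Y}(y). \tag{15}$$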

Proof:

See Appendix A-A. ∎

Lemma 3

For any two RVs $X$ and $Y$, we have

$$\overline{\underline{d}}(X\mid Y) \leq \overline{\underline{d}}(X) \tag{16}$$

with equality if $X$ and $Y$ are independent.

Proof:

Since conditioning reduces entropy, we have $H([X]_{m}|Y)\leq H([X]_{m})$, with equality if $X$ and $Y$ are independent. The lemma follows by dividing both sides of the inequality by $\log m$ and taking limits as $m\to\infty$. ∎

Lemma 4

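For the collection of RVs $X_{1}^{k}$, we have

$$\sum_{t=1}^{k}\bar{d}(X_{t}) \geq \bar{d}(X_{1}^{k}) \geq \underline{d}(X_{1}^{k}) \geq \sum_{t=1}^{k}\underline{d}(X_{t}\mid X_{1}^{t-1}). \tag{17}$$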

Proof:

See Appendix A-B. ∎

The left-most inequality in (17) holds with equality if all information dimensions exist and the RVs $X_{1}^{k}$ are independent. There are examples where the right-most inequality is strict.

Example 1

Let $(X_{1},X_{2})$ be uniformly distributed on $[0,1]^{2}$ and let $Y=g(X_{1},X_{2})$, where $g{:}\ [0,1]^{2}\to[0,1]$ is bijective. Such a function can be constructed (see also the discussion in [4, Section IV.B]). Since $g$ is bijective, we have $d(Y|X_{1},X_{2})=d(X_{1},X_{2}|Y)=0$. Moreover, since $(X_{1},X_{2})$ is uniformly distributed on $[0,1]^{2}$, we have $d(X_{1},X_{2})=2$. Finally, we have $d(Y)\leq 1$ by Lemma 1. From Lemma 4, we get

$$d(X_{1},X_{2},Y) \geq d(X_{1},X_{2}) + d(Y\mid X_{1},X_{2}) = 2. \tag{18}$$

However, we also have

$$d(Y) + d(X_{1},X_{2}\mid Y) \leq 1. \tag{19}$$

It follows that

$$d(X_{1},X_{2},Y) > d(Y) + d(X_{1},X_{2}\mid Y) \tag{20}$$

so the chain rule $d(X_{1},X_{2},Y)\geq d(Y)+d(X_{1},X_{2}|Y)$ holds with strict inequality.

The above example not only demonstrates that the chain rule for information dimension may hold with strict inequality, it also shows that the order in which the chain rule is expanded can be crucial.

III-B Information Dimension of Finite-Variance RVs

For RVs $X_{1}^{k}$ that have a finite variance, the upper bound on $\bar{d}(X_{1}^{k})$ presented in Lemma 1 can be tightened. To this end, we introduce further notation. We denote the covariance matrix of the vector $\mathbf{X}=(X_{1},\dots,X_{k})^{\textnormal{{\tiny T}}}$ by $C_{X_{1}^{k}}$. Furthermore, the cross-covariance matrix between $\mathbf{X}=(X_{1},\dots,X_{k})^{\textnormal{{\tiny T}}}$ and $\mathbf{Y}=(Y_{1},\dots,Y_{k})^{\textnormal{{\tiny T}}}$ is denoted by $C_{X_{1}^{k}Y_{1}^{k}}$, and the covariance matrix of the vector $(X_{1},\ldots,X_{k},Y_{1},\ldots,Y_{k})^{\textnormal{{\tiny T}}}$ is denoted by $C_{X_{1}^{k},Y_{1}^{k}}$. Clearly,

$$C_{X_{1}^{k},Y_{1}^{k}} = \begin{bmatrix} C_{X_{1}^{k}} & C_{X_{1}^{k}Y_{1}^{k}} \\ C_{X_{1}^{k}Y_{1}^{k}}^{\textnormal{{\tiny T}}} & C_{Y_{1}^{k}} \end{bmatrix}. \tag{21}$$

One can show that the information dimension of a collection of real-valued RVs $X_{1}^{k}$ cannot exceed the rank of its covariance matrix, i.e.,

$$\overline{\underline{d}}(X_{1}^{k}) \leq \mathrm{rank}(C_{X_{1}^{k}}). \tag{22}$$

This agrees with the intuition that linearly-dependent components of $X_{1}^{k}$ do not contribute to the information dimension. One can further show that collections of Gaussian RVs achieve this upper bound with equality. Thus, among all RVs with a given covariance structure, the Gaussian RV maximizes information dimension. These results follow directly from the more general results for stochastic processes (Theorem 10) in Section IV.
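A small numerical sketch of the rank bound: for a Gaussian vector whose components are linearly dependent, the covariance matrix is rank-deficient, and by the statement above its rank is the information dimension. The mixing matrix `A` and the helper name are hypothetical choices for illustration; the code only computes a matrix rank with NumPy.

```python
import numpy as np

def covariance_rank(A, tol=1e-9):
    """Rank of the covariance matrix of X = A @ W, where W ~ N(0, I)."""
    C = A @ A.T                  # covariance matrix of X
    return int(np.linalg.matrix_rank(C, tol=tol))

# Hypothetical example: X_3 = X_1 + X_2, so only two components carry
# independent information even though the vector has three entries.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
rank = covariance_rank(A)
print(rank)
```

Here the three-dimensional Gaussian vector has a covariance matrix of rank two, so the bound in Lemma 1 (which would give three) is not tight for this collection.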

The next theorem evaluates the conditional information dimension of $X_{1}^{k}$ given $Y_{1}^{\ell}$ for jointly Gaussian RVs $(X_{1}^{k},Y_{1}^{\ell})$.

Theorem 5

Let $(X_{1}^{k},Y_{1}^{\ell})$ be a collection of real-valued, jointly Gaussian RVs. The conditional information dimension of $X_{1}^{k}$ given $Y_{1}^{\ell}$ is equal to

$$d(X_{1}^{k}\mid Y_{1}^{\ell}) = \mathrm{rank}(C_{X_{1}^{k}|Y_{1}^{\ell}}) \tag{23}$$

where $C_{X_{1}^{k}|Y_{1}^{\ell}}$ is the generalized Schur complement of $C_{Y_{1}^{\ell}}$ in $C_{X_{1}^{k},Y_{1}^{\ell}}$.

Proof:

See Appendix A-C. ∎

Theorem 5 implies that the chain rule in Lemma 4 holds with equality for Gaussian RVs. Indeed, if $X_{1}^{k}$ is a collection of real-valued, jointly Gaussian RVs, then we have $d(X_{1}^{k})=\mathrm{rank}(C_{X_{1}^{k}})$ and $d(X_{1}^{\ell})=\mathrm{rank}(C_{X_{1}^{\ell}})$. Moreover, by Theorem 5, $d(X_{\ell+1}^{k}|X_{1}^{\ell})$ equals the rank of the generalized Schur complement of $C_{X_{1}^{\ell}}$ in $C_{X_{1}^{k}}$, denoted by $C_{X_{\ell+1}^{k}|X_{1}^{\ell}}$. Since the rank of $C_{X_{1}^{k}}$ can be written as the sum of the ranks of $C_{X_{1}^{\ell}}$ and $C_{X_{\ell+1}^{k}|X_{1}^{\ell}}$ [12, 7.1.P28], the claim follows.
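The rank additivity used in this argument can be verified numerically on a small covariance matrix. The sketch assumes that the generalized Schur complement is formed with the Moore-Penrose pseudoinverse of the (possibly singular) leading block; the example matrix and the function names are hypothetical.

```python
import numpy as np

def generalized_schur_complement(C, ell):
    """Generalized Schur complement of the leading ell x ell block of C."""
    C11, C12 = C[:ell, :ell], C[:ell, ell:]
    C21, C22 = C[ell:, :ell], C[ell:, ell:]
    # The pseudoinverse replaces the inverse because C11 may be singular.
    return C22 - C21 @ np.linalg.pinv(C11, rcond=1e-10) @ C12

def rank(M, tol=1e-9):
    return int(np.linalg.matrix_rank(M, tol=tol))

# Covariance of X = A @ W with W ~ N(0, I_3); X_2 = 2 * X_1, so the leading
# 2 x 2 block is singular, and X_4 = X_1 + X_3.
A = np.array([[1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
C = A @ A.T
ell = 2
S = generalized_schur_complement(C, ell)
print(rank(C), rank(C[:ell, :ell]), rank(S))
```

In this example the full covariance matrix has rank two, which splits into rank one for the leading block and rank one for the Schur complement, mirroring $d(X_{1}^{k})=d(X_{1}^{\ell})+d(X_{\ell+1}^{k}|X_{1}^{\ell})$ for Gaussian RVs.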

IV The Information Dimension Rate

We next propose the information dimension rate as a generalization of information dimension to stochastic processes. We define the information dimension rate for general (possibly non-stationary) processes. However, for the sake of simplicity, most of our results will then be presented for stationary processes.

Definition 2

The information dimension rate of the $L$-variate stochastic process $\{\mathbf{X}_{t}\}$ is defined as

$$d(\{\mathbf{X}_{t}\}) \triangleq \lim_{m\to\infty}\lim_{k\to\infty}\frac{H([\mathbf{X}_{1}^{k}]_{m})}{k\log m} \tag{24}$$

provided the limits exist. If the limits do not exist, then we define the upper and lower information dimension rate $\bar{d}(\{\mathbf{X}_{t}\})$ and $\underline{d}(\{\mathbf{X}_{t}\})$ by replacing the limits with the limits superior and limits inferior, respectively.

IV-A Properties of the Information Dimension Rate

The information dimension rate satisfies properties similar to those presented in Lemma 1 for the information dimension. We summarize them in the following lemma.

Lemma 6

Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$-variate, real-valued process. If $H([\mathbf{X}_{1}]_{1})<\infty$, then

$$0 \leq \overline{\underline{d}}(\{\mathbf{X}_{t}\}) \leq \overline{\underline{d}}(\mathbf{X}_{1}) \leq L. \tag{25}$$

If $H([\mathbf{X}_{1}]_{1})=\infty$, then $\overline{\underline{d}}(\{\mathbf{X}_{t}\})=\infty$.

Proof:

Suppose first that $H([\mathbf{X}_{1}]_{1})<\infty$. Then, the rightmost inequality in (25) follows from (11). The left-most inequality follows from the nonnegativity of entropy. Finally, the center inequality follows since conditioning reduces entropy, hence $H^{\prime}(\{[\mathbf{X}_{t}]_{m}\})\leq H([\mathbf{X}_{1}]_{m})$.

Now suppose that $H([\mathbf{X}_{1}]_{1})=\infty$. Since $[\mathbf{X}_{1}]_{1}$ is a function of $[\mathbf{X}_{1}^{k}]_{m}$ for every $m,k\in\mathbb{N}$, we have

$$H([\mathbf{X}_{1}]_{1}) \leq H([\mathbf{X}_{1}^{k}]_{m}).$$

This implies that $H^{\prime}(\{[\mathbf{X}_{t}]_{m}\})=\infty$, and the claim that $\overline{\underline{d}}(\{\mathbf{X}_{t}\})=\infty$ follows from Definition 2. ∎

The next result discusses how Lipschitz transformations affect the information dimension rate.

Lemma 7

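Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$-variate, real-valued process, and let $\{f_{t},t\in\mathbb{Z}\}$ be a sequence of Lipschitz functions from $\mathbb{R}^{L}$ to $\mathbb{R}^{M}$ with Lipschitz constants $\mathsf{K}_{t}$ satisfying

$$\sup_{t\in\mathbb{Z}}\mathsf{K}_{t} < \infty.$$

Then,

$$\overline{\underline{d}}(\{f_{t}(\mathbf{X}_{t})\}) \leq \overline{\underline{d}}(\{\mathbf{X}_{t}\}) = \overline{\underline{d}}(\{\mathbf{X}_{t},f_{t}(\mathbf{X}_{t})\}).$$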

Proof:

See Appendix B. ∎

If $\{f_{t}\}$ is a sequence of bi-Lipschitz functions with uniformly-bounded Lipschitz constants, then Lemma 7 implies that $\overline{\underline{d}}(\{f_{t}(\mathbf{X}_{t})\})=\overline{\underline{d}}(\{\mathbf{X}_{t}\})$. As a corollary, we thus obtain that the information dimension rate is invariant under scaling and translation. More generally, it follows that, if $\{\mathbf{c}_{t}\}$ and $\{W_{t}\}$ are sequences of $L$-variate vectors and $(L\times L)$-dimensional matrices, the latter satisfying $\sup_{t\in\mathbb{Z}}\|W_{t}\|<\infty$ and $\sup_{t\in\mathbb{Z}}\|W_{t}^{-1}\|<\infty$ for some induced matrix norm $\|\cdot\|$, then

$$\overline{\underline{d}}(\{W_{t}\mathbf{X}_{t}+\mathbf{c}_{t}\}) = \overline{\underline{d}}(\{\mathbf{X}_{t}\}).$$

Since the information dimension rate of an i.i.d. process equals the information dimension of its marginal RVs, we further recover the well-known result that the information dimension of collections of RVs is invariant under scaling and translation [13, Lemma 3].

The next lemma shows that the information dimension rate of a collection of stochastic processes is unaffected by those that have zero information dimension rate.

Lemma 8

Let $\{\mathbf{X}_{t}\}$ and $\{\mathbf{Z}_{t}\}$ be two jointly stationary, $L$-variate, real-valued processes, and assume that $d(\{\mathbf{Z}_{t}\})=0$. Then,

[TABLE]

Moreover, if $Z$ is discrete with $H(Z)<\infty$, then we further have

[TABLE]

Proof:

See Appendix C. ∎

Inter alia, Lemma 8 can be used to compute the information dimension rate of a countable mixture of stochastic processes. For example, specialized to i.i.d. processes, (32) together with Lemma 2 recovers (13) by choosing $X_{1}\sim P_{d}$, $X_{2}\sim P_{c}$, and $P_{Z}(1)=1-P_{Z}(2)=1-\rho$.

IV-B Information Dimension Rate vs. Rate-Distortion Dimension

Let $R(\mathbf{X}_{1}^{k},D)$ denote the rate-distortion function of the source $\mathbf{X}_{1}^{k}$, i.e.,

[TABLE]

where the infimum is over all conditional distributions of $\hat{\mathbf{X}}_{1}^{k}$ given $\mathbf{X}_{1}^{k}$ such that

[TABLE]

and where $\|\cdot\|_{2}$ denotes the Euclidean norm. We have the following definition.

Definition 3

The rate-distortion dimension of the $L$ -variate stochastic process $\{\mathbf{X}_{t}\}$ is defined as

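$$\mathsf{dim}_{R}(\{\mathbf{X}_{t}\})\triangleq 2\lim_{D\downarrow 0}\lim_{k\to\infty}\frac{R(\mathbf{X}_{1}^{k},kD)}{-k\log D}$$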

provided the limits over $D$ and $k$ exist. (When the process $\{\mathbf{X}_{t}\}$ is stationary, the limit over $k$ always exists [14, Th. 9.8.1].) If the limits do not exist, then we define the upper and lower rate-distortion dimension $\overline{\mathsf{dim}}_{R}(\{\mathbf{X}_{t}\})$ and $\underline{\mathsf{dim}}_{R}(\{\mathbf{X}_{t}\})$ by replacing the limits with the limits superior and limits inferior, respectively.

Intuitively, the rate-distortion function

$$R(D)\triangleq\lim_{k\to\infty}\frac{R(\mathbf{X}_{1}^{k},kD)}{k}$$

corresponds to the minimum number of nats per source symbol required to compress a stationary and ergodic source $\{\mathbf{X}_{t}\}$ with a vector quantizer of average per-symbol distortion not exceeding $D$ [14, Sec. 9.8]. The rate-distortion dimension characterizes the growth of $R(D)$ as $D$ vanishes. For example, for an i.i.d. Gaussian source with variance $\sigma^{2}$ , we have [10, Th. 13.3.2]

$$R(D)=\frac{1}{2}\log\left(\frac{\sigma^{2}}{D}\right)\mathbf{1}\!\left\{D<\sigma^{2}\right\}$$

where $\mathbf{1}\!\left\{\cdot\right\}$ denotes the indicator function. Observe that in this case $R(D)$ grows like $\frac{1}{2}\log(1/D)$ as $D\to 0$. The rate-distortion dimension corresponds to twice the pre-log factor of the rate-distortion function $R(D)$, which in this case is $1$.
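The pre-log behavior can be checked numerically. The sketch below evaluates the textbook rate-distortion function of an i.i.d. Gaussian source under squared-error distortion, $R(D)=\frac{1}{2}\log(\sigma^{2}/D)$ for $D<\sigma^{2}$ and zero otherwise, and the ratio $2R(D)/(-\log D)$. The function names and the choice $\sigma^{2}=4$ are illustrative assumptions.

```python
import math


def gaussian_rate_distortion(distortion: float, variance: float = 1.0) -> float:
    """R(D) in nats per symbol for an i.i.d. Gaussian source, squared error."""
    if distortion >= variance:
        return 0.0
    return 0.5 * math.log(variance / distortion)


def rd_dimension_ratio(distortion: float, variance: float = 1.0) -> float:
    """The ratio 2 R(D) / (-log D) whose limit as D -> 0 is the rate-distortion dimension."""
    return 2.0 * gaussian_rate_distortion(distortion, variance) / (-math.log(distortion))


for d in (1e-2, 1e-6, 1e-20, 1e-100):
    # The ratio equals 1 + log(variance) / (-log D), which tends to 1.
    print(d, rd_dimension_ratio(d, variance=4.0))
```

The variance only enters through a term that vanishes after division by $-\log D$, which is why the limit does not depend on $\sigma^{2}$.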

In contrast, the information dimension rate characterizes the growth of the entropy rate $H^{\prime}(\{[\mathbf{X}_{t}]_{m}\})$ as $m$ increases. This entropy rate, in turn, corresponds essentially to the number of nats per source symbol required to compress each symbol $\mathbf{X}_{t}$ of a stationary and ergodic source $\{\mathbf{X}_{t}\}$ with a uniform quantizer of step size $1/m$ . Since a symbol-wise, uniform quantizer cannot outperform the best vector quantizer, it follows that the information dimension rate is lower-bounded by the rate-distortion dimension.

For RVs, Kawabata and Dembo showed that the rate-distortion dimension of an RV is actually equal to its information dimension [2, Prop. 3.3]. Thus, a symbol-wise, uniform quantizer achieves the same information dimension as the best vector quantizer. The following theorem generalizes this result to stochastic processes.

Theorem 9

For any $L$ -variate, real-valued process $\{\mathbf{X}_{t}\}$ ,

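$$\overline{\underline{\mathsf{dim}}}_{R}(\{\mathbf{X}_{t}\})=\overline{\underline{d}}(\{\mathbf{X}_{t}\}).$$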

Proof:

See Appendix D. ∎

Note that Theorem 9 also holds for non-stationary processes.
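For an i.i.d. process, the quantized process is again i.i.d., so its entropy rate is simply the entropy of the quantized marginal. The following Python sketch uses this to illustrate the quantizer side of the equivalence for a standard Gaussian marginal: it computes $H([X]_{m})$ by summing over the quantization cells and compares it with $\log m+\frac{1}{2}\log(2\pi e)$, the expansion one expects for a density with differential entropy $\frac{1}{2}\log(2\pi e)$. The truncation to $[-10,10]$ and the function names are assumptions made for this illustration.

```python
import math


def gaussian_cdf(x: float) -> float:
    """CDF of the standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def quantized_gaussian_entropy(m: int, span: float = 10.0) -> float:
    """H([X]_m) in nats for X ~ N(0, 1), summing over cells [i/m, (i+1)/m).

    Cells outside [-span, span] are ignored; their total mass is negligible."""
    limit = int(span * m)
    h = 0.0
    for i in range(-limit, limit):
        p = gaussian_cdf((i + 1) / m) - gaussian_cdf(i / m)
        if p > 0.0:
            h -= p * math.log(p)
    return h


differential_entropy = 0.5 * math.log(2.0 * math.pi * math.e)
for m in (4, 16, 64, 256):
    h = quantized_gaussian_entropy(m)
    # Ratio tends to 1 (the information dimension); the offset tends to h(X).
    print(m, h / math.log(m), h - math.log(m), differential_entropy)
```

The ratio $H([X]_{m})/\log m$ approaches $1$ only at speed $1/\log m$, while the offset $H([X]_{m})-\log m$ settles quickly.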

IV-C Information Dimension Rate of Finite-Variance Processes

Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$ -variate, real-valued process with mean vector ${\boldsymbol{\mu}}$ and (matrix-valued) spectral distribution function (SDF) $\theta\mapsto\mathsf{F}_{\mathbf{X}}(\theta)$ . Thus, $\mathsf{F}_{\mathbf{X}}$ is a bounded, non-decreasing, and right-continuous function on $[-1/2,1/2]$ such that the autocovariance function

[TABLE]

is given by the Lebesgue-Stieltjes integral [15, (7.3), p. 141]

[TABLE]

It follows that the $(i,j)$ -th element of $\mathsf{F}_{\mathbf{X}}$ is the cross SDF $\theta\mapsto\mathsf{F}_{X_{i}X_{j}}(\theta)$ of the component processes $\{X_{i,t}\}$ and $\{X_{j,t}\}$ , i.e.,

[TABLE]

where

[TABLE]

denotes the cross-covariance function. It further follows that the diagonal elements of $\mathsf{F}_{\mathbf{X}}$ are real and non-decreasing, and they satisfy $\mathsf{F}_{X_{i}}(1/2)-\mathsf{F}_{X_{i}}(-1/2)=\sigma_{i}^{2}$ , where $\sigma_{i}$ denotes the standard deviation of $X_{i,t}$ . It can be shown that $\theta\mapsto\mathsf{F}_{\mathbf{X}}(\theta)$ has a derivative almost everywhere, which has positive semi-definite, Hermitian values [15, (7.4), p. 141]. We shall denote the derivative of $\mathsf{F}_{\mathbf{X}}$ by $\mathsf{F}^{\prime}_{\mathbf{X}}$ . When $\mathsf{F}_{\mathbf{X}}$ is absolutely continuous w.r.t. the Lebesgue measure, its derivative $\mathsf{F}^{\prime}_{\mathbf{X}}$ coincides with the PSD $\mathsf{S}_{\mathbf{X}}$ of $\{\mathbf{X}_{t}\}$ .

The following theorem shows that, among all processes of a given SDF, the Gaussian process maximizes the information dimension rate. It further characterizes the information dimension rate of such processes in terms of the SDF.

Theorem 10

Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$ -variate, real-valued process with SDF $\mathsf{F}_{\mathbf{X}}$ . Then,

[TABLE]

with equality if $\{\mathbf{X}_{t}\}$ is Gaussian.

Proof:

See Appendix E. ∎

In order to prove Theorem 10, we invoke Bussgang’s theorem to obtain an expression for the SDF of a quantized Gaussian process $\{[\mathbf{X}_{t}]_{m}\}$ as a function of the SDF of the original process $\{\mathbf{X}_{t}\}$ . Since we believe that this result is interesting on its own, we present it below.

Lemma 11

Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$ -variate, real-valued, Gaussian process with mean vector ${\boldsymbol{\mu}}$ and SDF $\mathsf{F}_{\mathbf{X}}$ . Then, the $(i,j)$ -th entry of the SDF $\theta\mapsto\mathsf{F}_{[\mathbf{X}]_{m}}(\theta)$ of $\{[\mathbf{X}_{t}]_{m}\}$ satisfies

[TABLE]

where $N_{i,t}\triangleq X_{i,t}-[X_{i,t}]_{m}$ and

[TABLE]

(In (45), $\mu_{i}$ and $\sigma_{i}$ denote the mean and standard deviation of $X_{i,t}$ .) For every $i=1,\dots,L$ , we have

[TABLE]

and

[TABLE]

Moreover, if all component processes have zero mean and unit variance, then $a_{1}=\ldots=a_{L}$ and

[TABLE]

Proof:

See Appendix F. ∎

As a corollary to Theorem 10, we obtain that for univariate, stationary, Gaussian processes with PSD $\mathsf{S}_{X}$, the information dimension rate is equal to the Lebesgue measure of the set of harmonics on $[-1/2,1/2]$ where $\mathsf{S}_{X}$ is positive, i.e.,

$$d(\{X_{t}\})=\lambda(\{\theta\colon\mathsf{S}_{X}(\theta)>0\}) \tag{49}$$

where $\lambda(\cdot)$ denotes the Lebesgue measure. As pointed out by one of the reviewers, (49) can also be obtained directly by using the equivalence of information dimension rate and rate-distortion dimension (Theorem 9) together with the parametric representation of the rate-distortion function [14, eqs. (9.7.42) & (9.7.43)]

[TABLE]

for $\beta>0$ , where $\mathcal{B}_{\beta}\triangleq\{\theta\in[-1/2,1/2]\colon\mathsf{S}_{X}(\theta)>\beta\}$ . Indeed, when $\lambda(\mathcal{B}_{0})$ is zero, we have $d(\{X_{t}\})=0$ since in this case the process $\{X_{t}\}$ has zero variance and, hence, the entropy rate of the quantized process $\{[X_{t}]_{m}\}$ is zero, too. When $\lambda(\mathcal{B}_{0})$ is strictly positive, the distortion $D_{\beta}$ can be bounded as

[TABLE]

It follows by the continuity of the Lebesgue measure that $D_{\beta}/\beta\to\lambda(\mathcal{B}_{0})$ as $\beta\to 0$ . Consequently, $D_{\beta}\to 0$ if, and only if, $\beta\to 0$ and the rate-distortion dimension can be written as

[TABLE]

By the continuity of the Lebesgue measure, for every $\varepsilon>0$ there exists a $\beta^{\prime}\in(0,1)$ such that $\lambda(\mathcal{B}_{\beta^{\prime}})\geq\lambda(\mathcal{B}_{0})-\varepsilon$ . Since $\mathcal{B}_{\beta}\subseteq\mathcal{B}_{0}$ , it follows that

[TABLE]

Thus, for every $0<\beta<\beta^{\prime}<1$ ,

[TABLE]

Dividing both sides of (55) by $-\log\beta$ , and letting first $\beta$ and then $\varepsilon$ tend to zero, we obtain that the second term on the RHS of (53) is nonnegative. However, by assumption the process $\{X_{t}\}$ has finite variance, so its PSD $\mathsf{S}_{X}$ is integrable over $[-1/2,1/2]$ . Consequently, using the inequality $\log x\leq x-1$ and the nonnegativity of $\mathsf{S}_{X}$ , we obtain that

[TABLE]

Dividing both sides of (56) by $-\log\beta$ , and letting $\beta$ tend to zero, we obtain that the second term on the RHS of (53) is also nonpositive. We conclude that this term is zero, so (49) follows from (53) and Theorem 9.
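The argument above can be illustrated numerically. The sketch below assumes the standard reverse water-filling form of the parametric representation, $D_{\beta}=\int\min\{\beta,\mathsf{S}_{X}(\theta)\}\,\mathrm{d}\theta$ and $R(D_{\beta})=\frac{1}{2}\int_{\mathcal{B}_{\beta}}\log(\mathsf{S}_{X}(\theta)/\beta)\,\mathrm{d}\theta$, and a hypothetical PSD that is positive on $(-1/4,1/4)$ and zero elsewhere, so that $\lambda(\mathcal{B}_{0})=1/2$. Both the PSD and the midpoint-rule integration are assumptions chosen for illustration.

```python
import math


def psd(theta: float) -> float:
    """Hypothetical PSD: positive on (-1/4, 1/4), zero elsewhere on [-1/2, 1/2]."""
    if abs(theta) < 0.25:
        return 1.0 + math.cos(4.0 * math.pi * theta)
    return 0.0


def reverse_water_filling(beta: float, n: int = 20000):
    """Midpoint-rule approximations of (D_beta, R(D_beta), lambda(B_0))."""
    distortion = rate = support = 0.0
    for i in range(n):
        theta = -0.5 + (i + 0.5) / n
        s = psd(theta)
        distortion += min(beta, s)
        if s > 0.0:
            support += 1.0
        if s > beta:  # theta lies in B_beta
            rate += 0.5 * math.log(s / beta)
    return distortion / n, rate / n, support / n


for beta in (1e-3, 1e-6, 1e-12):
    d_beta, r_beta, lam = reverse_water_filling(beta)
    # 2 R(D_beta) / (-log D_beta) slowly approaches lambda(B_0) = 1/2.
    print(beta, d_beta / beta, 2.0 * r_beta / (-math.log(d_beta)), lam)
```

The second printed column shows $D_{\beta}/\beta$ approaching $\lambda(\mathcal{B}_{0})$, and the third shows the rate-distortion ratio drifting toward the same value as $\beta$ decreases.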

We observe from Theorem 10 that the information dimension rate of a Gaussian process $\{\mathbf{X}_{t}\}$ depends only on the derivative of its SDF $\mathsf{F}_{\mathbf{X}}$ , which coincides almost everywhere with the derivative of the absolutely-continuous part of $\mathsf{F}_{\mathbf{X}}$ . Indeed, any SDF $\mathsf{F}_{\mathbf{X}}$ can be decomposed as [15, (4.3), p. 124]

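$$\mathsf{F}_{\mathbf{X}}(\theta)=\mathsf{F}_{\mathbf{X},a}(\theta)+\mathsf{F}_{\mathbf{X},d}(\theta)+\mathsf{F}_{\mathbf{X},s}(\theta) \tag{57}$$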

where $\mathsf{F}_{\mathbf{X},a}$ is absolutely continuous w.r.t. the Lebesgue measure, $\mathsf{F}_{\mathbf{X},d}$ is discrete, and $\mathsf{F}_{\mathbf{X},s}$ is singular. Furthermore, $\mathsf{F}^{\prime}_{\mathbf{X}}(\theta)=\mathsf{F}^{\prime}_{\mathbf{X},a}(\theta)$ almost everywhere [15, Sec. 4]. Consequently, the information dimension rate of a Gaussian process depends only on the absolutely-continuous part of its SDF. By combining (57) with Theorem 10 and Lemma 8, we can show that the same is true for non-Gaussian processes.

Corollary 12

Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$ -variate, real-valued process with SDF $\mathsf{F}_{\mathbf{X}}$ , and let $\{\mathbf{X}_{t,a}\}$ be a stationary, $L$ -variate, real-valued process with SDF $\mathsf{F}_{\mathbf{X},a}$ , where $\mathsf{F}_{\mathbf{X},a}$ is the absolutely-continuous part of $\mathsf{F}_{\mathbf{X}}$ , cf. (57). Then

[TABLE]

Proof:

Combining the decomposition (57) with the spectral representation of stationary processes [16, Sec. 4.11], it can be shown that every stationary process can be written as

$$\mathbf{X}_{t}=\mathbf{X}_{t,a}+\mathbf{X}_{t,d}+\mathbf{X}_{t,s}$$

where $\{\mathbf{X}_{t,a}\}$ , $\{\mathbf{X}_{t,d}\}$ , and $\{\mathbf{X}_{t,s}\}$ are stationary, mutually uncorrelated, stochastic processes with the respective SDFs $\mathsf{F}_{\mathbf{X},a}$ , $\mathsf{F}_{\mathbf{X},d}$ , and $\mathsf{F}_{\mathbf{X},s}$ ; see [16, p. 758] and references therein. Since $\mathsf{F}^{\prime}_{\mathbf{X},d}$ and $\mathsf{F}^{\prime}_{\mathbf{X},s}$ are zero almost everywhere [15, Sec. 4], we obtain from Theorem 10 and the nonnegativity of the information dimension rate (Lemma 6) that

$$d(\{\mathbf{X}_{t,d}\})=d(\{\mathbf{X}_{t,s}\})=0. \tag{60}$$

Corollary 12 follows by applying Lemma 8 first together with (60) to show that

[TABLE]

and then together with (61) to show that

[TABLE]

∎

IV-D Information Dimension Rate of Complex-Valued Processes

So far, we have considered real-valued stochastic processes. However, every complex-valued RV can be written as a two-dimensional, real-valued, random vector, so the previous results directly generalize to the complex case. In particular, one can define the information dimension rate of the $L$ -variate, complex-valued process $\{\mathbf{Z}_{t}\}$ as the information dimension rate of the $(2L)$ -variate, real-valued process $\{\hat{\mathbf{X}}_{t}\}$ that follows by stacking the real part of $\mathbf{Z}_{t}$ on top of the imaginary part of $\mathbf{Z}_{t}$ .
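The stacking operation can be sketched in a few lines of Python. The function name and the list-based representation of an $L$-variate sample are assumptions made for illustration.

```python
def stack_real_imag(z):
    """Map an L-variate complex sample to the (2L)-variate real vector
    obtained by stacking the real parts on top of the imaginary parts."""
    return [c.real for c in z] + [c.imag for c in z]


sample = [1 + 2j, 3 - 4j]       # one hypothetical sample with L = 2
print(stack_real_imag(sample))  # real parts first, then imaginary parts
```

Applying this map to every time instant turns an $L$-variate complex-valued process into a $(2L)$-variate real-valued one, to which the real-valued definitions apply unchanged.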

Let $\{\mathbf{Z}_{t}\}$ be a stationary, $L$ -variate, complex-valued process with mean vector ${\boldsymbol{\mu}}$ and matrix-valued SDF $\mathsf{F}_{\mathbf{Z}}$ , i.e.,

[TABLE]

where

[TABLE]

is the autocovariance function. We say that a stationary, $L$ -variate, complex-valued process $\{\mathbf{Z}_{t}\}$ is proper if it has finite variance, its mean vector is the zero vector, and its pseudo-autocovariance function satisfies

[TABLE]

The following result generalizes Theorem 10 to complex-valued stochastic processes.

Theorem 13

Let $\{\mathbf{Z}_{t}\}$ be a stationary, $L$ -variate, complex-valued process with matrix-valued SDF $\mathsf{F}_{\mathbf{Z}}$ . Then,

[TABLE]

with equality if $\{\mathbf{Z}_{t}\}$ is Gaussian and proper.

Proof:

See Appendix G. ∎

Note that neither Gaussianity nor properness is sufficient for equality in Theorem 13. Conversely, Gaussianity and properness are not necessary for equality. For example, any univariate stationary Gaussian process achieves (66) with equality if its real and imaginary components are independent and if the derivatives of their SDFs have matching support.

V Another Definition of Information Dimension

Jalali and Poor [7] proposed a different definition for the information dimension of a univariate stochastic process. We shall refer to this information dimension as the block-average information dimension and denote it by $d^{\prime}(\{X_{t}\})$ . In this section, we discuss scenarios in which the information dimension rate (Definition 2) coincides with and differs from the block-average information dimension. For ease of exposition, in this section we follow [7] and restrict our attention to univariate real-valued processes.

The following definition for the information dimension of stochastic processes was proposed in [7].

Definition 4

The block-average information dimension of the stochastic process $\{X_{t}\}$ is defined as

[TABLE]

provided the limits exist. If the limits do not exist, then one can define the upper and lower block-average information dimension $\overline{d}^{\prime}(\{X_{t}\})$ and $\underline{d}^{\prime}(\{X_{t}\})$ by replacing the limits by limits superior and limits inferior, respectively.

In the following, we restrict ourselves to stationary processes, in which case the limit over $k$ in (67) is guaranteed to exist. We refer to $d^{\prime}(\{X_{t}\})$ as the block-average information dimension because it was shown in [7, Lemma 3] that, if $\{X_{t}\}$ is stationary and the information dimension $d(X_{1}^{k})$ exists for every $k$ , then

$$d^{\prime}(\{X_{t}\})=\lim_{k\to\infty}\frac{d(X_{1}^{k})}{k}.$$

If $d(X_{1}^{k})$ does not exist, then the proof of [7, Lemma 3] reveals that

[TABLE]

Since conditioning reduces entropy, it follows immediately that

$$\overline{\underline{d}}^{\prime}(\{X_{t}\})\leq\overline{\underline{d}}(X_{t}). \tag{70}$$

Thus, like the information dimension rate, the block-average information dimension of the stochastic process $\{X_{t}\}$ cannot exceed the information dimension of the marginal RV $X_{t}$ .

While the entropy rate $H^{\prime}(\{X_{t}\})$ of a stationary process $\{X_{t}\}$ can alternatively be written as the conditional entropy of $X_{1}$ given $X_{-\infty}^{0}$ , cf. (4), the block-average information dimension $d^{\prime}(\{X_{t}\})$ does, in general, not permit a similar expression. In fact, let

[TABLE]

provided the limit over $m$ exists. (Since conditioning reduces entropy, the limit over $k$ always exists.) The upper and lower information dimensions $\bar{d}(X_{1}|X_{-\infty}^{0})$ and $\underline{d}(X_{1}|X_{-\infty}^{0})$ are defined analogously by replacing the limit over $m$ by the limit superior and limit inferior, respectively. Then, we have that

$$\overline{\underline{d}}(X_{1}|X_{-\infty}^{0})\leq\overline{\underline{d}}^{\prime}(\{X_{t}\})$$

where the inequality can be strict; see Theorem 14 and Example 4 below.

V-A Block-Average Information Dimension vs. Information Dimension Rate

We next demonstrate that, for $\psi^{*}$ -mixing processes, the information dimension rate $d(\{X_{t}\})$ coincides with the block-average information dimension $d^{\prime}(\{X_{t}\})$ . However, in general the two definitions do not coincide, but there exists an ordering between them.

Theorem 14

Let $\{X_{t}\}$ be a stationary process. Then,

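$$\overline{\underline{d}}(X_{1}|X_{-\infty}^{0})\leq\overline{\underline{d}}(\{X_{t}\})\leq\overline{\underline{d}}^{\prime}(\{X_{t}\}). \tag{73}$$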

Moreover,

[TABLE]

where the limits over $k$ exist because, by the stationarity of $\{X_{t}\}$ , the mutual information $I([X_{k}]_{m};[X_{-\infty}^{0}]_{m}|[X_{1}^{k-1}]_{m})$ is monotonically decreasing in $k$ .

Proof:

See Appendix H-A. ∎

The inequalities in (74) imply that, if the limits over $m$ exist, then

[TABLE]

is a necessary and sufficient condition for the equality of $\overline{\underline{d}}(\{X_{t}\})$ and $\overline{\underline{d}}^{\prime}(\{X_{t}\})$ . Note that, for every $m=2,3,\ldots$ , we have [17, eq. (8.9)]

[TABLE]

Thus, (75) is satisfied for processes $\{X_{t}\}$ that allow us to change the order of taking limits as $k$ and $m$ tend to infinity. However, in general (75) is difficult to check. We next present a sufficient condition that is easier to verify.

Corollary 15

Let $\{X_{t}\}$ be a stationary process. Assume that there exists a nonnegative integer $n$ such that

[TABLE]

Then, $\overline{\underline{d}}(\{X_{t}\})=\overline{\underline{d}}^{\prime}(\{X_{t}\})$ .

Proof:

See Appendix H-B. ∎

Condition (77) holds for $\psi^{*}$ -mixing processes. Indeed, since every $\psi^{*}$ -mixing process satisfies (6), it follows that one can find an $n$ such that $I(X_{1}^{\infty};X_{-\infty}^{-n})<\infty$ . The condition (77) holds then by the data processing inequality.

If (77) holds for $n=0$ , then we even have that

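$$\overline{\underline{d}}(X_{1}|X_{-\infty}^{0})=\overline{\underline{d}}(\{X_{t}\})=\overline{\underline{d}}^{\prime}(\{X_{t}\})=\overline{\underline{d}}(X_{t}). \tag{78}$$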

Thus, in this case all presented generalizations of information dimension to stochastic processes coincide with the information dimension of the marginal RV. To prove (78), we note that (77) with $n=0$ gives

[TABLE]

It then follows by the data processing inequality that

[TABLE]

Consequently,

[TABLE]

if the limit exists. In general, we have $\overline{\underline{d}}(X_{1}|X_{-\infty}^{0})=\overline{\underline{d}}(X_{t})$ . The claim (78) follows then by (73) and because, by (70), $\overline{\underline{d}}^{\prime}(\{X_{t}\})\leq\overline{\underline{d}}(X_{t})$ .

Condition (77) with $n=0$ is satisfied, for example, if $\{X_{t}\}$ is a sequence of i.i.d. RVs, if it is a discrete-valued stochastic process with finite marginal entropy, or if it is a continuous-valued stochastic process with finite marginal differential entropy and finite differential entropy rate.

In the following, we present two examples of processes $\{X_{t}\}$ for which $\overline{\underline{d}}(X_{1}|X_{-\infty}^{0})=\overline{\underline{d}}(\{X_{t}\})=\overline{\underline{d}}^{\prime}(\{X_{t}\})$ . As we shall argue, neither of these examples satisfies (77), hence (77) is sufficient but not necessary.

Example 2

Let $\{B_{t}\}$ be a sequence of i.i.d. Bernoulli- $\rho$ RVs, i.e., $P_{B_{t}}(0)=1-P_{B_{t}}(1)=\rho$ , and let $\{Y_{t}\}$ be a sequence of i.i.d. RVs with PDF $f_{Y}$ supported on $[0,1]$ and finite differential entropy. By (13), we thus have that $d(Y_{t})=1$ for every $t$ . We define the stochastic process $\{X_{t}\}$ as

$$X_{t}=\begin{cases}Y_{t}, & B_{t}=0\\ X_{t-1}, & B_{t}=1\end{cases} \tag{82}$$

and assume that $X_{t}$ has the same marginal distribution as $Y_{t}$ . Note that $\{X_{t}\}$ is first-order Markov, so

[TABLE]

Furthermore, [7, Th. 3] demonstrates that $d^{\prime}(\{X_{t}\})=\rho$ . Thus, together with (73), this yields that

$$d(X_{1}|X_{-\infty}^{0})=d(\{X_{t}\})=d^{\prime}(\{X_{t}\})=\rho. \tag{84}$$

The stochastic process $\{X_{t}\}$ , as defined by (82), satisfies (75) but not (77). Indeed, for every nonnegative integer $n$ , we have $I(X_{1};X_{-n})=\infty$ , since $X_{1}$ has finite differential entropy and the event $X_{1}=X_{-n}$ has positive probability. It follows that $I(X_{1}^{k};X_{-\infty}^{-n})=\infty$ for every $k$ and $n$ , so (77) is violated. In contrast, we have

[TABLE]

since conditioning on the binary random variable $B_{k}$ changes mutual information by at most one bit. If $B_{k}=1$ , then $[X_{k}]_{m}=[X_{k-1}]_{m}$ ; if $B_{k}=0$ , then $[X_{k}]_{m}=[Y_{k}]_{m}$ , which is independent of $[X_{-\infty}^{k-1}]_{m}$ . In both cases, the conditional mutual information between $[X_{k}]_{m}$ and $[X_{-\infty}^{0}]_{m}$ given $[X_{1}^{k-1}]_{m}$ is zero, so (75) is satisfied.
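The mechanism behind this example can be simulated directly. The Python sketch below generates a sample path in which, at each time, a fresh continuous value is drawn with probability $\rho$ (the event $B_{t}=0$) and the previous value is repeated otherwise (the event $B_{t}=1$). Taking $f_{Y}$ to be uniform on $[0,1]$, the seed, and the function names are assumptions made for illustration. The empirical fraction of fresh samples is close to $\rho$, which matches the intuition behind the value $\rho$ obtained above; the simulation is a heuristic illustration, not a computation of any information dimension.

```python
import random


def simulate_repeat_process(rho: float, n: int, seed: int = 0):
    """Simulate X_t = Y_t if B_t = 0 (probability rho), X_t = X_{t-1} if B_t = 1.

    Y_t is taken to be Uniform[0, 1] (an illustrative choice of f_Y).
    Returns the path x_0, ..., x_n."""
    rng = random.Random(seed)
    path = [rng.random()]  # X_0 ~ f_Y, so every X_t has marginal f_Y
    for _ in range(n):
        if rng.random() < rho:         # B_t = 0: draw a fresh continuous value
            path.append(rng.random())
        else:                          # B_t = 1: repeat the previous value
            path.append(path[-1])
    return path


def fraction_of_changes(path) -> float:
    """Fraction of time instants at which the path takes a new value."""
    changes = sum(1 for a, b in zip(path, path[1:]) if a != b)
    return changes / (len(path) - 1)


path = simulate_repeat_process(rho=0.3, n=100_000)
print(fraction_of_changes(path))  # close to rho = 0.3
```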

Example 3

Let the process $\{\tilde{X}_{t}\}$ be periodic with period $P\in\mathbb{N}$ and have finite marginal differential entropy. Further let $\Delta$ be uniformly distributed on $\{0,\dots,P-1\}$ . Then, the shifted process $\{X_{t}\}$ , defined by

[TABLE]

is stationary [18, Th. 10-5] and has finite marginal differential entropy. For every $k=P,P+1,\ldots$ and $m=2,3,\ldots$ , we have that $H([X_{1}]_{m}|X_{-k+1}^{0})=0$ and $H([X_{1}^{k}]_{m})=H([X_{1}^{P}]_{m})$ , hence

[TABLE]

As in the previous example, the stochastic process $\{X_{t}\}$ satisfies (75) but not (77). Indeed, for every nonnegative integer $n$, we have $I(X_{1}^{k};X_{-\infty}^{-n})=\infty$ since $X_{1}$ has finite differential entropy and the process is periodic. In contrast, $[X_{k}]_{m}=[X_{k-P}]_{m}$, so the conditional mutual information between $[X_{k}]_{m}$ and $[X_{-\infty}^{0}]_{m}$ given $[X_{1}^{k-1}]_{m}$ is zero when $k=P+1,P+2,\ldots$.

In many cases, the inequalities in Theorem 14 can be strict. The following example shows such a strict inequality for the class of stationary Gaussian processes $\{X_{t}\}$ with PSD supported on a set of positive Lebesgue measure. [Footnote 3: The assumption that $\{X_{t}\}$ has a PSD is made for notational convenience and is not essential. All steps in Example 4 continue to hold if we replace $\mathsf{S}_{X}$ by the derivative of the SDF $\mathsf{F}_{X}$.]

Example 4

Let $\{X_{t}\}$ be a stationary Gaussian process with zero mean, variance $\sigma^{2}$ , and PSD $\mathsf{S}_{X}$ having support $\mathcal{B}_{0}$ . It follows from Theorem 10 that

$$d(\{X_{t}\})=\lambda(\mathcal{B}_{0}).$$

We next argue that if $0<\lambda(\mathcal{B}_{0})<1$ then $d^{\prime}(\{X_{t}\})=1$ and $d(X_{1}|X_{-\infty}^{0})=0$ . Consequently,

$$d(X_{1}|X_{-\infty}^{0})<d(\{X_{t}\})<d^{\prime}(\{X_{t}\}).$$

To show that $d^{\prime}(\{X_{t}\})=1$ , we note that

[TABLE]

where the inequality follows by the stationarity of $\{X_{t}\}$ ; because conditioning reduces entropy; and because, conditioned on $X_{-k}^{-1}$ , $[X_{0}]_{m}$ is independent of $[X_{-k}^{-1}]_{m}$ . Since $\{X_{t}\}$ is Gaussian, it follows that, conditioned on $X_{-k}^{-1}$ , the RV $X_{0}$ is Gaussian with mean $\mathsf{E}[X_{0}|X_{-k}^{-1}]$ and variance $\sigma_{k}^{2}$ , which is independent of $X_{-k}^{-1}$ . It can be further shown that if $\lambda(\mathcal{B}_{0})>0$ , then $\sigma_{k}^{2}>0$ for every finite $k$ (see Lemma 16 below). It follows that, conditioned on any $X_{-k}^{-1}=x_{-k}^{-1}$ , the RV $X_{0}$ has a PDF, so by (13)

[TABLE]

Together with Fatou’s lemma, this shows that the RHS of (90) is $1$ , hence $d^{\prime}(\{X_{t}\})=1$ .

To demonstrate that $d(X_{1}|X_{-\infty}^{0})=0$ , we note that $\lambda(\mathcal{B}_{0})<1$ implies that

$$\int_{-1/2}^{1/2}\log\mathsf{S}_{X}(\theta)\,\mathrm{d}\theta=-\infty.$$

This is a necessary and sufficient condition for $\sigma_{k}^{2}\to 0$ as $k\to\infty$ ; see, e.g., [19, Sec. 10.6]. Intuitively, the fact that $\sigma_{k}^{2}\to 0$ implies that the conditional distribution of $X_{1}$ given $X_{-\infty}^{0}$ is almost surely degenerate, hence $d(X_{1}|X_{-\infty}^{0})=0$ . To prove this rigorously, we apply [13, Lemma 30] together with the fact that conditioning reduces entropy to upper-bound

[TABLE]

Expressing $X_{1}-\mathsf{E}[X_{1}|X_{-k}^{0}]$ as $\sigma_{k+1}Z$ , where $Z$ is zero-mean, unit-variance Gaussian, the RHS of (93) can be written as $H(\lfloor m\sigma_{k+1}Z\rfloor)+\log 2$ . Since $\sigma_{k}^{2}\to 0$ as $k\to\infty$ , we obtain from [3, Lemma 1] that

[TABLE]

Consequently, the claim follows from the definition of $d(X_{1}|X_{-\infty}^{0})$ .

Lemma 16

Let $\{X_{t}\}$ be a stationary, univariate, real-valued, Gaussian process with zero mean, variance $\sigma^{2}$ , and SDF $\mathsf{F}_{X}$ . Suppose that $\sigma_{k}^{2}=0$ for some finite $k$ . Then,

[TABLE]

Proof:

See Appendix H-C. ∎
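The behavior of the prediction-error variance $\sigma_{k}^{2}$ in Example 4 can be observed numerically. The Python sketch below runs the Levinson-Durbin recursion on the autocovariance of a hypothetical unit-variance process with flat PSD $\mathsf{S}_{X}(\theta)=2$ on $[-1/4,1/4]$ and zero elsewhere, for which $K(\tau)=2\sin(\pi\tau/2)/(\pi\tau)$. The PSD and the function names are assumptions chosen for illustration. Since the process is Gaussian, the linear prediction error computed here coincides with the conditional variance. The computed values remain positive for every finite $k$ but decay toward zero.

```python
import math


def autocov(tau: int) -> float:
    """Autocovariance for the hypothetical flat PSD S(theta) = 2 on [-1/4, 1/4]."""
    if tau == 0:
        return 1.0
    return 2.0 * math.sin(math.pi * tau / 2.0) / (math.pi * tau)


def prediction_error_variances(k_max: int):
    """Levinson-Durbin recursion.

    Returns [sigma_0^2, ..., sigma_{k_max}^2], where sigma_k^2 is the
    mean-squared error of the best linear predictor of X_0 from the
    previous k samples (sigma_0^2 is the variance of X_0)."""
    errors = [autocov(0)]
    coeffs = []  # coeffs[j] multiplies the sample at lag j + 1
    for k in range(1, k_max + 1):
        acc = autocov(k) - sum(c * autocov(k - 1 - j) for j, c in enumerate(coeffs))
        kappa = acc / errors[-1]  # reflection coefficient of order k
        order = len(coeffs)
        coeffs = [coeffs[j] - kappa * coeffs[order - 1 - j] for j in range(order)] + [kappa]
        errors.append(errors[-1] * (1.0 - kappa * kappa))
    return errors


for k, err in enumerate(prediction_error_variances(12)):
    print(k, err)
```

The recursion becomes numerically delicate for large $k$ because the underlying Toeplitz matrices are nearly singular, so the sketch stops at a small order.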

V-B Block-Average Information Dimension vs. Rate-Distortion Dimension

The connection between the block-average information dimension and the rate-distortion dimension of a stochastic process was studied in [8]. The equivalence between the rate-distortion dimension and the information dimension [2, Prop. 3.3] directly implies that

[TABLE]

Rezagah et al. [8] demonstrated that the order of the limits on the RHS of (96) can be exchanged. More precisely, [8, Th. 2] states that if $\lim_{D\to 0}\frac{R(X_{1}^{k},kD)}{-k\log D}$ exists for all $k$ , then

[TABLE]

This may appear as a contradiction to our results, since we demonstrate in Theorem 9 that $\overline{\underline{\mathsf{dim}}}_{R}(\{X_{t}\})=\overline{\underline{d}}(\{X_{t}\})$ , and Example 4 demonstrates that there are stochastic processes for which $\overline{\underline{d}}(\{X_{t}\})<\overline{\underline{d}}^{\prime}(\{X_{t}\})$ . However, the proof of (97) relies on the fact that [8, Sec. VI-E]

[TABLE]

and that the RHS of (98) vanishes as $k\to\infty$ . If (79) holds, then this is indeed the case; see [17, eqs. (8.6)–(8.10)]. As shown in Corollary 15, in this case we also have that $\overline{\underline{d}}^{\prime}(\{X_{t}\})=\overline{\underline{d}}(\{X_{t}\})$ . In fact, as discussed after Corollary 15, in this case all presented generalizations of information dimension to stochastic processes coincide with the information dimension of the marginal RV. In contrast, if (79) does not hold then, by the data processing inequality, the RHS of (98) is infinite. This is, for example, the case if $\{X_{t}\}$ is a stationary process with positive variance and a PSD that is zero on a set of positive Lebesgue measure, since for such processes the differential entropy $h(X_{1}|X_{-\infty}^{0})$ is $-\infty$ . Our proof of Theorem 9 does not rely on (98). We thus conclude that $\overline{\underline{\mathsf{dim}}}_{R}(\{X_{t}\})=\overline{\underline{d}}(\{X_{t}\})$ for every stochastic process $\{X_{t}\}$ , but that $\overline{\underline{\mathsf{dim}}}_{R}(\{X_{t}\})=\overline{\underline{d}}^{\prime}(\{X_{t}\})$ only for those processes for which $\overline{\underline{d}}^{\prime}(\{X_{t}\})=\overline{\underline{d}}(\{X_{t}\})$ .

VI Operational Characterizations

Information dimension was recently given an operational characterization in almost lossless data compression [4]. Specifically, Wu and Verdú defined the minimum $\epsilon$ -achievable rate $R(\epsilon)$ to be the minimum of $R>0$ such that there exists a sequence of encoders $f_{k}\colon\mathbb{R}^{k}\to\mathbb{R}^{\lfloor Rk\rfloor}$ and decoders $g_{k}\colon\mathbb{R}^{\lfloor Rk\rfloor}\to\mathbb{R}^{k}$ satisfying [4, Def. 4]

[TABLE]

for all $k$ sufficiently large. As argued in [4, Sec. IV-B], if we impose no restrictions on $f_{k}$ and $g_{k}$ , then zero rate is achievable even for $\epsilon=0$ , since the cardinality of $\mathbb{R}^{k}$ is the same for any $k$ . However, if we restrict ourselves either to encoders $f_{k}$ that are linear or to decoders $g_{k}$ that are Lipschitz continuous, then the minimum $\epsilon$ -achievable rate for collections of i.i.d. RVs $X_{1}^{k}$ with a discrete-continuous mixed distribution, i.e., a distribution of the form (12), is given by

[TABLE]

Thus, for such RVs, information dimension has an operational characterization.

For stochastic processes $\{X_{t}\}$ , Wu and Verdú further demonstrated that the minimum $\epsilon$ -achievable rate, achievable with Lipschitz-continuous decoders $g_{k}$ , can be lower-bounded as [20, Remark 4]

[TABLE]

To the best of our knowledge, for non-i.i.d. processes $\{X_{t}\}$ , no matching achievability result exists for almost lossless data compression.

In contrast, for universal compressed sensing with linear encoding and decoding via Lagrangian minimum entropy pursuit, it was shown by Jalali and Poor that $\overline{d}^{\prime}(\{X_{t}\})$ is an achievable rate when $\{X_{t}\}$ is $\psi^{*}$ -mixing:

Theorem 17 ([7, Th. 8])

Consider a $\psi^{*}$-mixing stationary process $\{X_{t}\}$ taking values in $[0,1]$ with upper block-average information dimension $\overline{d}^{\prime}(\{X_{t}\})$. For each $k$, let the entries of the measurement matrix $A\in\mathbb{R}^{\ell\times k}$ be drawn i.i.d. according to a zero-mean, unit-variance, Gaussian distribution. Given $X_{1}^{k}$ generated by $\{X_{t}\}$ and $(Y_{1},\ldots,Y_{\ell})^{\textnormal{{\tiny T}}}=A(X_{1},\ldots,X_{k})^{\textnormal{{\tiny T}}}$, let

[TABLE]

where $\mathcal{X}_{m}\triangleq\{[x]_{2^{m}}\colon\ x\in[0,1]\}$ , $\hat{H}_{j}(\cdot)$ is the conditional empirical entropy [7, Def. 1], $m=\lceil r\log\log k\rceil$ (for $r>1$ ), $j=o(\frac{\log k}{\log\log k})$ , and $\gamma=(\log k)^{2r}$ . If the number of measurements $\ell=\ell_{k}$ satisfies

[TABLE]

then

[TABLE]

as $k\to\infty$ .

In words, Theorem 17 states that if the rate of random linear measurements of $X_{1}^{k}$ is slightly larger than the block-average information dimension, then the Lagrangian relaxation of minimum entropy pursuit provides an asymptotically distortion-free estimate of $X_{1}^{k}$ in terms of the Euclidean norm. Thus, for $\psi^{*}$ -mixing processes, the block-average information dimension is an achievable rate for almost zero-distortion recovery.

We next discuss an operational characterization of the rate-distortion dimension. By Theorem 9, this is also an operational characterization of the information dimension rate. In [8], Rezagah et al. considered the almost zero-distortion recovery of stationary processes when the decoder employs compressible signal pursuit (CSP) optimization:

Theorem 18 ([8, Cor. 2])

Consider a stationary, real-valued process $\{X_{t}\}$ and a system of random linear observations $(Y_{1},\ldots,Y_{\ell})^{\textnormal{{\tiny T}}}=A(X_{1},\ldots,X_{k})^{\textnormal{{\tiny T}}}$ with measurement matrix $A\in\mathbb{R}^{\ell\times k}$ composed of i.i.d. zero-mean, unit-variance, Gaussian RVs. If the number of measurements $\ell=\ell_{k}$ satisfies

[TABLE]

then there exists a family of compression codes such that

[TABLE]

as $k\to\infty$ , where $\hat{\mathbf{X}}$ is the solution of the CSP optimization

[TABLE]

and $\mathcal{C}_{k}$ denotes the codebook of the compression code.

In words, if the rate of random linear measurements of $X_{1}^{k}$ is slightly larger than the rate-distortion dimension, then there exists a family of compression codes for which CSP optimization yields an asymptotically distortion-free estimate of $X_{1}^{k}$ in terms of the Euclidean norm. Thus, the rate-distortion dimension is an achievable rate for almost zero-distortion recovery.

To summarize, (101) demonstrates that $\overline{d}^{\prime}(\{X_{t}\})$ yields a lower bound on the sampling rate required for almost lossless recovery with Lipschitz-continuous decoders. In contrast, Theorem 18 demonstrates that $\overline{\mathsf{dim}}_{R}(\{X_{t}\})$ (and hence also $\bar{d}(\{X_{t}\})$ ) is an achievable rate for almost zero-distortion recovery. Furthermore, as illustrated by Example 4, there are processes $\{X_{t}\}$ for which

$$\overline{d}(\{X_{t}\})<\overline{d}^{\prime}(\{X_{t}\}).$$

Our results thus demonstrate that there exist stationary processes for which the sampling rate required for almost zero-distortion recovery is strictly smaller than the sampling rate required for almost lossless recovery with Lipschitz-continuous decoders. In other words, the fundamental limits of almost zero-distortion recovery and almost lossless recovery are different in general.

Comparing the lower bound (101) for almost lossless recovery with Theorem 18 for almost zero-distortion recovery, we observe that there are two main differences in the setup:

i) (101) is obtained for a Lipschitz-continuous decoder $g_{k}$, whereas Theorem 18 is based on CSP optimization;

ii) for almost lossless recovery, $\hat{\mathbf{X}}=g_{k}(f_{k}(X_{1}^{k}))$ is required to be exactly equal to $X_{1}^{k}$ with high probability (cf. (99)), whereas for almost zero-distortion recovery it suffices that $\frac{1}{\sqrt{k}}\|\hat{\mathbf{X}}-(X_{1},\ldots,X_{k})^{\textnormal{{\tiny T}}}\|_{2}$ be small.

The following example presents a class of stationary processes for which almost zero-distortion recovery at rate $\bar{d}(\{X_{t}\})$ may also be achieved with linear encoders and decoders. This suggests that the second difference has greater impact.

Example 5

Let $\{X_{t}\}$ be a stationary, univariate, real-valued, Gaussian process possessing a PSD $\mathsf{S}_{X}$ with support $[-1/4,1/4]$ . By Theorem 10, we have that $d(\{X_{t}\})=1/2$ . We next invoke the sampling theorem to demonstrate that there exist linear encoders $f_{k}\colon\mathbb{R}^{k}\to\mathbb{R}^{\ell_{k}}$ and decoders $g_{k}\colon\mathbb{R}^{\ell_{k}}\to\mathbb{R}^{k}$ such that

[TABLE]

and

[TABLE]

as $k\to\infty$ , where $\hat{\mathbf{X}}=g_{k}(f_{k}(X_{1}^{k}))$ .

To describe $f_{k}$ and $g_{k}$ , we divide the indices $t=1,\ldots,k$ into three groups:

[TABLE]

where $\{\Delta_{k}\}$ is an arbitrary sequence of even integers that tends to infinity sublinearly in $k$ . The encoder $f_{k}$ only reproduces the values of $X_{t}$ with indices $t\in\mathcal{I}_{1}\cup\mathcal{I}_{2}$ , i.e., $f_{k}(X_{1}^{k})=\{X_{t},\,t\in\mathcal{I}_{1}\cup\mathcal{I}_{2}\}$ . Consequently,

[TABLE]

and the rate $\ell_{k}/k$ converges to $\frac{1}{2}$ as $k\to\infty$ .

We next show that we can find a decoder $g_{k}$ for which (110) holds. Clearly, the values $\{X_{t},\,t\in\mathcal{I}_{1}\cup\mathcal{I}_{2}\}$ are directly observed. It therefore remains to estimate the missing values of $X_{1}^{k}$ , which is done via the interpolation formula

[TABLE]

It follows that

[TABLE]

where the last step is due to stationarity. By the sampling theorem for stochastic processes, the expected value on the RHS of (116) vanishes as $\Delta_{k}\to\infty$ [21, Th. 1]. Thus, dividing both sides of (116) by $k$ and letting $k\to\infty$ gives

[TABLE]

which together with Chebyshev’s inequality [22, Th. 4.10.7] yields (110).
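The interpolation idea behind Example 5 can be illustrated numerically. The following is a minimal sketch, not the paper's formula (115): it assumes the classical cardinal-series (sinc) interpolation from even-indexed samples, which applies when the spectrum is confined to $[-1/4,1/4]$ , and it uses a deterministic sinusoid inside that band instead of a Gaussian process. The names `interpolate_from_even_samples`, `x_even`, and `half_window` are hypothetical; `half_window` plays a role loosely analogous to $\Delta_{k}$ .

```python
import numpy as np

def interpolate_from_even_samples(x_even, t, half_window):
    """Estimate x(t) from the even-indexed samples x(2n), |n| <= half_window.

    x_even is a callable returning x(2n) for an integer array n
    (a hypothetical interface chosen for this sketch).
    """
    n = np.arange(-half_window, half_window + 1)
    # np.sinc(u) = sin(pi*u) / (pi*u); this is the kernel for sampling period 2.
    weights = np.sinc((t - 2 * n) / 2.0)
    return float(np.sum(x_even(n) * weights))

# Deterministic test signal with frequency 0.1 < 1/4, i.e. inside the assumed band.
freq, phase = 0.1, 0.3

def signal(t):
    return np.cos(2 * np.pi * freq * np.asarray(t, dtype=float) + phase)

def x_even(n):
    return signal(2 * n)

t_missing = 1  # an odd index, i.e. a sample that was not kept
estimate = interpolate_from_even_samples(x_even, t_missing, half_window=1000)
error = abs(estimate - float(signal(t_missing)))
```

The truncated series converges slowly (the kernel decays like $1/n$ ), which is consistent with the example requiring the window to grow without bound before the interpolation error vanishes.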

VII Conclusions

Rényi [1] proposed the information dimension and the $d$ -dimensional entropy to measure the information content of general RVs. His idea was to quantize the real-valued RV $X$ by a uniform quantizer of step size $1/m$ , and to then analyze the entropy of the quantized RV $[X]_{m}$ in the limit as $m$ tends to infinity. His results demonstrate that any RV with positive information dimension has infinite information content. This is, e.g., the case for RVs whose probability measure has an absolutely-continuous part. The problem becomes even more interesting for stochastic processes $\{X_{t}\}$ , since their information content is not only determined by the distribution of the marginals $X_{t}$ , but also by their temporal dependence. For example, consider a stationary Gaussian process $\{X_{t}\}$ with bandlimited PSD. On the one hand, Gaussian processes have absolutely-continuous marginals, so one would expect that their information content is infinite. On the other hand, for processes with a bandlimited PSD, the present sample $X_{0}$ can be perfectly predicted from its infinite past $X_{-1},X_{-2},\ldots$ (see Example 4), which suggests that the information content of $\{X_{t}\}$ is zero.
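Rényi's construction can be made concrete with a small computation. The sketch below assumes a toy RV that equals $0$ with probability $1-\rho$ and is uniform on $[0,1)$ with probability $\rho$ ; for such a discrete-continuous mixture the information dimension is known to equal $\rho$ . The function names are hypothetical, and entropies are in nats.

```python
import math

def quantized_entropy_mixture(rho, m):
    """Entropy (nats) of [X]_m for X = 0 w.p. 1 - rho and X ~ Uniform[0, 1) w.p. rho."""
    p0 = (1.0 - rho) + rho / m   # quantization cell that contains the atom at 0
    p_other = rho / m            # each of the remaining m - 1 cells
    h = -p0 * math.log(p0) if p0 > 0 else 0.0
    if p_other > 0:
        h -= (m - 1) * p_other * math.log(p_other)
    return h

def dimension_estimate(rho, m):
    """Finite-m proxy H([X]_m) / log m for the information dimension."""
    return quantized_entropy_mixture(rho, m) / math.log(m)
```

For $\rho=1$ the ratio equals $1$ for every $m$ , for $\rho=0$ it equals $0$ , and for intermediate $\rho$ it approaches $\rho$ only slowly, since the correction term decays like $1/\log m$ .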

To shed some light on such questions, we proposed a generalization of information dimension to stochastic processes by defining the information dimension rate $d(\{\mathbf{X}_{t}\})$ as the entropy rate $H^{\prime}(\{[\mathbf{X}_{t}]_{m}\})$ divided by $\log m$ in the limit as $m\to\infty$ . We demonstrated that the information dimension rate coincides with the rate-distortion dimension, defined as twice the pre-log factor of the rate-distortion function $R(D)$ . We further showed that, among all stationary processes with PSD $\mathsf{S}_{\mathbf{X}}$ , the Gaussian process has the largest information dimension rate. This is consistent with the observation that Gaussian processes are the hardest to predict, hence they are expected to have the largest information content. We then showed that the information dimension rate of stationary Gaussian processes is given by the average rank of $\mathsf{S}_{\mathbf{X}}$ , i.e.,

[TABLE]

Specialized to the univariate case, this yields that the information dimension rate is given by the Lebesgue measure of the support of $\mathsf{S}_{X}$ , i.e.,

[TABLE]

This agrees with the intuition that if the PSD of $\{X_{t}\}$ is zero on a set of positive Lebesgue measure, then some samples can be expressed in terms of the remaining samples and therefore carry no information content. It further answers the question raised above in the affirmative: stationary Gaussian processes with a bandlimited PSD have infinite information content unless the PSD is zero almost everywhere.
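The average-rank characterization lends itself to a simple numerical check. The sketch below approximates the integral of $\mathrm{rank}(\mathsf{S}_{\mathbf{X}}(\theta))$ over $[-1/2,1/2]$ by a midpoint rule. The function `average_rank`, the grid size, and the two toy PSDs are assumptions made for illustration only.

```python
import numpy as np

def average_rank(psd, num_points=1000, tol=1e-12):
    """Midpoint-rule approximation of the integral of rank(psd(theta)) over [-1/2, 1/2]."""
    # Midpoints of num_points equal-width bins covering [-1/2, 1/2].
    thetas = -0.5 + (np.arange(num_points) + 0.5) / num_points
    ranks = [np.linalg.matrix_rank(np.atleast_2d(psd(th)), tol=tol) for th in thetas]
    return float(np.mean(ranks))

def band(theta):
    """Indicator of the band [-1/4, 1/4] (toy PSD shape)."""
    return 1.0 if abs(theta) <= 0.25 else 0.0

def univariate_psd(theta):
    return np.array([[band(theta)]])

def bivariate_psd(theta):
    # One bandlimited component and one white component, mutually uncorrelated.
    return np.diag([band(theta), 1.0])

d_uni = average_rank(univariate_psd)
d_bi = average_rank(bivariate_psd)
```

Under these assumptions the univariate example gives $1/2$ , the Lebesgue measure of the support, and the bivariate example gives $3/2$ .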

An alternative definition for the information dimension of a stochastic process was proposed by Jalali and Poor [7] as the information dimension of $X_{1}^{k}$ divided by $k$ in the limit as $k\to\infty$ . We referred to this quantity as the block-average information dimension $d^{\prime}(\{X_{t}\})$ . While $d(\{X_{t}\})$ and $d^{\prime}(\{X_{t}\})$ coincide for $\psi^{*}$ -mixing processes, in general we have that $d(\{X_{t}\})\leq d^{\prime}(\{X_{t}\})$ , where the inequality can be strict. In particular, as illustrated by Example 4, if the support of $\mathsf{S}_{X}$ of the Gaussian process $\{X_{t}\}$ has positive Lebesgue measure, then $d^{\prime}(\{X_{t}\})=1$ . Thus, in contrast to the information dimension rate, the block-average information dimension does not capture the dependence of the information dimension on the support size of $\mathsf{S}_{X}$ .

The essential difference between the definitions of $d(\{X_{t}\})$ and $d^{\prime}(\{X_{t}\})$ is the order in which the limits over the quantization bin size $1/m$ and the block size $k$ are taken. Rezagah et al. [8] showed that these limits can be exchanged if the process satisfies

[TABLE]

in which case $\mathsf{dim}_{R}(\{X_{t}\})=d^{\prime}(\{X_{t}\})$ . However, in this case the information dimension of the stochastic process $\{X_{t}\}$ coincides with the information dimension of the marginal RV $X_{t}$ . In other words, for such processes a generalization of information dimension to stochastic processes is redundant. In contrast, we showed in Theorem 9 that, for any stochastic process $\{X_{t}\}$ , the information dimension rate $d(\{X_{t}\})$ coincides with the rate-distortion dimension $\mathsf{dim}_{R}(\{X_{t}\})$ . This implies that $d^{\prime}(\{X_{t}\})$ coincides with $\mathsf{dim}_{R}(\{X_{t}\})$ only for those stochastic processes for which $d^{\prime}(\{X_{t}\})=d(\{X_{t}\})$ .

The equivalence between the information dimension rate $d(\{X_{t}\})$ and the rate-distortion dimension $\mathsf{dim}_{R}(\{X_{t}\})$ implies that $d(\{X_{t}\})$ inherits the operational characterizations of $\mathsf{dim}_{R}(\{X_{t}\})$ . For example, it was demonstrated in [8] that $\mathsf{dim}_{R}(\{X_{t}\})$ is an achievable rate for almost zero-distortion recovery. In contrast, [20] shows that $d^{\prime}(\{X_{t}\})$ is a lower bound on the minimum $\epsilon$ -achievable rate, achievable with Lipschitz-continuous decoders. By demonstrating that there are processes for which

[TABLE]

our results show that the fundamental limits of almost zero-distortion recovery and almost lossless recovery are different in general. Jalali and Poor [7] further showed that $d^{\prime}(\{X_{t}\})$ is an achievable rate for universal lossless compressed sensing with linear encoding and decoding via Lagrangian minimum entropy pursuit when $\{X_{t}\}$ is $\psi^{*}$ -mixing. Since for $\psi^{*}$ -mixing processes we have $d(\{X_{t}\})=d^{\prime}(\{X_{t}\})$ , our definition also inherits this operational characterization.

Appendix A Appendix to Section III

A-A Proof of Lemma 2

The first inequality in (14), namely,

[TABLE]

follows directly from Fatou’s lemma [22, Th. 1.6.8, p. 50]. The second inequality follows because the limit inferior is upper-bounded by the limit superior. For the third inequality, note that for every $m=2,3,\ldots$ and $Y=y$ [1, eq. (11)]

[TABLE]

Furthermore, since conditioning reduces entropy, we have

[TABLE]

for every $m=2,3,\ldots$ . Hence, the RHS of (123) is integrable, and the third inequality in (14) follows again by Fatou’s lemma.

A-B Proof of Lemma 4

If $H([X_{1}^{k}]_{1})=\infty$ , then we have $d(X_{1}^{k})=\infty$ and the right-most inequality in (17) holds trivially. Moreover, in this case $H([X_{t}]_{1})=\infty$ for at least one $t$ , so for this $t$ we also have $\bar{d}(X_{t})=\infty$ . Thus, also the left-most inequality holds.

If $H([X_{1}^{k}]_{1})<\infty$ , then we have

[TABLE]

hence the upper information dimensions are finite. It follows by the chain rule of entropy and because conditioning reduces entropy that

[TABLE]

Likewise, we have

[TABLE]

where the inequality follows because conditioning reduces entropy and because, conditioned on $X_{1}^{t-1}$ , $[X_{t}]_{m}$ is independent of $[X_{1}^{t-1}]_{m}$ .

A-C Proof of Theorem 5

To simplify notation, we shall write collections of RVs as vectors, namely, $\mathbf{X}=(X_{1},\ldots,X_{k})^{\textnormal{{\tiny T}}}$ and $\mathbf{Y}=(Y_{1},\ldots,Y_{\ell})^{\textnormal{{\tiny T}}}$ . The proof of Theorem 5 is based on the following lemma.

Lemma 19

Let $\mathbf{X}$ and $\mathbf{Y}$ be $k$ - and $\ell$ -dimensional, jointly Gaussian vectors with mean vectors ${\boldsymbol{\mu}}_{\mathbf{X}}$ and ${\boldsymbol{\mu}}_{\mathbf{Y}}$ and joint covariance matrix $C_{\mathbf{X},\mathbf{Y}}$ . Then, there exists a $k\times\ell$ matrix $T$ and a length- $k$ vector ${\boldsymbol{\mu}}$ such that $\mathsf{E}\left[\mathbf{X}|\mathbf{Y}\right]={\boldsymbol{\mu}}+T\mathbf{Y}$ . Moreover, $\mathbf{E}\triangleq\mathbf{X}-{\boldsymbol{\mu}}-T\mathbf{Y}$ has zero mean, is uncorrelated with $\mathbf{Y}$ , and satisfies $C_{\mathbf{E}}=C_{\mathbf{X}}-TC_{\mathbf{Y}}T^{\textnormal{{\tiny T}}}$ .

Proof:

If $\mathbf{X}$ and $\mathbf{Y}$ are jointly Gaussian, then $\mathbf{X}$ can be written as a linear transformation of $\mathbf{Y}$ and an uncorrelated error. This follows from the fact that, for jointly Gaussian $\mathbf{X}$ and $\mathbf{Y}$ , the linear minimum mean-square error (LMMSE) estimator of $\mathbf{X}$ given $\mathbf{Y}$ always exists and is given by $\mathsf{E}\left[\mathbf{X}|\mathbf{Y}\right]={\boldsymbol{\mu}}+T\mathbf{Y}$ . The result that $\mathbf{E}$ has zero mean, is uncorrelated with $\mathbf{Y}$ , and satisfies $C_{\mathbf{E}}=C_{\mathbf{X}}-TC_{\mathbf{Y}}T^{\textnormal{{\tiny T}}}$ follows by direct calculation. ∎

Since information dimension is translation invariant, it follows that

[TABLE]

Furthermore, since $\mathbf{X}$ and $\mathbf{Y}$ are jointly Gaussian, so are $\mathbf{Y}$ and $\mathbf{E}$ ; since they are uncorrelated, it follows that they are independent. Thus,

[TABLE]

where $C_{\mathbf{E}}$ is the covariance matrix of $\mathbf{E}$ . The identities (128) and (129) hold for every $\mathbf{y}$ , so it follows from Lemma 2 that $d(\mathbf{X}|\mathbf{Y})=\mathrm{rank}(C_{\mathbf{E}})$ . It remains to show that $C_{\mathbf{E}}$ is the generalized Schur complement of $C_{\mathbf{Y}}$ in $C_{\mathbf{X},\mathbf{Y}}$ . Indeed, by [12, 7.1.P28] there exists a matrix $W$ such that $C_{\mathbf{Y}\mathbf{X}}=C_{\mathbf{Y}}W$ . The generalized Schur complement of $C_{\mathbf{Y}}$ in $C_{\mathbf{X},\mathbf{Y}}$ is then given by

[TABLE]

Comparing (130) with the expression of $C_{\mathbf{E}}$ given in Lemma 19, we observe that $C_{\mathbf{E}}=C_{\mathbf{X}|\mathbf{Y}}$ if the matrix $T$ in Lemma 19 satisfies $C_{\mathbf{Y}\mathbf{X}}=C_{\mathbf{Y}}T^{\textnormal{{\tiny T}}}$ . This is indeed the case: since $\mathbf{X}={\boldsymbol{\mu}}+T\mathbf{Y}+\mathbf{E}$ , and since $\mathbf{Y}$ and $\mathbf{E}$ are uncorrelated, we have that

[TABLE]

This proves Theorem 5.
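The quantity appearing in Theorem 5 can be computed directly from a joint covariance matrix. The sketch below assumes that the Moore-Penrose pseudoinverse is used as the generalized inverse, so that the generalized Schur complement reads $C_{\mathbf{X}}-C_{\mathbf{X}\mathbf{Y}}C_{\mathbf{Y}}^{+}C_{\mathbf{Y}\mathbf{X}}$ . The function name and the rank tolerance are hypothetical choices.

```python
import numpy as np

def conditional_dimension(cov_joint, k, tol=1e-10):
    """Rank of the generalized Schur complement of C_Y in the covariance of (X, Y).

    cov_joint is the covariance of the stacked vector (X, Y); X has k components.
    Returns the rank together with the Schur complement itself.
    """
    c_x = cov_joint[:k, :k]
    c_xy = cov_joint[:k, k:]
    c_y = cov_joint[k:, k:]
    # Pseudoinverse as generalized inverse (an assumption of this sketch).
    schur = c_x - c_xy @ np.linalg.pinv(c_y) @ c_xy.T
    return int(np.linalg.matrix_rank(schur, tol=tol)), schur

# X = (Y, N) with Y and N independent standard Gaussians: given Y, only N is random.
cov_a = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
rank_a, schur_a = conditional_dimension(cov_a, k=2)

# X = Y1 + N observed through the degenerate pair (Y1, Y1); C_Y is singular here.
cov_b = np.array([[2.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0]])
rank_b, schur_b = conditional_dimension(cov_b, k=1)
```

The second case shows why a generalized inverse is needed: $C_{\mathbf{Y}}$ is singular, yet the Schur complement is still well defined and equals the variance of $N$ .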

Appendix B Proof of Lemma 7

To prove Lemma 7, we shall need the following auxiliary result.

Lemma 20

Let $X_{1}^{k}$ be a collection of real-valued RVs, and let $f\colon\mathbb{R}^{k}\to\mathbb{R}^{\ell}$ be Lipschitz continuous with Lipschitz constant $\mathsf{K}$ . Then,

[TABLE]

Proof:

Note that if $[X_{1}^{k}]_{m}=z_{1}^{k}/m$ for some $z_{1}^{k}\in\mathbb{Z}^{k}$ , then $X_{1}^{k}\in\mathcal{C}(z_{1}^{k}/m,1/m)$ , a cube with diameter $\sqrt{k}/m$ . The image of this cube under the Lipschitz function $f$ has a diameter not greater than $\mathsf{K}\sqrt{k}/m$ . Computing $[f(X_{1}^{k})]_{m}$ induces a partition of $\mathbb{R}^{\ell}$ into $\ell$ -dimensional cubes. Of this partition, at most $\lceil\mathsf{K}\sqrt{k}+1\rceil^{\ell}$ elements have a nonempty intersection with the image of $\mathcal{C}(z_{1}^{k}/m,1/m)$ under $f$ . Therefore,

[TABLE]

for every $z_{1}^{k}\in\mathbb{Z}^{k}$ , so Lemma 20 follows by averaging over $[X_{1}^{k}]_{m}$ . ∎
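The counting argument in the proof of Lemma 20 can be checked empirically. The sketch below assumes a linear map $f(x)=Ax$ , whose Lipschitz constant is the spectral norm of $A$ , samples points from one quantization cube, and counts how many output cells are hit. The helper names, the matrix, and the sample size are hypothetical.

```python
import math
import numpy as np

def count_output_cells(matrix, cell_index, m, num_samples=20000, seed=0):
    """Count distinct values of [f(x)]_m for sampled x with [x]_m = cell_index / m."""
    rng = np.random.default_rng(seed)
    k = matrix.shape[1]
    # Points in the cube cell_index / m + [0, 1/m)^k.
    x = (np.asarray(cell_index) + rng.random((num_samples, k))) / m
    y = x @ matrix.T
    out_cells = np.floor(m * y).astype(np.int64)
    return len({tuple(row) for row in out_cells})

def lemma20_bound(lipschitz_const, k, ell):
    """The bound ceil(K * sqrt(k) + 1) ** ell on the number of intersected cells."""
    return math.ceil(lipschitz_const * math.sqrt(k) + 1) ** ell

A = np.array([[1.0, 2.0, -1.0],
              [0.5, 0.0, 3.0]])      # maps R^3 to R^2
K = float(np.linalg.norm(A, 2))      # spectral norm = Lipschitz constant of x -> A x
observed = count_output_cells(A, cell_index=(3, -2, 5), m=16)
bound = lemma20_bound(K, k=3, ell=2)
```

Sampling can only undercount the cells that are actually intersected, so the check is one-sided; it illustrates the bound without proving it.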

We next prove Lemma 7. Let $\mathbf{Y}_{t}=f_{t}(\mathbf{X}_{t})$ . To prove the right-most relation in (28), we use that for every $k$ and $m$

[TABLE]

The second summand can be further upper-bounded by

[TABLE]

Since every function $f_{t}$ is Lipschitz with a Lipschitz constant at most $\mathsf{K}\triangleq\sup_{t\in\mathbb{Z}}\mathsf{K}_{t}$ , we can use Lemma 20 to bound the RHS of (135) by $kM\log\lceil\mathsf{K}\sqrt{L}+1\rceil$ . Since this term is independent of $m$ , the contribution of the second summand on the RHS of (134) vanishes as $m\to\infty$ . We thus obtain $\overline{\underline{d}}(\{\mathbf{X}_{t},\mathbf{Y}_{t}\})=\overline{\underline{d}}(\{\mathbf{X}_{t}\})$ by dividing both sides of (134) by $k\log m$ and letting $k$ and $m$ tend to infinity.

To prove the left-most relation in (28), we use that for every $k$ and $m$

[TABLE]

The claim follows then by dividing both sides of (136) by $k\log m$ and letting $k$ and $m$ tend to infinity.

Appendix C Proof of Lemma 8

For every $m$ and $k$ , we have

[TABLE]

Dividing by $k\log m$ and letting first $k$ and then $m$ tend to infinity yields (30).

To prove (31), we note that Lemma 7 and (30) yield $\overline{\underline{d}}(\{\mathbf{X}_{t}+\mathbf{Z}_{t}\})\leq\overline{\underline{d}}(\{\mathbf{X}_{t}\})$ . For the reverse inequality, we use [13, Lemma 30] and the fact that conditioning reduces entropy to obtain

[TABLE]

Dividing both sides of (138) by $k\log m$ , and letting first $k$ and then $m$ tend to infinity, yields $\overline{\underline{d}}(\{\mathbf{X}_{t}+\mathbf{Z}_{t}\})\geq\overline{\underline{d}}(\{\mathbf{X}_{t}\})$ and proves (31).

Finally, if $Z$ is discrete and $H([\mathbf{X}_{1}]_{1})=\infty$ , then $H([\mathbf{X}_{1}^{k}]_{m}|Z)=\infty$ , since

[TABLE]

where the second entropy is finite by assumption and the first entropy satisfies $H([\mathbf{X}_{1}]_{m})\geq H([\mathbf{X}_{1}]_{1})=\infty$ . Conversely, if $Z$ is discrete and $H([\mathbf{X}_{1}]_{1})<\infty$ , then

[TABLE]

Dividing all terms by $k$ and letting $k$ tend to infinity thus yields

[TABLE]

Since $I([\mathbf{X}_{1}^{k}]_{m};Z)\leq H(Z)<\infty$ , the second term on the RHS of (141) tends to zero as $k$ tends to infinity. Thus, dividing (141) by $\log m$ , and letting $m$ tend to infinity, yields (32).

Appendix D Proof of Theorem 9

The proof of Theorem 9 is essentially identical to the proof of [2, Lemma 3.2]. For the sake of completeness, we reproduce the full proof here. Indeed, choosing in (33)

[TABLE]

yields

[TABLE]

since for the choice (142) we have $\|\mathbf{X}_{1}^{k}-\hat{\mathbf{X}}_{1}^{k}\|_{2}^{2}\leq\frac{kL}{m^{2}}=kD$ , hence it satisfies (34). Consequently, dividing by $-k\log D$ , and taking limits as $k\to\infty$ and $D\downarrow 0$ , we obtain

[TABLE]

if the limits exist. If the limits do not exist, then we obtain the same upper bound for the limits replaced by limits superior and limits inferior. (Footnote 4: Since $m^{2}=L/D$ , taking the limit as $D\downarrow 0$ is tantamount to taking the limit as $m\to\infty$ .)
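The distortion bound used in the upper bound is easy to verify numerically. The sketch below assumes that the reconstruction in (142) is the uniformly quantized vector $[\mathbf{X}_{1}^{k}]_{m}$ , which is what the bound $\|\mathbf{X}_{1}^{k}-\hat{\mathbf{X}}_{1}^{k}\|_{2}^{2}\leq kL/m^{2}$ quoted above suggests. The sample sizes and the random seed are arbitrary.

```python
import numpy as np

def quantize(x, m):
    """Uniform quantizer [x]_m = floor(m * x) / m, applied componentwise."""
    return np.floor(m * x) / m

rng = np.random.default_rng(1)
k, L, m = 50, 3, 8
x = rng.normal(size=(k, L))          # k vectors with L components each
x_hat = quantize(x, m)

squared_error = float(np.sum((x - x_hat) ** 2))
D = L / m**2                         # distortion level matched to m via m^2 = L / D
```

Each of the $kL$ components contributes an error in $[0,1/m)$ , so the squared error stays below $kD$ regardless of the distribution of the source.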

We next derive a lower bound on the rate-distortion dimension. To simplify notation, we treat the collection $\mathbf{X}_{1}^{k}$ of $k$ $L$ -variate random vectors as a collection of $k^{\prime}=kL$ RVs. To show that the upper bound (144) holds with equality, we use the following lower bound on $R(\mathbf{X}_{1}^{k},D)$ given in [23], [2, eq. (A.1)]:

[TABLE]

where $\lambda_{s}\colon\mathbb{R}^{k^{\prime}}\to[0,\infty)$ is an arbitrary nonnegative measurable function satisfying

[TABLE]

Following the proof of [2, Lemma 3.2], we apply (145) with

[TABLE]

We first show that this choice of $\lambda_{s}$ satisfies (146). Indeed,

[TABLE]

where the second step follows by substituting $\tilde{x}_{\ell}=mx_{\ell}-i_{\ell}$ and $\tilde{y}_{\ell}=my_{\ell}-j_{\ell}$ . Since the sum over $i_{\ell}$ does not depend on $j_{\ell}$ , it follows that

[TABLE]

which can be upper-bounded as

[TABLE]

Hence,

[TABLE]

which, by (149), is equal to $1$ . It follows that $s$ and $\lambda_{s}$ , as chosen in (147) and (148), satisfy (146).

We next evaluate (145) for this choice of $s$ and $\lambda_{s}$ and for distortion $kD$ . This yields

[TABLE]

For $m^{2}=L/D$ , this becomes

[TABLE]

We next replace again the collection $X_{1}^{k^{\prime}}$ of RVs by the equivalent collection $\mathbf{X}_{1}^{k}$ of random vectors. Dividing both sides of (155) by $-k\log D$ , and taking the limits as $k\to\infty$ and $D\downarrow 0$ , yields

[TABLE]

if the limits over $k$ and $D$ exist. If the limits do not exist, then we obtain the same lower bound for the limits replaced by limits superior and limits inferior. Combining (156) with (144) proves Theorem 9.

Appendix E Proof of Theorem 10

The proof consists of two parts. In the first part, we show that of all processes $\{\mathbf{X}_{t}\}$ with a given SDF $\mathsf{F}_{\mathbf{X}}$ , the Gaussian process has the largest information dimension rate (Section E-A). In the second part, we demonstrate that the information dimension rate of Gaussian processes is given by the average rank of the derivative of the SDF (Section E-B).

E-A Gaussian Processes Maximize the Information Dimension

By Theorem 9, the upper information dimension rate is given by

[TABLE]

The claim that the information dimension is maximized by a Gaussian process then follows by the well-known fact that of all random vectors $\mathbf{X}_{1}^{k}$ with a given covariance matrix $C_{\mathbf{X}_{1}^{k}}$ , the Gaussian random vector has the largest rate-distortion function $R(\mathbf{X}_{1}^{k},kD)$ .

To prove this claim for multivariate sources, we shall write the collection of $L$ -variate vectors $\mathbf{X}_{1}^{k}$ as a collection of $k^{\prime}$ RVs $X_{1}^{k^{\prime}}$ , where $k^{\prime}=kL$ . Since the information dimension rate is translation invariant (Lemma 7), we can assume without loss of optimality that the RVs $X_{1}^{k^{\prime}}$ have zero mean. Furthermore, by the eigenvalue decomposition, there exists an orthogonal matrix $W$ such that the random variables $Y_{1}^{k^{\prime}}$ given by $(Y_{1},\dots,Y_{k^{\prime}})^{\textnormal{{\tiny T}}}=W^{\textnormal{{\tiny T}}}(X_{1},\ldots,X_{k^{\prime}})^{\textnormal{{\tiny T}}}$ are uncorrelated and their variances are the eigenvalues of $C_{X_{1}^{k^{\prime}}}$ , which we shall denote by $\lambda_{1},\ldots,\lambda_{k^{\prime}}$ . Since mutual information is invariant under bijections, and the Euclidean norm is invariant under multiplications by orthogonal matrices, it follows that $R(Y_{1}^{k^{\prime}},kD)=R(X_{1}^{k^{\prime}},kD)$ .
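The decorrelation step can be reproduced with a standard eigendecomposition. The sketch below assumes an arbitrary (randomly generated) covariance matrix; the variable names mirror the text but are otherwise illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
C_X = B @ B.T                          # a valid covariance matrix

# eigh returns the eigenvalues in ascending order and an orthogonal matrix W.
eigenvalues, W = np.linalg.eigh(C_X)

# Covariance of Y = W^T X: diagonal, with the eigenvalues of C_X as variances.
C_Y = W.T @ C_X @ W

# Orthogonal transforms preserve the Euclidean norm, hence the squared-error distortion.
x = rng.normal(size=4)
y = W.T @ x
```

Because the map is an orthogonal bijection, both the mutual information and the distortion measure are unchanged, which is the reason the rate-distortion functions of $X_{1}^{k^{\prime}}$ and $Y_{1}^{k^{\prime}}$ coincide.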

For the case where $Y_{1}^{k^{\prime}}$ are independent, zero-mean, Gaussian random variables with variances $\lambda_{1},\ldots,\lambda_{k^{\prime}}$ , the rate-distortion function is given by [10, Th. 13.3.3]

[TABLE]

where

[TABLE]

and $\xi$ is chosen so that $D_{1}+\ldots+D_{k^{\prime}}=kD$ .
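The allocation of the per-component distortions $D_{t}$ is the classical reverse water-filling solution for independent Gaussian components. The sketch below assumes $D_{t}=\min\{\xi,\lambda_{t}\}$ and a rate of $\sum_{t}\frac{1}{2}\log(\lambda_{t}/D_{t})$ in nats, and finds the water level $\xi$ by bisection. The function name and the iteration count are hypothetical.

```python
import math

def reverse_water_filling(variances, total_distortion, iters=200):
    """Return (rate in nats, per-component distortions) for independent Gaussians."""
    if total_distortion <= 0:
        raise ValueError("total_distortion must be positive")
    if total_distortion >= sum(variances):
        # Every component can be reproduced by its mean at zero rate.
        return 0.0, list(variances)
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        xi = 0.5 * (lo + hi)
        # The allocated distortion is nondecreasing in the water level xi.
        if sum(min(xi, lam) for lam in variances) < total_distortion:
            lo = xi
        else:
            hi = xi
    xi = 0.5 * (lo + hi)
    distortions = [min(xi, lam) for lam in variances]
    rate = sum(0.5 * math.log(lam / d)
               for lam, d in zip(variances, distortions) if lam > 0)
    return rate, distortions

rate_equal, dist_equal = reverse_water_filling([1.0, 1.0], 1.0)
rate_mixed, dist_mixed = reverse_water_filling([4.0, 0.25], 0.5)
```

In the second call the weak component is allotted its full variance as distortion, so only the strong component contributes to the rate.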

The RHS of (158) can also be achieved for non-Gaussian RVs by choosing the following (possibly suboptimal) distribution of the reconstruction values $\hat{Y}_{1}^{k^{\prime}}$ :

[TABLE]

where $Z_{1}^{k^{\prime}}$ are independent, zero-mean, Gaussian RVs with variances $\frac{D_{t}\lambda_{t}}{\lambda_{t}-D_{t}}$ , and $\xi$ is as in (159). Indeed, it is easy to check that (160) satisfies the distortion constraint (34). Furthermore, by using that conditioning reduces entropy and that Gaussian RVs maximize differential entropy, it can be shown that

[TABLE]

Comparing (161) with (158), we conclude that, for uncorrelated RVs $Y_{1}^{k^{\prime}}$ ,

[TABLE]

where $(Y_{1}^{k^{\prime}})_{G}$ are jointly Gaussian with the same covariance matrix as $Y_{1}^{k^{\prime}}$ . Since $R(Y_{1}^{k^{\prime}},kD)=R(X_{1}^{k^{\prime}},kD)$ and $R\bigl{(}(Y_{1}^{k^{\prime}})_{G},kD\bigr{)}=R\bigl{(}(X_{1}^{k^{\prime}})_{G},kD\bigr{)}$ , the same is also true for general RVs. Together with (157), this proves that of all processes $\{\mathbf{X}_{t}\}$ with a given SDF $\mathsf{F}_{\mathbf{X}}$ , the Gaussian process has the largest information dimension rate.

E-B The Information Dimension of Gaussian Processes

We now assume that $\{\mathbf{X}_{t}\}$ is Gaussian. For every $i$ , we define

[TABLE]

Furthermore, let $U_{i,t}$ be i.i.d. (over all $i$ and $t$ ) and uniformly distributed on $[0,1/m)$ , and let $W_{i,t}\triangleq[X_{i,t}]_{m}+U_{i,t}$ . We define $\{[\mathbf{X}_{t}]_{m}\}$ , $\{\mathbf{N}_{t}\}$ , and $\{\mathbf{U}_{t}\}$ as the corresponding multivariate processes. Since $\{U_{i,t}\}$ is independent of $\{[X_{j,t}]_{m}\}$ for every $i,j$ , the (matrix-valued) SDFs of $\{\mathbf{W}_{t}\}$ , $\{[\mathbf{X}_{t}]_{m}\}$ , and $\{\mathbf{U}_{t}\}$ satisfy

[TABLE]

Moreover, the (matrix-valued) PSD of $\{\mathbf{U}_{t}\}$ exists and equals

[TABLE]

Since the information dimension rate is translation invariant (Lemma 7), and since the SDF $\mathsf{F}_{\mathbf{X}}$ does not depend on the mean vector ${\boldsymbol{\mu}}$ , we can assume without loss of generality that $\{\mathbf{X}_{t}\}$ has zero mean. We further show in Lemma 21 in Appendix E-C that we can assume without loss of generality that every component process of $\{\mathbf{X}_{t}\}$ has unit variance. By (48) in Lemma 11, it thus follows that

[TABLE]

We continue by writing the entropy of $[\mathbf{X}_{1}^{k}]_{m}$ in terms of a differential entropy, i.e.,

[TABLE]

Denoting by $(\mathbf{W}_{1}^{k})_{G}$ a Gaussian vector with the same mean and covariance matrix as $\mathbf{W}_{1}^{k}$ , and denoting by $f_{\mathbf{W}_{1}^{k}}$ and $g_{\mathbf{W}_{1}^{k}}$ the PDFs of $\mathbf{W}_{1}^{k}$ and $(\mathbf{W}_{1}^{k})_{G}$ , respectively, this can be expressed as

[TABLE]

Dividing by $k\log m$ , and letting first $k$ and then $m$ tend to infinity, yields the information dimension rate $d(\{\mathbf{X}_{t}\})$ . Lemma 22 in Appendix E-C shows that

[TABLE]

for some constant $\Xi$ that is independent of $(k,m)$ . Moreover, the differential entropy rate of the stationary, $L$ -variate, Gaussian process $(\{\mathbf{W}_{t}\})_{G}$ is given by [15, Th. 7.10]

[TABLE]

It thus follows that the information dimension rate of $\{\mathbf{X}_{t}\}$ equals

[TABLE]

It remains to show that the RHS of (171) is equal to the RHS of (43). To do so, we first show that the integral on the RHS of (171) can be restricted to a subset $\mathcal{F}_{\Upsilon}^{\mathsf{c}}\subseteq[-1/2,1/2]$ on which the entries of $\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)$ are bounded from above by $\Upsilon/m^{2}$ for some $\Upsilon>0$ . We then show that, on this set, $\det\mathsf{F}^{\prime}_{\mathbf{W}}(\theta)$ can be bounded from above and from below by products of affine transforms of the eigenvalues of $\mathsf{F}^{\prime}_{\mathbf{X}}(\theta)$ . These bounds are asymptotically tight, i.e., they are equal in the limit as $m$ tends to infinity. We complete the proof by showing that the order of limit and integration can be exchanged.

E-B1 Restriction on $\mathcal{F}_{\Upsilon}^{\mathsf{c}}\subseteq[-1/2,1/2]$

Choose $\Upsilon>0$ and let

[TABLE]

By (47) in Lemma 11, we have for every $i$

[TABLE]

Since the set $\mathcal{F}_{\Upsilon}$ is the union of $\mathcal{F}^{(i)}_{\Upsilon}$ , $i=1,\ldots,L$ , it then follows by the union bound that

[TABLE]

To prove (174), we note that, by the Lebesgue decomposition theorem [22, Th. 2.2.6] and the fact that the Radon-Nikodym derivative $\mathrm{d}\mathsf{F}_{N_{i}}(\theta)/\mathrm{d}\lambda$ coincides with $\mathsf{F}^{\prime}_{N_{i}}(\theta)$ almost everywhere [22, Sec. 2.3],

[TABLE]

where the second inequality follows because $\theta\mapsto\mathsf{F}^{\prime}_{N_{i}}(\theta)$ is nonnegative, and the third inequality follows by definition of $\mathcal{F}^{(i)}_{\Upsilon}$ . By (47) in Lemma 11, the integral on the left-hand side (LHS) of (176) is upper-bounded by $1/m^{2}$ , hence (174) follows.

By (164) and (165), we have that

[TABLE]

Since derivatives of matrix-valued SDFs are positive semidefinite, it follows that

[TABLE]

Hence,

[TABLE]

where the last step follows from (175). Applying Hadamard’s and Jensen’s inequality, we further get

[TABLE]

where the last step follows from (47), (164), (166), and the assumption that every component process of $\{\mathbf{X}_{t}\}$ has zero mean and unit variance. Since, by (46) in Lemma 11, $a_{1}\to 1$ as $m\to\infty$ , (180) yields

[TABLE]

Consequently,

[TABLE]

for every $\Upsilon$ . It follows that this integral does not contribute to the information dimension rate if we let $\Upsilon$ tend to infinity. In view of (171), we thus obtain the information dimension rate $d(\{\mathbf{X}_{t}\})$ by evaluating

[TABLE]

in the limit as first $m$ and then $\Upsilon$ tends to infinity.

E-B2 Bounding $\det\mathsf{F}^{\prime}_{\mathbf{W}}(\theta)$ by the Eigenvalues of $\mathsf{F}^{\prime}_{\mathbf{X}}(\theta)$

Lemma 11 and (177) yield

[TABLE]

Let $\chi_{i}(\theta)$ , $i=1,\dots,L$ , denote the eigenvalues of $\mathsf{F}^{\prime}_{\mathbf{X}}(\theta)$ . Since $\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)$ is positive semidefinite, we obtain

[TABLE]

We next derive an upper bound on $\det\mathsf{F}^{\prime}_{\mathbf{W}}(\theta)$ . Let $\|\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)\|_{1}\triangleq\sum_{i,j=1}^{L}|\mathsf{F}^{\prime}_{N_{i}N_{j}}(\theta)|$ denote the $\ell_{1}$ -matrix norm of $\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)$ . Since $\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)$ is positive semidefinite, the element with the maximum modulus is on the main diagonal; cf. [12, Problem 7.1.P1]. Furthermore, by assumption, on $\mathcal{F}_{\Upsilon}^{\mathsf{c}}$ the diagonal elements of $\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)$ are bounded from above by $\frac{\Upsilon}{m^{2}}$ . We hence obtain that

[TABLE]

It is known that all matrix norms bound the largest eigenvalue of the matrix from above [12, Th. 5.6.9]. (Footnote 5: This bound holds without a multiplicative constant, since the spectral radius of a matrix is the infimum of all matrix norms [12, Lemma 5.6.10].) Thus, the upper bound (186) is also an upper bound on the largest eigenvalue of $\mathsf{F}^{\prime}_{\mathbf{N}}(\theta)$ . Let $\omega_{i}(\theta)$ , $i=1,\dots,L$ , denote the eigenvalues of $\mathsf{F}^{\prime}_{\mathbf{W}}(\theta)$ . Then, we have for $m^{2}\geq 8/\pi$ (such that $2a_{1}-1\geq 0$ ) [12, Cor. 4.3.15]

[TABLE]
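The matrix facts invoked in this step can be sanity-checked on random positive semidefinite matrices. The sketch below tests three properties: the maximum-modulus entry of a positive semidefinite matrix lies on its diagonal, the entrywise $\ell_{1}$ -norm dominates the largest eigenvalue, and the eigenvalues of a sum obey a Weyl-type perturbation bound. The latter is assumed here to be the kind of statement that [12, Cor. 4.3.15] supplies. The matrices `A` and `N` are arbitrary stand-ins, not the spectral derivatives from the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(4, 4))
H = rng.normal(size=(4, 2))
A = G @ G.T                  # positive semidefinite stand-in for the main term
N = 0.01 * (H @ H.T)         # small positive semidefinite perturbation, rank <= 2

abs_N = np.abs(N)
max_entry_on_diagonal = bool(np.isclose(abs_N.max(), np.diag(N).max()))
entrywise_l1_norm = float(abs_N.sum())

# eigvalsh returns eigenvalues of symmetric matrices in ascending order.
eig_A = np.linalg.eigvalsh(A)
eig_N = np.linalg.eigvalsh(N)
eig_sum = np.linalg.eigvalsh(A + N)

# Weyl-type bounds: each eigenvalue of A + N lies between the matching eigenvalue
# of A shifted by the smallest and by the largest eigenvalue of N.
lower_ok = bool(np.all(eig_sum >= eig_A + eig_N.min() - 1e-10))
upper_ok = bool(np.all(eig_sum <= eig_A + eig_N.max() + 1e-10))
```

Replacing the largest eigenvalue of the perturbation by any matrix norm, such as the entrywise $\ell_{1}$ -norm, can only loosen the upper bound.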

Combining (185) and (187) with (183), we obtain

[TABLE]

To compute the limit of (183) as $m\to\infty$ , we thus need to evaluate

[TABLE]

where $\mathsf{K}$ is either $1/12$ (left-most inequality in (188)) or $1/12+L^{2}\Upsilon$ (right-most inequality in (188)).

E-B3 Exchanging Limit and Integration

To evaluate (189), we continue along the lines of [24, Sec. VIII]. Specifically, for each $i$ , we split the integral on the RHS of (189) into three parts:

[TABLE]

where $0<\varepsilon<1$ is arbitrary.

For the first part, we obtain

[TABLE]

which evaluates to $\lambda(\mathcal{F}_{I})$ in the limit as $m\to\infty$ .

We next show that the integrals over $\mathcal{F}_{II}$ and $\mathcal{F}_{III}$ do not contribute to (189). To this end, it suffices to consider the integral of the function

[TABLE]

In the remainder of the proof, we shall assume without loss of generality that $m^{2}>8/\pi$ , in which case $A_{m}(\theta)>0$ on $\theta\in\mathcal{F}_{II}\cup\mathcal{F}_{III}$ . Clearly, whenever $A_{m}(\theta)>0$ , the function in (194) converges to zero as $m\to\infty$ . Moreover, for $A_{m}(\theta)\geq 1$ , this function is nonpositive.

For all $\theta\in\mathcal{F}_{II}$ we have $A_{m}(\theta)\geq(2a_{1}-1)/(1-\varepsilon)$ , hence, by (46) in Lemma 11, we can find a sufficiently large $m_{0}$ such that $A_{m}(\theta)\geq 1$ for all $m\geq m_{0}$ . Since by the same result we also have $2a_{1}-1\leq 2$ for $m^{2}>8/\pi$ , it follows that, for $m>\max\{m_{0},\sqrt{8/\pi}\}$ ,

[TABLE]

The LHS of (195) is nonpositive and monotonically increases to zero as $m\to\infty$ . We can thus apply the monotone convergence theorem [22, Th. 1.6.7, p. 49] to get

[TABLE]

We next turn to the case $\theta\in\mathcal{F}_{III}$ . It was shown in [24, p. 443] that if $A_{m}(\theta)<1$ , then the function in (194) is bounded from above by 1. Furthermore, if $A_{m}(\theta)<1-\frac{1}{m^{2}}$ then it is nonnegative, and if $A_{m}(\theta)\geq 1-\frac{1}{m^{2}}$ then it is nonpositive and monotonically increasing in $m$ . Restricting ourselves to the case $m^{2}>8/\pi$ , we thus obtain for $\theta\in\mathcal{F}_{III}$

[TABLE]

where we made use of the fact that $A_{m}(\theta)<(2a_{1}-1)/(1-\varepsilon)$ for $\theta\in\mathcal{F}_{III}$ and that, by (46) in Lemma 11, $2a_{1}-1\leq 2$ for $m^{2}>8/\pi$ . Hence, on $\mathcal{F}_{III}$ the magnitude of the function in (194) is bounded by

[TABLE]

We can thus apply the dominated convergence theorem [22, Th. 1.6.9, p. 50] to get

[TABLE]

Combining (193), (196), and (199), we can evaluate (189) as

[TABLE]

E-B4 Wrapping Up

To compute the limit of (183) as first $m$ and then $\Upsilon$ tends to infinity, it remains to let $\Upsilon\to\infty$ on the RHS of (200). By the continuity of the Lebesgue measure, this yields

[TABLE]

To summarize, combining (171), (182), and (200), we obtain that

[TABLE]

This proves Theorem 10.

E-C Auxiliary Results

Lemma 21

Suppose that $\{\mathbf{X}_{t}\}$ is a stationary, $L$ -variate, real-valued, Gaussian process with mean vector ${\boldsymbol{\mu}}$ and SDF $\mathsf{F}_{\mathbf{X}}$ . Suppose that the component processes are ordered by their variances, i.e.,

[TABLE]

Then,

[TABLE]

and, for almost every $\theta$ ,

[TABLE]

Proof:

Normalizing component processes with positive variance to unit variance does not affect the information dimension rate, as follows from Lemma 7. If $\sigma_{i}^{2}=0$ , then the component process $\{X_{i,t}\}$ is almost surely constant. It follows that $H([X_{i,1}^{k}]_{m})=0$ for every $m$ and every $k$ , so

[TABLE]

Dividing by $k\log m$ , and letting $m$ and $k$ tend to infinity, shows that $d(\{\mathbf{X}_{t}\})=d(\{\frac{1}{\sigma_{1}}X_{1,t},\dots,\frac{1}{\sigma_{L^{\prime}}}X_{L^{\prime},t}\})$ .

Let $\Pi$ be an $L^{\prime}\times L^{\prime}$ diagonal matrix with values $\sigma_{i}$ on the main diagonal. For component processes with zero variance, the corresponding rows and columns of $\mathsf{F}^{\prime}_{\mathbf{X}}(\theta)$ are zero almost everywhere. Hence, we have for almost every $\theta$ that

[TABLE]

where $\mathbf{0}$ denotes an all-zero matrix of appropriate size. We thus have $\mathrm{rank}(\mathsf{F}^{\prime}_{\mathbf{X}}(\theta))=\mathrm{rank}(\mathsf{F}^{\prime}_{(X_{1}/\sigma_{1},\dots,X_{L^{\prime}}/\sigma_{L^{\prime}})}(\theta))$ for almost every $\theta$ . ∎

Lemma 22

Let $\mathbf{X}$ be an $\ell$ -variate, real-valued, Gaussian vector with mean vector ${\boldsymbol{\mu}}_{\mathbf{X}}$ and covariance matrix $C_{\mathbf{X}}$ . Let $\mathbf{W}\triangleq[\mathbf{X}]_{m}+\mathbf{U}$ , where $\mathbf{U}$ is an $\ell$ -variate vector, independent of $\mathbf{X}$ , with components independently and uniformly distributed on $[0,1/m)$ . Then,

[TABLE]

Proof:

By [25, Th. 23.6.14], $\mathbf{X}=(X_{1},\ldots,X_{\ell})^{\textnormal{{\tiny T}}}$ can be written as

[TABLE]

where $\mathbf{X}^{\prime}$ is an $\ell^{\prime}$ -dimensional, zero-mean, Gaussian vector ( $\ell^{\prime}\leq\ell$ ) with independent components whose variances are the nonzero eigenvalues of $C_{\mathbf{X}}$ and where $A$ is an $\ell\times\ell^{\prime}$ matrix satisfying $A^{\textnormal{{\tiny T}}}A=I_{\ell^{\prime}}$ . We use the data processing inequality, the chain rule for relative entropy, and the fact that $\mathbf{X}^{\prime}$ is Gaussian, to obtain

[TABLE]

where $g_{\mathbf{W},\mathbf{X}^{\prime}}$ denotes the PDF of a Gaussian vector with the same mean vector and covariance matrix as $(\mathbf{W},\mathbf{X}^{\prime})$ , and

[TABLE]

To evaluate the relative entropy on the RHS of (210), we first note that, given $\mathbf{X}$ , the random vector $\mathbf{W}$ is uniformly distributed on an $\ell$ -dimensional cube of side length $\frac{1}{m}$ . Since $\mathbf{X}$ can be obtained from $\mathbf{X}^{\prime}$ via (209), the conditional PDF of $\mathbf{W}$ given $\mathbf{X}^{\prime}=\mathbf{x}^{\prime}$ is

[TABLE]

Consequently, denoting $\mathbf{z}=[A\mathbf{x}^{\prime}+{\boldsymbol{\mu}}_{\mathbf{X}}]_{m}$ ,

[TABLE]

where ${\boldsymbol{\mu}}_{\mathbf{W}|\mathbf{X}^{\prime}=\mathbf{x}^{\prime}}$ and $C_{\mathbf{W}|\mathbf{X}^{\prime}}$ denote the conditional mean and the conditional covariance matrix of $\mathbf{W}$ given $\mathbf{X}^{\prime}=\mathbf{x}^{\prime}$ . These can be computed as [25, Th. 23.7.4]

[TABLE]

where $C_{\mathbf{W}\mathbf{X}^{\prime}}$ denotes the cross-covariance matrix of $\mathbf{W}$ and $\mathbf{X}^{\prime}$ , and $C_{\mathbf{W}}$ and $C_{\mathbf{X}^{\prime}}$ denote the covariance matrices of $\mathbf{W}$ and $\mathbf{X}^{\prime}$ , respectively.

Defining $\mathbf{Z}\triangleq[\mathbf{X}]_{m}$ , we have $\mathbf{W}=\mathbf{Z}+\mathbf{U}$ . Since $\mathbf{U}$ is independent of $\mathbf{X}$ , the cross-covariance matrix of $\mathbf{W}$ and $\mathbf{X}$ is equal to the cross-covariance matrix of $\mathbf{Z}$ and $\mathbf{X}$ . Bussgang’s theorem [26, eq. (20)] yields $K_{Z_{j}X_{i}}(\tau)=a_{j}K_{X_{j}X_{i}}(\tau)$ , where $a_{j}$ is defined in (45). Hence, if $\Lambda_{\mathbf{a}}$ is a diagonal matrix with $\mathbf{a}=(a_{1},\dots,a_{\ell})$ on the main diagonal, then $C_{\mathbf{Z}\mathbf{X}}=\Lambda_{\mathbf{a}}C_{\mathbf{X}}$ . From (209) we get $C_{\mathbf{X}}=AC_{\mathbf{X}^{\prime}}A^{\textnormal{{\tiny T}}}$ and $C_{\mathbf{W}\mathbf{X}^{\prime}}=C_{\mathbf{W}\mathbf{X}}A$ , hence

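$$C_{\mathbf{W}\mathbf{X}^{\prime}}=C_{\mathbf{Z}\mathbf{X}}A=\Lambda_{\mathbf{a}}C_{\mathbf{X}}A=\Lambda_{\mathbf{a}}AC_{\mathbf{X}^{\prime}}A^{\textnormal{{\tiny T}}}A=\Lambda_{\mathbf{a}}AC_{\mathbf{X}^{\prime}}. \tag{217}$$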

Together with (215) and (216), this yields

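$${\boldsymbol{\mu}}_{\mathbf{W}|\mathbf{X}^{\prime}=\mathbf{x}^{\prime}}={\boldsymbol{\mu}}_{\mathbf{W}}+\Lambda_{\mathbf{a}}A\mathbf{x}^{\prime} \tag{218}$$

$$C_{\mathbf{W}|\mathbf{X}^{\prime}}=C_{\mathbf{W}}-\Lambda_{\mathbf{a}}AC_{\mathbf{X}^{\prime}}A^{\textnormal{{\tiny T}}}\Lambda_{\mathbf{a}}=C_{\mathbf{W}}-\Lambda_{\mathbf{a}}C_{\mathbf{X}}\Lambda_{\mathbf{a}}. \tag{219}$$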

Combining (218) with (209), and using the triangle inequality, we upper-bound each component of $\mathbf{w}-{\boldsymbol{\mu}}_{\mathbf{W}|\mathbf{X}^{\prime}=\mathbf{x}^{\prime}}$ as

[TABLE]

The first and the third terms on the RHS of (220) are both upper-bounded by $\frac{1}{m}$ , and the second term is upper-bounded by $\frac{1}{2m}$ . From (46) in Lemma 11, we get that the term $|1-a_{j}|$ is upper-bounded by $\frac{1}{m}\sqrt{\frac{2}{\pi\sigma_{j}^{2}}}$ , where $\sigma_{j}^{2}$ is the variance of $X_{j}$ . We thus obtain

[TABLE]

We next note that, since $\mathbf{W}=\mathbf{Z}+\mathbf{U}$ , and since $\mathbf{U}$ is independent of $\mathbf{Z}$ and has i.i.d. components that are uniformly distributed on $[0,1/m)$ ,

$$C_{\mathbf{W}}=C_{\mathbf{Z}}+\frac{1}{12m^{2}}I_{\ell}. \tag{222}$$

It can be shown that $C_{\mathbf{Z}}-\Lambda_{\mathbf{a}}C_{\mathbf{X}}\Lambda_{\mathbf{a}}$ is the conditional covariance matrix of $\mathbf{Z}$ given $\mathbf{X}^{\prime}$ , hence it is positive semidefinite. (Indeed, we have $C_{\mathbf{Z}\mathbf{X}}=C_{\mathbf{W}\mathbf{X}}$ and, by (209), $C_{\mathbf{Z}\mathbf{X}^{\prime}}=C_{\mathbf{Z}\mathbf{X}}A$ . Replacing in (216) $\mathbf{W}$ by $\mathbf{Z}$ , and repeating the steps leading to (219), we obtain the desired result.) It follows that the smallest eigenvalue of $C_{\mathbf{W}|\mathbf{X}^{\prime}}$ is lower-bounded by $\frac{1}{12m^{2}}$ . Together with (221), this yields for the second term on the RHS of the relative-entropy expression stated after the definition of $\mathbf{z}$

[TABLE]

To upper-bound the first term on the RHS of the relative-entropy expression stated after the definition of $\mathbf{z}$ , we use that (222) combined with Lemma 11 implies that every diagonal element of $C_{\mathbf{W}|\mathbf{X}^{\prime}}$ is given by

[TABLE]

The first term on the RHS of (224) is negative, and the second term is upper-bounded by $\mathsf{E}\left[(X_{j}-Z_{j})^{2}\right]\leq 1/m^{2}$ . Hence, every element on the main diagonal of $C_{\mathbf{W}|\mathbf{X}^{\prime}}$ is upper-bounded by $\frac{1+1/12}{m^{2}}$ . It thus follows from Hadamard’s inequality that

[TABLE]

Combining (223) and (225) with the relative-entropy expression stated after the definition of $\mathbf{z}$ and with (210) yields

[TABLE]

and completes the proof. ∎
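The two facts about $C_{\mathbf{W}|\mathbf{X}^{\prime}}$ used in this proof (smallest eigenvalue at least $\frac{1}{12m^{2}}$ , diagonal entries at most $\frac{1+1/12}{m^{2}}$ ) can be checked numerically. The following is a minimal Monte Carlo sketch and not part of the paper: the mean vector, the covariance matrix, $m$ , and the sample size are arbitrary illustrative choices. It assumes a full-rank $C_{\mathbf{X}}$ , in which case conditioning the Gaussian surrogate on $\mathbf{X}$ gives the same conditional covariance as conditioning on $\mathbf{X}^{\prime}$ .

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2                       # quantizer resolution; the step size is 1/m
n = 400_000                 # Monte Carlo sample size (illustrative)
mu = np.array([0.3, -1.2])  # hypothetical mean vector
C_X = np.array([[1.0, 0.6],
                [0.6, 2.0]])  # hypothetical full-rank covariance matrix

X = rng.multivariate_normal(mu, C_X, size=n)
Z = np.floor(m * X) / m                       # [X]_m, componentwise
U = rng.uniform(0.0, 1.0 / m, size=X.shape)   # uniform on [0, 1/m), independent of X
W = Z + U

# Sample covariance of the stacked vector (W, X).
C = np.cov(np.hstack([W, X]), rowvar=False)
C_W, C_WX, C_X_hat = C[:2, :2], C[:2, 2:], C[2:, 2:]

# Conditional covariance of a Gaussian vector with these second moments.
C_cond = C_W - C_WX @ np.linalg.solve(C_X_hat, C_WX.T)

min_eig = np.linalg.eigvalsh(C_cond).min()
max_diag = np.diag(C_cond).max()
print(min_eig, 1 / (12 * m**2))        # eigenvalue vs. lower bound
print(max_diag, (1 + 1 / 12) / m**2)   # diagonal entry vs. upper bound
```

In this setting both bounds hold with a comfortable margin, which is consistent with the proof only needing them up to constants.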

Appendix F Spectral Distribution Function of $\{[\mathbf{X}_{t}]_{m}\}$

Let $\{\mathbf{X}_{t}\}$ be a stationary, $L$ -variate, Gaussian process with mean vector ${\boldsymbol{\mu}}=(\mu_{1},\ldots,\mu_{L})^{\textnormal{{\tiny T}}}$ and SDF $\mathsf{F}_{\mathbf{X}}$ . Let $\{\mathbf{Z}_{t}\}$ and $\{\mathbf{N}_{t}\}$ be defined as $Z_{i,t}\triangleq[X_{i,t}]_{m}$ and $N_{i,t}\triangleq X_{i,t}-[X_{i,t}]_{m}$ , respectively. For every pair $i,j=1,\dots,L$ , we have

[TABLE]

Bussgang’s theorem [26, eq. (20)] further yields that $K_{X_{i}Z_{j}}(\tau)=K_{Z_{j}X_{i}}(-\tau)=a_{j}K_{X_{i}X_{j}}(\tau)$ , where $a_{j}$ is defined in (45). Consequently,

[TABLE]

Since the SDF is fully determined by the covariance structure of a process [27, Th. 1, p. 206], we obtain (44).
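The Bussgang relation $K_{Z_{j}X_{i}}(\tau)=a_{j}K_{X_{j}X_{i}}(\tau)$ used above can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not the authors' code: it takes a scalar ( $L=1$ ) Gaussian AR(1) process with hypothetical parameters `rho`, `m`, and `n`, estimates the gain as $a=K_{ZX}(0)/K_{XX}(0)$ (the value forced by the relation at lag zero), and compares both sides at a few further lags.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, rho = 4, 200_000, 0.8   # illustrative choices, not from the paper

# Stationary Gaussian AR(1) process with unit variance and mean 0.3.
eps = rng.standard_normal(n)
X = np.empty(n)
X[0] = eps[0]
for t in range(1, n):
    X[t] = rho * X[t - 1] + np.sqrt(1.0 - rho**2) * eps[t]
X += 0.3
Z = np.floor(m * X) / m       # Z_t = [X_t]_m


def xcov(u, v, tau):
    """Sample estimate of Cov(u[t + tau], v[t]) for tau >= 0."""
    u0 = u[tau:] - u.mean()
    v0 = v[:len(v) - tau] - v.mean()
    return float(np.mean(u0 * v0))


a = xcov(Z, X, 0) / xcov(X, X, 0)   # empirical Bussgang gain
gaps = [abs(xcov(Z, X, tau) - a * xcov(X, X, tau)) for tau in range(6)]
max_gap = max(gaps)
print(a, max_gap)
```

The residual `max_gap` is pure sampling noise and shrinks as `n` grows; the relation itself is exact for Gaussian inputs.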

To prove (47), namely,

[TABLE]

we note that

[TABLE]

Since $|X_{i,t}-Z_{i,t}|\leq\frac{1}{m}$ and $(\mu_{i}-\mathsf{E}\left[Z_{i,t}\right])^{2}\geq 0$ , the claim (47) follows.

It remains to prove (46), namely,

[TABLE]

Set $f(\alpha)\triangleq\frac{\alpha}{\sigma_{i}}e^{-\alpha^{2}/2\sigma_{i}^{2}}$ , $\alpha\in\mathbb{R}$ . We have

[TABLE]

Furthermore,

[TABLE]

It follows that

[TABLE]

Since $|\alpha-i/m|\leq 1/m$ for $\alpha\in[i/m,(i+1)/m]$ , this yields

[TABLE]

This proves (46) and concludes the proof of Lemma 11.
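As a sanity check of the bound $|1-a_{i}|\leq\frac{1}{m}\sqrt{\frac{2}{\pi\sigma_{i}^{2}}}$ , the gain can be evaluated without simulation by summing over the quantization cells. This is a sketch under an assumption: the definition (45) of $a_{i}$ is not reproduced in this section, so the code takes $a_{i}=\mathrm{Cov}([X_{i}]_{m},X_{i})/\sigma_{i}^{2}$ , which is the value Bussgang's theorem implies at lag zero. The function name `bussgang_gain` and the truncation parameter `span` are hypothetical.

```python
import math


def bussgang_gain(m, sigma, mu=0.0, span=12.0):
    """a = Cov([X]_m, X) / sigma^2 for X ~ N(mu, sigma^2), via a sum over cells."""
    def pdf(x):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    lo = math.floor(m * (mu - span * sigma))
    hi = math.ceil(m * (mu + span * sigma))
    cov = 0.0
    for i in range(lo, hi):
        left, right = i / m, (i + 1) / m
        # On the cell [i/m, (i+1)/m) the quantizer outputs i/m, and the integral
        # of (x - mu) * pdf(x) over the cell is sigma^2 * (pdf(left) - pdf(right)).
        cov += (i / m) * sigma**2 * (pdf(left) - pdf(right))
    return cov / sigma**2


def gain_bound(m, sigma):
    """Right-hand side of the bound on |1 - a|."""
    return (1.0 / m) * math.sqrt(2.0 / (math.pi * sigma**2))


for m, sigma, mu in [(1, 0.2, 0.0), (1, 0.2, 0.5), (2, 1.0, 0.1), (8, 1.0, 0.0)]:
    a = bussgang_gain(m, sigma, mu)
    print(m, sigma, mu, abs(1.0 - a), gain_bound(m, sigma))
```

For fine quantization relative to $\sigma_{i}$ the gain is numerically indistinguishable from one; the bound only becomes informative when the step size $1/m$ is comparable to or larger than $\sigma_{i}$ .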

Appendix G Proof of Theorem 13

Let $\{\mathbf{Z}_{t}\}$ be a stationary, $L$ -variate, complex-valued process with matrix-valued SDF $\mathsf{F}_{\mathbf{Z}}$ . Let the real composite process $\{\hat{\mathbf{X}}_{t}\}$ be defined as $\hat{\mathbf{X}}_{t}\triangleq[\mathfrak{Re}(\mathbf{Z}_{t}^{\textnormal{{\tiny T}}}),\mathfrak{Im}(\mathbf{Z}_{t}^{\textnormal{{\tiny T}}})]^{\textnormal{{\tiny T}}}$ . That is, $\hat{\mathbf{X}}_{t}$ is obtained by stacking the real part of $\mathbf{Z}_{t}$ on top of the imaginary part of $\mathbf{Z}_{t}$ . Further let the augmented process $\{\hat{\mathbf{Z}}_{t}\}$ be defined as $\hat{\mathbf{Z}}_{t}\triangleq[\mathbf{Z}_{t}^{\textnormal{{\tiny T}}},{\mathbf{Z}}^{\mathsf{H}}_{t}]^{\textnormal{{\tiny T}}}$ . Clearly, $\hat{\mathbf{X}}_{t}$ and $\hat{\mathbf{Z}}_{t}$ satisfy $\hat{\mathbf{Z}}_{t}=T\hat{\mathbf{X}}_{t}$ , where

[TABLE]

is unitary up to a factor of $2$ , i.e., $T{T}^{\mathsf{H}}={T}^{\mathsf{H}}T=2I_{2L}$ . The matrix-valued autocovariance function of $\{\hat{\mathbf{Z}}_{t}\}$ reads

[TABLE]

where $\overline{K}_{\mathbf{Z}}$ denotes the pseudo-autocovariance function of $\{\mathbf{Z}_{t}\}$ . The corresponding matrix-valued SDF is given by

[TABLE]

where $\overline{\mathsf{F}}_{\mathbf{Z}}$ satisfies

[TABLE]

The autocovariance functions and SDFs of $\{\hat{\mathbf{X}}_{t}\}$ and $\{\hat{\mathbf{Z}}_{t}\}$ are related via

[TABLE]

By definition, $\overline{\underline{d}}(\{\mathbf{Z}_{t}\})=\overline{\underline{d}}(\{\hat{\mathbf{X}}_{t}\})$ . It thus follows from Theorem 10 that

[TABLE]

Since left or right multiplication by a nonsingular matrix leaves the rank unchanged, we obtain from (241) that the rank of $\mathsf{F}^{\prime}_{\hat{\mathbf{X}}}(\theta)$ is equal to the rank of $\mathsf{F}^{\prime}_{\hat{\mathbf{Z}}}(\theta)$ . Furthermore, by (238), the rank of $\mathsf{F}^{\prime}_{\hat{\mathbf{Z}}}(\theta)$ is upper-bounded by the rank of $\mathsf{F}^{\prime}_{\mathbf{Z}}(\theta)$ plus the rank of $({\mathsf{F}^{\prime}_{\mathbf{Z}}})^{*}(-\theta)$ [28, Th. 1]. Consequently,

[TABLE]

where the second step follows because complex conjugation does not affect the rank.

If $\{{\mathbf{Z}}_{t}\}$ is Gaussian, then (242) holds with equality by Theorem 10. If $\{{\mathbf{Z}}_{t}\}$ is, in addition, proper, then $\overline{K}_{\mathbf{Z}}(\tau)=0$ , so the derivative of $\overline{\mathsf{F}}_{\mathbf{Z}}$ is zero almost everywhere. Hence, the derivative of $\mathsf{F}_{\hat{\mathbf{Z}}}$ becomes block diagonal almost everywhere and its rank equals the sum of the ranks of its diagonal blocks. We conclude that, if $\{\mathbf{Z}_{t}\}$ is proper Gaussian, then (243) holds with equality. This proves Theorem 13.
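The linear-algebra facts behind this proof are easy to verify numerically. In the sketch below, the matrix `T` is reconstructed from the definitions of $\hat{\mathbf{X}}_{t}$ and $\hat{\mathbf{Z}}_{t}$ (the display defining $T$ is not reproduced in this section, so its block form here is an assumption), and the helper `random_psd` and the dimension `L = 3` are hypothetical choices. The code checks that $T$ maps the stacked real vector to the augmented complex vector, that $T$ is unitary up to a factor of $2$ , that congruence with $T$ preserves rank, and that the rank of a block-diagonal matrix is the sum of the ranks of its blocks.

```python
import numpy as np

L = 3
I = np.eye(L)
# Assumed block form: [Z; conj(Z)] = T [Re(Z); Im(Z)].
T = np.block([[I, 1j * I],
              [I, -1j * I]])

rng = np.random.default_rng(0)
z = rng.standard_normal(L) + 1j * rng.standard_normal(L)
x_hat = np.concatenate([z.real, z.imag])   # real composite vector
z_hat = np.concatenate([z, z.conj()])      # augmented complex vector

maps_correctly = np.allclose(T @ x_hat, z_hat)
scaled_unitary = (np.allclose(T @ T.conj().T, 2 * np.eye(2 * L))
                  and np.allclose(T.conj().T @ T, 2 * np.eye(2 * L)))


def random_psd(dim, rank):
    """Random Hermitian positive semidefinite matrix of the given rank."""
    G = rng.standard_normal((dim, rank)) + 1j * rng.standard_normal((dim, rank))
    return G @ G.conj().T


A, B = random_psd(L, 1), random_psd(L, 2)
O = np.zeros((L, L))
D = np.block([[A, O], [O, B]])             # block-diagonal, as in the proper case

rank_D = np.linalg.matrix_rank(D)
rank_sum = np.linalg.matrix_rank(A) + np.linalg.matrix_rank(B)
rank_congruent = np.linalg.matrix_rank(T.conj().T @ D @ T)
print(maps_correctly, scaled_unitary, rank_D, rank_sum, rank_congruent)
```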

Appendix H Appendix to Section V

H-A Proof of Theorem 14

For every $m=2,3,\ldots$ and $k=1,2,\ldots$ we have

[TABLE]

by stationarity, because conditioning reduces entropy, and because, conditioned on $X_{-\infty}^{0}$ , $[X_{1}]_{m}$ is independent of $[X_{-\infty}^{0}]_{m}$ . Note that, by (4) and stationarity,

[TABLE]

Thus, dividing (244) by $\log m$ and taking first the limit over $m$ and then the limit over $k$ yields

[TABLE]

This proves (73).

We next bound the difference $\overline{\underline{d}}^{\prime}(\{X_{t}\})-\overline{\underline{d}}(\{X_{t}\})$ . By (245), we have

[TABLE]

Dividing (247) by $\log m$ and taking first the limit over $m$ and then the limit over $k$ yields

[TABLE]

This concludes the proof of Theorem 14.
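The normalization by $\log m$ used throughout this proof can be made concrete in the simplest case of a single Gaussian random variable, for which the information dimension is $1$ . The sketch below is an illustration only and does not compute an information dimension rate: it evaluates $H([X]_{m})/\log m$ for a standard Gaussian $X$ , with the function name `quantized_entropy` and the truncation parameter `span` being hypothetical. The ratio decreases towards $1$ slowly, since the gap is of order $1/\log m$ .

```python
import math


def quantized_entropy(m, sigma=1.0, span=10.0):
    """Entropy in nats of [X]_m = floor(m X) / m for X ~ N(0, sigma^2)."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

    lo = math.floor(-span * sigma * m)
    hi = math.ceil(span * sigma * m)
    h = 0.0
    for i in range(lo, hi):
        p = cdf((i + 1) / m) - cdf(i / m)   # probability of the cell [i/m, (i+1)/m)
        if p > 0.0:
            h -= p * math.log(p)
    return h


resolutions = (4, 16, 64, 256, 1024)
ratios = [quantized_entropy(m) / math.log(m) for m in resolutions]
for m, r in zip(resolutions, ratios):
    print(m, r)
```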

H-B Proof of Corollary 15

Suppose there exists a nonnegative $n$ such that

[TABLE]

We first show that

[TABLE]

In a second step, we then show that (249) implies that

[TABLE]

which together with (250) and (74) demonstrates that $\overline{\underline{d}}(\{X_{t}\})=\overline{\underline{d}}^{\prime}(\{X_{t}\})$ , thus proving Corollary 15.

To prove (250), we use the chain rule, stationarity, and the fact that conditioning reduces entropy, to obtain

[TABLE]

Having obtained (250), we next show that (249) implies (251). Indeed,

[TABLE]

where $n$ is a nonnegative integer satisfying (249). Here, the first inequality follows from the chain rule; the second inequality follows from the data processing inequality and by upper-bounding the second mutual information by $H([X_{-n+1}^{0}]_{m})$ .

The first limit on the RHS of (253) is zero because, by assumption, $I(X_{1}^{k};X_{-\infty}^{-n})<\infty$ . The second limit on the RHS of (253) can be written as $\varlimsup_{k\to\infty}\bar{d}(X_{-n+1}^{0})/k$ , which is zero because, by Lemma 1, $\bar{d}(X_{-n+1}^{0})$ is bounded in $k$ . This proves (251) and concludes the proof of Corollary 15.

H-C Proof of Lemma 16

Since $\{X_{t}\}$ is Gaussian, the conditional mean of $X_{k}$ given $X_{0},\dots,X_{k-1}$ can be written as

[TABLE]

for some coefficients $\alpha_{1},\ldots,\alpha_{k}$ . (More precisely, the coefficients correspond to the LMMSE estimator for estimating $X_{k}$ from $X_{0},\ldots,X_{k-1}$ . The LMMSE estimator always exists, even though it is not necessarily unique.) The conditional variance $\sigma_{k}^{2}$ is thus given by (see, e.g., [19, Sec. 10.6])

[TABLE]

The function

[TABLE]

is analytic on the closed interval $[-1/2,1/2]$ , hence it is either constant or it has at most finitely many zeros in $[-1/2,1/2]$ . Moreover, $g$ cannot be the all-zero function, as can be argued by contradiction. Indeed, suppose there exist $\alpha_{1},\ldots,\alpha_{k}$ such that $g(\theta)=0$ for all $\theta$ . Then, by (255), we have $\sigma_{k}^{2}=0$ irrespective of $\mathsf{F}_{X}$ . In other words, we can find a linear estimator that perfectly predicts $X_{k}$ from $X_{0},\ldots,X_{k-1}$ irrespective of the SDF of $\{X_{t}\}$ . This is clearly a contradiction, since even the best predictor yields $\sigma_{k}^{2}=\sigma^{2}$ for an i.i.d., zero-mean, variance- $\sigma^{2}$ , Gaussian process, i.e., when $\mathsf{F}^{\prime}_{X}(\theta)=\sigma^{2}$ . Thus, the set $\mathcal{Z}\triangleq\{\theta:g(\theta)=0\}$ is finite and therefore has Lebesgue measure zero.

Since $|g(\theta)|^{2}=0$ for $\theta\in\mathcal{Z}$ , we have

[TABLE]

Since furthermore $|g(\theta)|^{2}>0$ for $\theta\in\mathcal{Z}^{\mathsf{c}}$ , we have $\sigma_{k}^{2}=0$ only if

[TABLE]

This implies that $\mathsf{F}^{\prime}_{X}(\theta)=0$ for all $\theta\in\mathcal{Z}^{\mathsf{c}}$ . Hence, the set of harmonics $\theta$ for which $\mathsf{F}^{\prime}_{X}(\theta)>0$ is contained in $\mathcal{Z}$ . The proof is completed by the monotonicity of measures and the fact that $\mathcal{Z}$ has Lebesgue measure zero.
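The contradiction step in this proof (no choice of coefficients predicts an i.i.d. process perfectly) can be illustrated numerically. The displays defining $g$ and (255) are not reproduced in this section, so the sketch below assumes the standard forms $g(\theta)=1-\sum_{\ell=1}^{k}\alpha_{\ell}e^{-\mathrm{i}2\pi\ell\theta}$ and $\sigma_{k}^{2}=\int_{-1/2}^{1/2}|g(\theta)|^{2}\,\mathrm{d}\mathsf{F}_{X}(\theta)$ ; the exact indexing in the paper may differ. For $\mathsf{F}^{\prime}_{X}(\theta)=\sigma^{2}$ , Parseval's identity then gives $\sigma^{2}(1+\sum_{\ell}\alpha_{\ell}^{2})\geq\sigma^{2}$ , which the code reproduces. The function name `prediction_error_white` is hypothetical.

```python
import numpy as np


def prediction_error_white(alpha, sigma2=1.0, n_grid=4096):
    """sigma2 times the integral of |g(theta)|^2 over [-1/2, 1/2) for white noise.

    Assumes g(theta) = 1 - sum_l alpha[l-1] * exp(-2j*pi*l*theta); an equispaced
    grid average integrates this trigonometric polynomial exactly when
    n_grid exceeds its degree.
    """
    alpha = np.asarray(alpha, dtype=float)
    theta = -0.5 + np.arange(n_grid) / n_grid
    lags = np.arange(1, alpha.size + 1)
    g = 1.0 - np.exp(-2j * np.pi * np.outer(theta, lags)) @ alpha
    return sigma2 * float(np.mean(np.abs(g) ** 2))


alpha = [0.5, -0.25, 0.1]                  # arbitrary illustrative coefficients
err = prediction_error_white(alpha)
print(err, 1.0 + sum(a * a for a in alpha))
```

Under these assumptions the error is minimized by $\alpha_{1}=\cdots=\alpha_{k}=0$ , where it equals $\sigma^{2}$ , matching the statement that the best predictor of an i.i.d. process attains $\sigma_{k}^{2}=\sigma^{2}$ .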

Acknowledgment

Fruitful discussions with Amos Lapidoth are gratefully acknowledged. The authors further wish to thank the Associate Editor Matthieu Bloch and the anonymous referees for their valuable comments.

References


[1] A. Rényi, “On the dimension and entropy of probability distributions,” Acta Mathematica Hungarica, vol. 10, no. 1-2, pp. 193–215, Mar. 1959.
[2] T. Kawabata and A. Dembo, “The rate-distortion dimension of sets and measures,” IEEE Trans. Inf. Theory, vol. 40, no. 5, pp. 1564–1572, Sep. 1994.
[3] T. Koch, “The Shannon lower bound is asymptotically tight,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 6155–6161, Nov. 2016.
[4] Y. Wu and S. Verdú, “Rényi information dimension: Fundamental limits of almost lossless analog compression,” IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3721–3748, Aug. 2010.
[5] Y. Wu, S. Shamai (Shitz), and S. Verdú, “Information dimension and the degrees of freedom of the interference channel,” IEEE Trans. Inf. Theory, vol. 61, no. 1, pp. 256–279, Jan. 2015.
[6] D. Stotz and H. Bölcskei, “Degrees of freedom in vector interference channels,” IEEE Trans. Inf. Theory, vol. 62, no. 7, pp. 4172–4197, Jul. 2016.
[7] S. Jalali and H. V. Poor, “Universal compressed sensing for almost lossless recovery,” IEEE Trans. Inf. Theory, vol. 63, no. 5, pp. 2933–2953, May 2017.
[8] F. E. Rezagah, S. Jalali, E. Erkip, and H. V. Poor, “Compression-based compressed sensing,” IEEE Trans. Inf. Theory, vol. 63, no. 10, pp. 6735–6752, Oct. 2017.
