Coherent multiple-antenna block-fading channels at finite blocklength

Austin Collins; Yury Polyanskiy

arXiv:1704.06962·cs.IT·June 28, 2018


TL;DR

This paper derives finite blocklength limits for multi-antenna block-fading channels, revealing how antenna configuration impacts coding delay and highlighting the importance of orthogonal designs like Alamouti's scheme for optimal coding.

Contribution

It provides a formula for channel dispersion in multi-antenna block-fading channels and uncovers the significance of orthogonal designs in achieving dispersion-optimal coding schemes.

Findings

01

Capacity equivalence for $n_t\times n_r$ and $n_r \times n_t$ configurations at fixed SNR

02

Coding delay varies significantly with antenna configuration, e.g., at 20 dB received SNR the $16\times 100$ system needs only 60% of the blocklength required by the $100\times 16$ system

03

Orthogonal designs like Alamouti's scheme are dispersion-optimal for MISO channels.

Abstract

In this paper we consider a channel model that is often used to describe the mobile wireless scenario: multiple-antenna additive white Gaussian noise channels subject to random (fading) gain with full channel state information at the receiver. Dynamics of the fading process are approximated by a piecewise-constant process (frequency non-selective isotropic block fading). This work addresses the finite blocklength fundamental limits of this channel model. Specifically, we give a formula for the channel dispersion -- a quantity governing the delay required to achieve capacity. Multiplicative nature of the fading disturbance leads to a number of interesting technical difficulties that required us to enhance traditional methods for finding channel dispersion. Alas, one difficulty remains: the converse (impossibility) part of our result holds under an extra constraint on the growth of the…

Tables

Table 1. Values of $v^{*}(n_{t},T)$

$n_{t} ∖ T$	1	2	3	4	5	6	7	8
1	1	2	3	4	5	6	7	8
2		8	$10^{*}$	16	18	24	26	32
3			$21^{*}$	36	[39,45]	[46,54]	[57,63]	72
4				64	[68,80]	[80,96]	[100,112]	128
5					[89,125]	[118,150]	[155,175]	200
6						[168,216]	[222,252]	288
7							[301,343]	392
8								512

Equations

$$\log M^{*}(n,\epsilon)=nC-\sqrt{nV}\,Q^{-1}(\epsilon)+O(\log n), \tag{1}$$

$$n\gtrsim\left(\frac{Q^{-1}(\epsilon)}{1-\eta}\right)^{2}\frac{V}{C^{2}}. \tag{2}$$

$$Y_{j}=H_{j}X_{j}+Z_{j}, \tag{3}$$

$$\mathbb{P}[H\neq 0]>0 \tag{4}$$

$$\mathbb{P}[W\neq\hat{W}]\leq\epsilon.$$

$$W\to X^{n}\to(Y^{n},H^{n})\to\hat{W},$$

$$\sum_{j=1}^{n}\|X_{j}\|_{F}^{2}\leq nTP\quad\mathbb{P}\text{-a.s.},$$

$$C(P)=\sum_{i=1}^{n_{\min}}\mathbb{E}\left[C_{AWGN}\left(\frac{P}{n_{t}}\Lambda_{i}^{2}\right)\right], \tag{7}$$

$$V(P)\stackrel{\triangle}{=}\inf_{P_{X}:\,I(X;Y\mid H)=C}\frac{1}{T}\mathbb{E}\left[\mathrm{Var}\left(i(X;Y,H)\mid X\right)\right] \tag{8}$$

$$i(x;y,h)\stackrel{\triangle}{=}\log\frac{dP_{Y,H\mid X=x}}{dP_{Y,H}^{*}}(y,h) \tag{9}$$

$$\log M\geq nTC(P)-\sqrt{nTV(P)}\,Q^{-1}(\epsilon)+o(\sqrt{n}). \tag{10}$$

$$\log M\leq nTC(P)-\sqrt{nTV(P)}\,Q^{-1}(\epsilon)+\delta_{n}^{\prime}\sqrt{n} \tag{11}$$

$$V_{iid}(P)=\ldots+\sum_{i=1}^{n_{\min}}\mathbb{E}\left[V_{AWGN}\left(\frac{P}{n_{t}}\Lambda_{i}^{2}\right)\right]+\left(\frac{P}{n_{t}}\right)^{2}\left(\eta_{1}-\frac{\eta_{2}}{n_{t}}\right) \tag{12}$$

(the leading term of $V_{iid}(P)$ is truncated in the source)

$c(\sigma)$, $\eta_{1}$, $\eta_{2}$

$$V(P)=\ldots+\left(\frac{P}{n_{t}}\right)^{2}\left(\eta_{1}-\frac{\eta_{2}}{n_{t}^{2}T}v^{*}(n_{t},T)\right)$$

(the leading terms of $V(P)$ are truncated in the source)

$$v^{*}(n_{t},T)=\frac{n_{t}^{2}}{2P^{2}}\max_{P_{X}:\,I(X;Y,H)=C}\mathrm{Var}\left(\|X\|_{F}^{2}\right) \tag{18}$$

$$V_{i}^{T}V_{i}=I_{n},\quad i=1,\dots,k \tag{19}$$

$$V_{i}^{T}V_{j}+V_{j}^{T}V_{i}=0,\quad i\neq j \tag{20}$$

$$\rho(2^{a}b)=8\left\lfloor\frac{a}{4}\right\rfloor+2^{a\bmod 4},\quad a,b\in\mathbb{Z},\ b\text{ odd}. \tag{21}$$

$$V_{1}=I_{2},\quad V_{2}=\begin{bmatrix}0&1\\ -1&0\end{bmatrix},$$

$$v^{*}(T,n_{t})=v^{*}(n_{t},T)\leq n_{t}T\min(n_{t},T). \tag{22}$$

$$v^{*}(n_{t},T)=n_{t}T\min(n_{t},T). \tag{23}$$

$$\frac{n_{t}^{2}}{2P^{2}}\mathrm{Var}\left(\|X\|_{F}^{2}\right)<n_{t}T\min(n_{t},T). \tag{24}$$

$C$

$$X_{i,j}\stackrel{iid}{\sim}\mathcal{N}\left(0,\frac{P}{n_{t}}\right), \tag{26}$$

$P_{Y,H}^{*}$, $P_{Y\mid H}^{*}$, $P_{Y^{(j)}\mid H=h}^{*}$

$$\forall a\in\mathbb{R}^{n_{t}},\ b\in\mathbb{R}^{T}:\quad\sum_{i=1}^{n_{t}}\sum_{j=1}^{T}a_{i}b_{j}X_{i,j}\sim\mathcal{N}\left(0,\frac{P}{n_{t}}\|a\|_{2}^{2}\|b\|_{2}^{2}\right). \tag{30}$$


Full text

Coherent multiple-antenna block-fading channels at finite blocklength

Austin Collins and Yury Polyanskiy Authors are with the Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139 USA. e-mail: {austinc,yp}@mit.edu. This material is based upon work supported by the National Science Foundation CAREER award under grant agreement CCF-12-53205, by the NSF grant CCF-17-17842 and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-09-39370.

Abstract

In this paper we consider a channel model that is often used to describe the mobile wireless scenario: multiple-antenna additive white Gaussian noise channels subject to random (fading) gain with full channel state information at the receiver. Dynamics of the fading process are approximated by a piecewise-constant process (frequency non-selective isotropic block fading). This work addresses the finite blocklength fundamental limits of this channel model. Specifically, we give a formula for the channel dispersion – a quantity governing the delay required to achieve capacity. The multiplicative nature of the fading disturbance leads to a number of interesting technical difficulties that required us to enhance traditional methods for finding the channel dispersion. Alas, one difficulty remains: the converse (impossibility) part of our result holds under an extra constraint on the growth of the peak-power with blocklength.

Our results demonstrate, for example, that while capacities of $n_{t}\times n_{r}$ and $n_{r}\times n_{t}$ antenna configurations coincide (under fixed received power), the coding delay can be sensitive to this switch. For example, at the received SNR of $20$ dB the $16\times 100$ system achieves capacity with codes of length (delay) which is only $60\%$ of the length required for the $100\times 16$ system. Another interesting implication is that for the MISO channel, the dispersion-optimal coding schemes require employing orthogonal designs such as Alamouti’s scheme – a surprising observation considering the fact that Alamouti’s scheme was designed for reducing demodulation errors, not improving coding rate. Finding these dispersion-optimal coding schemes naturally gives a criterion for producing orthogonal design-like inputs in dimensions where orthogonal designs do not exist.

I Introduction

Given a noisy communication channel, the maximal cardinality of a codebook of blocklength $n$ which can be decoded with block error probability no greater than $\epsilon$ is denoted as $M^{*}(n,\epsilon)$ . The evaluation of this function – the fundamental performance limit of block coding – is alas computationally impossible for most channels of interest. As a resolution of this difficulty [1] proposed a closed-form normal approximation, based on the asymptotic expansion:

$$\log M^{*}(n,\epsilon)=nC-\sqrt{nV}\,Q^{-1}(\epsilon)+O(\log n), \tag{1}$$

where the capacity $C$ and dispersion $V$ are two intrinsic characteristics of the channel and $Q^{-1}(\epsilon)$ is the inverse of the $Q$-function, $Q(x)=\int_{x}^{\infty}{1\over\sqrt{2\pi}}e^{-t^{2}/2}\,dt$. One immediate consequence of the normal approximation is an estimate for the minimal blocklength (delay) required to achieve a given fraction $\eta$ of the channel capacity:

$$n\gtrsim\left(\frac{Q^{-1}(\epsilon)}{1-\eta}\right)^{2}\frac{V}{C^{2}}. \tag{2}$$

Asymptotic expansions such as (1) are rooted in the central-limit theorem and have been known classically for discrete memoryless channels [2, 3] and later extended in a wide variety of directions; see the surveys in [4, 5].
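The normal approximation (1) and the blocklength estimate (2) are easy to evaluate numerically. The following is a minimal sketch, not the authors' software: the function names are hypothetical, the $O(\log n)$ term is dropped, and the inputs are illustrative values for a real AWGN channel at SNR $P=10$, using the formulas $C_{AWGN}(P)=\frac{1}{2}\log(1+P)$ and $V_{AWGN}(P)=\frac{\log^{2}e}{2}\left(1-\frac{1}{(1+P)^{2}}\right)$ quoted later in the text, in units of bits.

```python
from math import e, log2, sqrt
from statistics import NormalDist


def q_inv(eps: float) -> float:
    """Inverse of the Gaussian tail function Q(x) = P[N(0,1) > x]."""
    return NormalDist().inv_cdf(1.0 - eps)


def log_m_normal_approx(n: float, capacity: float, dispersion: float, eps: float) -> float:
    """n*C - sqrt(n*V)*Q^{-1}(eps): expansion (1) with the O(log n) term dropped."""
    return n * capacity - sqrt(n * dispersion) * q_inv(eps)


def min_blocklength(capacity: float, dispersion: float, eps: float, eta: float) -> float:
    """Estimate (2): blocklength at which the approximation reaches rate eta*C."""
    return (q_inv(eps) / (1.0 - eta)) ** 2 * dispersion / capacity ** 2


# Illustrative inputs (assumptions): real AWGN channel, SNR P = 10, units of bits.
P = 10.0
C_AWGN = 0.5 * log2(1.0 + P)
V_AWGN = 0.5 * log2(e) ** 2 * (1.0 - 1.0 / (1.0 + P) ** 2)
n_req = min_blocklength(C_AWGN, V_AWGN, eps=1e-3, eta=0.9)
print(f"C = {C_AWGN:.3f} bits, V = {V_AWGN:.3f} bits^2, n >= {n_req:.0f}")
```

By construction, plugging `n_req` back into `log_m_normal_approx` gives exactly the rate `0.9 * C_AWGN`, which is how (2) follows from (1).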

The fading channel is the centerpiece of the theory and practice of wireless communication, and hence there are many slightly different variations of the model: differing assumptions on the dynamics and distribution of the fading process, antenna configurations, and channel state knowledge. The capacity of the fading channel was found independently by Telatar [6] and Foschini and Gans [7] for the case of Rayleigh fading and channel state information available at the receiver only (CSIR) and at both the transmitter and receiver (CSIRT). Motivated by the linear gains promised by capacity results, space-time codes were introduced to exploit multiple antennas; most notable among them is Alamouti’s ingenious orthogonal scheme [8], along with a generalization by Tarokh, Jafarkhani and Calderbank [9]. Motivated by a recent surge of orthogonal frequency division multiplexing (OFDM) technology, this paper focuses on an isotropic channel gain distribution that is piecewise independent (“block-fading”) and assumes full channel state information available at the receiver (CSIR). This work describes finite blocklength effects incurred by the fading on the fundamental communication limits.

Some of the prior work on similar questions is as follows. Single antenna channel dispersion was computed in [10] for a more general stationary channel gain process with memory. In [11] finite-blocklength effects are explored for the non-coherent block fading setup. Quasi-static fading channels in the general MIMO setting have been thoroughly investigated in [12], showing that the expansion (1) changes dramatically (in particular the channel dispersion term becomes zero); see also [13] for evaluation of the bounds. The coherent quasi-static channel has been studied in the limit of infinitely many antennas in [14], appealing to concentration properties of random matrices. Dispersion for lattices (infinite constellations) in fading channels has been investigated in a sequence of works, see [15] and references therein. Note also that there are some very fine differences between stationary and block-fading channel models, cf. [16, Section 4]. The minimum energy to send $k$ bits over a MIMO channel for both the coherent and non-coherent case was studied in [17], showing the latter requires orders of magnitude larger latencies. The work [18] investigates the problem of power control with an average power constraint on the codebook in the quasi-static fading channel with perfect CSIRT. A novel achievability bound was found and evaluated for the fading channel with CSIR in [19]. Parts of this work have previously appeared in [20, 21].

The paper is organized as follows. In Section II we describe the channel model and state all our main results formally. Section III characterizes capacity achieving input/output distributions (caid/caod, resp.) and evaluates moments of the information density. Then in Sections IV and V we prove the achievability and converse parts of our (non rank-1) results, respectively. Section VI focuses on the special case of when the matrix of channel gains has rank 1. Finally, Section VII contains a discussion of numerical results and the behavior of channel dispersion as a function of the number of antennas.

The numerical software used to compute the achievability bounds, dispersion and normal approximation in this work can be found online under the Spectre project [22].

II Main Results

II-A Channel Model

The channel model considered in this paper is the frequency-nonselective coherent real block fading (BF) discrete-time channel with multiple transmit and receive antennas (MIMO) (See [23, Section II] for background on this model). We will simply refer to it as the MIMO-BF channel, which we formally define here. Given parameters $n_{t},n_{r},P,T$ as follows: let $n_{t}\geq 1$ be the number of transmit antennas, $n_{r}\geq 1$ be the number of receive antennas, and $T\geq 1$ be the coherence time of the channel. The input-output relation at block $j$ (spanning time instants $(j-1)T+1$ to $jT$ ) with $j=1,\ldots,n$ is given by

$$Y_{j}=H_{j}X_{j}+Z_{j}, \tag{3}$$

where $\{H_{j},j=1,\ldots\}$ is an $n_{r}\times n_{t}$ matrix-valued random fading process, $X_{j}$ is an $n_{t}\times T$ matrix channel input, $Z_{j}$ is an $n_{r}\times T$ Gaussian random real-valued matrix with independent entries of zero mean and unit variance, and $Y_{j}$ is the $n_{r}\times T$ matrix-valued channel output. The process $H_{j}$ is assumed to be i.i.d. with isotropic distribution $P_{H}$, i.e. for any orthogonal matrices $U\in\mathbb{R}^{n_{r}\times n_{r}}$ and $V\in\mathbb{R}^{n_{t}\times n_{t}}$, both $UH$ and $HV$ are equal in distribution to $H$. We also assume

$$\mathbb{P}[H\neq 0]>0 \tag{4}$$

to avoid trivialities. Note that due to merging channel inputs at time instants $1,\ldots,T$ into one matrix-input, the block-fading channel becomes memoryless. We assume coherent demodulation so that the channel state information (CSI) $H_{j}$ is fully known to the receiver (CSIR).
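The input-output relation (3) can be simulated block by block. The sketch below is illustrative only: the helper names are hypothetical, and it assumes a fading matrix $H$ with i.i.d. $\mathcal{N}(0,1)$ entries, which is one example of an isotropic distribution (the model in the text allows any isotropic $P_{H}$). A fresh $H_{j}$ is drawn independently for every coherence block, and the receiver is handed $H_{j}$ together with $Y_{j}$ (CSIR).

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def channel_block(x, n_r, rng):
    """Pass one n_t x T input block through Y = H X + Z; return (Y, H).

    Assumption: H has i.i.d. N(0,1) entries, one example of an isotropic law.
    """
    n_t, T = len(x), len(x[0])
    h = [[rng.gauss(0.0, 1.0) for _ in range(n_t)] for _ in range(n_r)]
    hx = matmul(h, x)
    # Z has independent zero-mean, unit-variance Gaussian entries.
    y = [[hx[i][j] + rng.gauss(0.0, 1.0) for j in range(T)] for i in range(n_r)]
    return y, h


def transmit(blocks, n_r, seed=0):
    """Send n blocks; H is drawn afresh (independently) for every block."""
    rng = random.Random(seed)
    return [channel_block(x, n_r, rng) for x in blocks]


# Example: n = 3 blocks, n_t = 2, T = 4, n_r = 3, all-ones input (hypothetical).
outputs = transmit([[[1.0] * 4 for _ in range(2)] for _ in range(3)], n_r=3)
```

Treating each $n_{t}\times T$ block as a single channel input is what makes the simulated channel memoryless across the index $j$.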

An $(nT,M,\epsilon,P)_{CSIR}$ code of blocklength $nT$ , probability of error $\epsilon$ and power-constraint $P$ is a pair of maps: the encoder $f:[M]\to(\mathbb{R}^{n_{t}\times T})^{n}$ and the decoder $g:(\mathbb{R}^{n_{r}\times T})^{n}\times(\mathbb{R}^{n_{r}\times n_{t}})^{n}\to[M]$ satisfying the probability of error constraint

$$\mathbb{P}[W\neq\hat{W}]\leq\epsilon$$

on the probability space

$$W\to X^{n}\to(Y^{n},H^{n})\to\hat{W},$$

where the message $W$ is uniformly distributed on $[M]$ , $X^{n}=f(W)$ , $X^{n}\to(Y^{n},H^{n})$ is as described in (3), and $\hat{W}=g(Y^{n},H^{n})$ . In addition the input sequences are required to satisfy the power constraint:

$$\sum_{j=1}^{n}\|X_{j}\|_{F}^{2}\leq nTP\quad\mathbb{P}\text{-a.s.},$$

where $\|M\|_{F}^{2}\stackrel{{\scriptstyle\triangle}}{{=}}\sum_{i,j}M_{i,j}^{2}$ is the Frobenius norm of the matrix $M$ .

Under the isotropy assumption on $P_{H}$ , the capacity $C$ appearing in (1) of this channel is given by [6]

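$$C(P)=\sum_{i=1}^{n_{\min}}\mathbb{E}\left[C_{AWGN}\left(\frac{P}{n_{t}}\Lambda_{i}^{2}\right)\right], \tag{7}$$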

where $C_{AWGN}(P)={1\over 2}\log(1+P)$ is the capacity of the additive white Gaussian noise (AWGN) channel with SNR $P$, $n_{\min}=\min(n_{r},n_{t})$ is the minimum of the numbers of transmit and receive antennas, and $\{\Lambda_{i}^{2},i=1,\ldots,n_{\min}\}$ are eigenvalues of $HH^{T}$. Note that it is common to think that as $P\to\infty$ the expression (7) scales as $n_{\min}\log P$, but this is only true if $\mathbb{P}[\mathop{\rm rank}H=n_{\min}]=1$.
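The capacity expression, a sum of $\mathbb{E}[C_{AWGN}(\frac{P}{n_{t}}\Lambda_{i}^{2})]$ over the eigenvalues of $HH^{T}$, can be estimated by Monte Carlo. The sketch below is an illustration under stated assumptions: it takes $H$ with i.i.d. $\mathcal{N}(0,1)$ entries (one isotropic example, not prescribed by the text), and it avoids an eigenvalue solver by using the standard identity $\sum_{i}\frac{1}{2}\log(1+\frac{P}{n_{t}}\Lambda_{i}^{2})=\frac{1}{2}\log\det(I_{n_{r}}+\frac{P}{n_{t}}HH^{T})$, with the determinant obtained from a Cholesky factorization. Function names are hypothetical and results are in bits.

```python
import random
from math import log2, sqrt


def log2_det_spd(a):
    """log2 of the determinant of a symmetric positive-definite matrix (Cholesky)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    acc = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][i] = sqrt(s)
                acc += 2.0 * log2(low[i][i])  # det = product of squared diagonal
            else:
                low[i][j] = s / low[j][j]
    return acc


def capacity_mc(n_t, n_r, P, trials=2000, seed=0):
    """Monte Carlo estimate of C(P) in bits, assuming i.i.d. N(0,1) fading entries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h = [[rng.gauss(0.0, 1.0) for _ in range(n_t)] for _ in range(n_r)]
        # gram = I + (P / n_t) * H H^T, an n_r x n_r positive-definite matrix.
        gram = [[(1.0 if i == j else 0.0)
                 + (P / n_t) * sum(h[i][k] * h[j][k] for k in range(n_t))
                 for j in range(n_r)] for i in range(n_r)]
        total += 0.5 * log2_det_spd(gram)
    return total / trials


c_siso = capacity_mc(1, 1, 10.0)
c_mimo = capacity_mc(2, 2, 10.0)
print(f"1x1: {c_siso:.3f} bits, 2x2: {c_mimo:.3f} bits")
```

Zero eigenvalues of $HH^{T}$ (present whenever $n_{r}>n_{t}$) contribute $\log 1=0$ to the determinant, so the estimate agrees with the sum over $n_{\min}$ terms.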

The goal of this line of work is to characterize the dispersion of the present channel. Since the channel is memoryless it is natural to expect, given the results in [1, 10], that dispersion (for $\epsilon<1/2$ ) is given by

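$$V(P)\stackrel{\triangle}{=}\inf_{P_{X}:\,I(X;Y\mid H)=C}\frac{1}{T}\mathbb{E}\left[\mathrm{Var}\left(i(X;Y,H)\mid X\right)\right] \tag{8}$$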

where we denoted (single $T$ -block) information density by

$$i(x;y,h)\stackrel{\triangle}{=}\log\frac{dP_{Y,H\mid X=x}}{dP_{Y,H}^{*}}(y,h) \tag{9}$$

and $P_{Y,H}^{*}$ is the capacity achieving output distribution (caod). Justification of (8) as the actual (operational) dispersion, appearing in the expansion of $\log M^{*}(n,\epsilon)$ is by no means trivial and is the subject of this work.

II-B Statement of Main Theorems

Here we formally state the main results, then go into more detail in the following sections. Our first result is an achievability and partial converse bound for the MIMO-BF fading channel for fixed parameters $n_{t},n_{r},T,P$ .

Theorem 1.

For the MIMO-BF channel, there exists an $(nT,M,\epsilon,P)_{CSIR}$ maximal probability of error code with $0<\epsilon<1/2$ satisfying

$$\log M\geq nTC(P)-\sqrt{nTV(P)}\,Q^{-1}(\epsilon)+o(\sqrt{n}). \tag{10}$$

Furthermore, for any $\delta_{n}\to 0$ there exists $\delta^{\prime}_{n}\to 0$ so that every $(nT,M,\epsilon,P)_{CSIR}$ code with extra constraint that $\max_{j}\|x^{j}\|_{F}\leq\delta_{n}n^{1/4}$ , must satisfy

$$\log M\leq nTC(P)-\sqrt{nTV(P)}\,Q^{-1}(\epsilon)+\delta_{n}^{\prime}\sqrt{n} \tag{11}$$

where the capacity $C(P)$ is given by (6) and dispersion $V(P)$ by (8). For the explicit expression for $i(x;y,h)$, see Section III-C below.

Proof.

This follows from Theorem 16 and Theorem 19 below. ∎

Remark 1.

Note that the converse has an extra constraint $\max_{j}\|x^{j}\|_{F}\leq\delta_{n}n^{1/4}$. Mathematically, this constraint is needed so that the $n$-fold information density $i(x^{n};Y^{n},H^{n})$ behaves Gaussian-like, via the Berry-Esseen theorem. For example, if $x^{n}$ had $x_{11}=\sqrt{nTP}$ and zeroes in all other coordinates, then one term in the information density would be $O(n)$ while the rest would be $O(1)$, and hence no asymptotic structure would emerge. All known bounds to obtain the channel dispersion rely on approximating the information density by a Gaussian, and hence a fundamentally different method of analysis is needed to handle the situation where $\max_{j}\|x^{j}\|_{F}\geq\delta_{n}n^{1/4}$.

Note that to violate this constraint, a significant portion of the power budget must be poured into a single coherent block, which 1) creates a very large peak-to-average power ratio (PAPR) – an illegal (for regulating bodies) or impractical (for power amplifiers) situation, and 2) does a poor job of exploiting the diversity gain from coding over multiple independent coherent blocks. Therefore, our converse results are sufficient from the point of view of any practical system.

In addition, the random codebook used for the achievability (uniform on the power sphere) can be expurgated with a rate loss of $-\delta_{n}^{2}n^{-\tfrac{1}{2}}$ so that it entirely consists of codewords satisfying $\max_{j}\|x_{j}\|_{F}\leq\delta_{n}n^{1/4}$ . This is easiest to see by noticing that a standard Gaussian vector $Z^{n}$ satisfies $\mathbb{P}[\|Z^{n}\|_{\infty}>\delta_{n}n^{1/4}]\leq e^{-O(\delta_{n}^{2}\sqrt{n})}$ . This observation shows that our analysis of the random coding bound (with spherical codebook) is tight in terms of the dispersion term.
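The two constraints discussed in this remark, the total power budget and the peak bound $\max_{j}\|x_{j}\|_{F}\leq\delta_{n}n^{1/4}$, can be checked on concrete codewords. The sketch below is illustrative: it draws a codeword uniformly on the power sphere by normalizing a Gaussian vector, and compares it with the pathological codeword from the remark that puts all power into $x_{11}$. The function names are hypothetical, and `delta` is a fixed illustrative number standing in for the vanishing sequence $\delta_{n}$.

```python
import random
from math import sqrt


def frob_sq(m):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in m for v in row)


def spherical_codeword(n, n_t, T, P, rng):
    """n blocks of size n_t x T, uniform on the sphere sum_j ||X_j||_F^2 = n*T*P."""
    flat = [rng.gauss(0.0, 1.0) for _ in range(n * n_t * T)]
    scale = sqrt(n * T * P / sum(v * v for v in flat))
    flat = [v * scale for v in flat]
    size = n_t * T
    return [[flat[j * size + i * T: j * size + (i + 1) * T] for i in range(n_t)]
            for j in range(n)]


def total_power_ok(code, T, P, tol=1e-9):
    """Power constraint: sum_j ||X_j||_F^2 <= n*T*P (up to rounding)."""
    return sum(frob_sq(x) for x in code) <= len(code) * T * P * (1.0 + tol)


def peak_ok(code, delta):
    """Peak constraint of the converse: max_j ||x_j||_F <= delta * n^(1/4)."""
    return max(sqrt(frob_sq(x)) for x in code) <= delta * len(code) ** 0.25


n, n_t, T, P = 400, 2, 2, 1.0
code = spherical_codeword(n, n_t, T, P, random.Random(1))
spike = [[[0.0] * T for _ in range(n_t)] for _ in range(n)]
spike[0][0][0] = sqrt(n * T * P)  # all power in a single coordinate
print(peak_ok(code, 1.0), peak_ok(spike, 1.0))
```

Both codewords meet the total power budget; only the concentrated one has a peak block norm of order $\sqrt{n}$, far above $n^{1/4}$.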

Remark 2.

The remainder term $o(\sqrt{n})$ in (11) depends on the system parameters $(n_{t},n_{r},T,P_{H})$ in a complicated way, which we do not attempt to study here.

The behavior of dispersion found in Theorem 1 turns out to depend crucially on whether $\mathop{\rm rank}(H)\leq 1$ a.s. or not. When $\mathop{\rm rank}(H)>1$ , all capacity achieving input distributions (caids) yield the same conditional variance (8), yet when $\mathop{\rm rank}(H)\leq 1$ , the conditional variance varies over the set of caids. The following theorem discusses the case where $\mathbb{P}[\mathop{\rm rank}H>1]>0$ . In this case, the dispersion (8) can be calculated for the simplest Telatar caid (i.i.d. Gaussian matrix $X$ ). The following theorem gives full details.

Theorem 2.

Assume that $\mathbb{P}[\mathop{\rm rank}H>1]>0$ , then $V(P)=V_{iid}(P)$ , where

[TABLE]

where $\{\Lambda_{i}^{2},i=1,\ldots,n_{\min}\}$ are eigenvalues of $HH^{T}$ , $V_{AWGN}(P)=\frac{\log^{2}e}{2}\left(1-\frac{1}{\left(1+P\right)^{2}}\right)$ , and

[TABLE]

Proof.

This is proved in Proposition 11 below. ∎

Remark 3.

Each of the three terms in (12) is non-negative, see Remark 7 below for more details.

In the case where the fading process has rank 1 (e.g. for MISO systems), there are a multitude of caids, and the minimization problem in (8) is non-trivial. Quite surprisingly, for some values of $n_{t},T$ , we show that the (essentially unique) minimizer is a full-rate orthogonal design. The latter were introduced into the field of communications by Alamouti [8] and Tarokh et al [9]. This shows a somewhat unexpected connection between schemes optimal from modulation-theoretic and information-theoretic points of view. The precise results are as follows.

Theorem 3.

When $\mathbb{P}[\text{rank}(H)\leq 1]=1$ , we have

[TABLE]

where $\Lambda^{2}$ is the non-zero eigenvalue of $HH^{T}$, and

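$$v^{*}(n_{t},T)=\frac{n_{t}^{2}}{2P^{2}}\max_{P_{X}:\,I(X;Y,H)=C}\mathrm{Var}\left(\|X\|_{F}^{2}\right) \tag{18}$$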

Proof.

This is the content of Proposition 12 below.∎

The quantity $v^{*}(n_{t},T)$ is defined separately in Theorem 3 because it isolates how the dispersion depends on the input distribution. Unfortunately, $v^{*}(n_{t},T)$ is generally unknown, since the maximization in (18) is over a manifold of matrix-valued random variables. However, for many dimensions, the maximum can be found by invoking the Hurwitz-Radon theorem [24]. We state this below to introduce the notation, and expand on it in Section VI.

Theorem 4 (Hurwitz-Radon).

There exists a family of $n\times n$ real matrices $V_{1},\ldots,V_{k}$ satisfying

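$$V_{i}^{T}V_{i}=I_{n},\quad i=1,\dots,k \tag{19}$$

$$V_{i}^{T}V_{j}+V_{j}^{T}V_{i}=0,\quad i\neq j \tag{20}$$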

if and only if $k\leq\rho(n)$ , where

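$$\rho(2^{a}b)=8\left\lfloor\frac{a}{4}\right\rfloor+2^{a\bmod 4},\quad a,b\in\mathbb{Z},\ b\text{ odd}. \tag{21}$$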

In particular, $\rho(n)\leq n$ and $\rho(n)=n$ only for $n=1,2,4,8$ .
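The Hurwitz-Radon function is straightforward to compute from the formula $\rho(2^{a}b)=8\lfloor a/4\rfloor+2^{a\bmod 4}$ with $b$ odd. A minimal sketch follows (the function name is hypothetical); the final line lists the values of $n$ up to 1024 for which $\rho(n)=n$.

```python
def hurwitz_radon(n: int) -> int:
    """Hurwitz-Radon number rho(n) for n = 2^a * b with b odd."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    a = 0
    while n % 2 == 0:  # strip factors of two to find the exponent a
        n //= 2
        a += 1
    return 8 * (a // 4) + 2 ** (a % 4)


full_size = [n for n in range(1, 1025) if hurwitz_radon(n) == n]
print(full_size)
```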

For a concrete example, note that Alamouti’s scheme is created from a Hurwitz-Radon family for $n=k=2$ . Indeed, take the matrices

$$V_{1}=I_{2},\quad V_{2}=\begin{bmatrix}0&1\\ -1&0\end{bmatrix},$$

then Alamouti’s orthogonal design can be formed by taking $aV_{1}+bV_{2}$ . It turns out that “maximal” Hurwitz-Radon families give capacity achieving input distributions for the MIMO-BF channel, see Proposition 22 for the details.
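The defining conditions of a Hurwitz-Radon family, $V_{i}^{T}V_{i}=I_{n}$ and $V_{i}^{T}V_{j}+V_{j}^{T}V_{i}=0$ for $i\neq j$, can be verified directly for the pair $V_{1}=I_{2}$, $V_{2}=\begin{bmatrix}0&1\\ -1&0\end{bmatrix}$. The sketch below (helper names are hypothetical) uses exact integer arithmetic and also checks that the design $aV_{1}+bV_{2}$ has orthogonal rows of equal norm.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def is_hurwitz_radon_family(mats):
    """Check V_i^T V_i = I for all i and V_i^T V_j + V_j^T V_i = 0 for i != j."""
    n = len(mats[0])
    identity = [[int(r == c) for c in range(n)] for r in range(n)]
    zero = [[0] * n for _ in range(n)]
    for i, vi in enumerate(mats):
        if matmul(transpose(vi), vi) != identity:
            return False
        for vj in mats[i + 1:]:
            cross = matmul(transpose(vi), vj)
            anti = matmul(transpose(vj), vi)
            if [[cross[r][c] + anti[r][c] for c in range(n)] for r in range(n)] != zero:
                return False
    return True


V1 = [[1, 0], [0, 1]]
V2 = [[0, 1], [-1, 0]]


def design(a, b):
    """The 2 x 2 orthogonal design a*V1 + b*V2."""
    return [[a * V1[r][c] + b * V2[r][c] for c in range(2)] for r in range(2)]


x = design(3, 4)
print(is_hurwitz_radon_family([V1, V2]), x, matmul(x, transpose(x)))
```

For any real $a,b$ the product $XX^{T}$ equals $(a^{2}+b^{2})I_{2}$, which is the orthogonality property that gives such designs their name.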

The following theorem summarizes our current knowledge of $v^{*}(n_{t},T)$ .

Theorem 5.

For any pair of positive integers $n_{t},T$ we have

$$v^{*}(T,n_{t})=v^{*}(n_{t},T)\leq n_{t}T\min(n_{t},T). \tag{22}$$

If $n_{t}\leq\rho(T)$ or $T\leq\rho(n_{t})$ then a full-rate orthogonal design is dispersion-optimal and

$$v^{*}(n_{t},T)=n_{t}T\min(n_{t},T). \tag{23}$$

If instead $n_{t}>\rho(T)$ and $T>\rho(n_{t})$ then for a jointly-Gaussian capacity-achieving input $X$ we have

$$\frac{n_{t}^{2}}{2P^{2}}\mathrm{Var}\left(\|X\|_{F}^{2}\right)<n_{t}T\min(n_{t},T), \tag{24}$$

so that in these cases the bound (22) is either not tight or is achieved by a non-jointly-Gaussian caid.

Finally, if $n_{t}\leq T$ and (23) holds, then $v^{*}(n_{t}^{\prime},T)=n_{t}^{\prime 2}T$ for any $n_{t}^{\prime}\leq n_{t}$ (and similarly with the roles of $n_{t}$ and $T$ switched).
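The first two statements of Theorem 5 translate into a short rule: the value $n_{t}T\min(n_{t},T)$ is always an upper bound on $v^{*}(n_{t},T)$, and it is attained whenever $n_{t}\leq\rho(T)$ or $T\leq\rho(n_{t})$. The sketch below (function names are hypothetical) returns that number together with a flag saying whether this rule alone certifies equality. A `False` flag only means the rule does not apply; it does not mean the value is unknown, since Table 1 lists further entries obtained by other means.

```python
def hurwitz_radon(n: int) -> int:
    """Hurwitz-Radon number rho(n) for n = 2^a * b with b odd."""
    a = 0
    while n % 2 == 0:
        n //= 2
        a += 1
    return 8 * (a // 4) + 2 ** (a % 4)


def v_star_upper_bound(n_t: int, T: int) -> int:
    """Upper bound (22): n_t * T * min(n_t, T)."""
    return n_t * T * min(n_t, T)


def v_star_by_orthogonal_design(n_t: int, T: int):
    """Return (bound, attained) where attained is True if (23) applies."""
    attained = n_t <= hurwitz_radon(T) or T <= hurwitz_radon(n_t)
    return v_star_upper_bound(n_t, T), attained


for pair in [(2, 2), (3, 4), (5, 8), (3, 5), (2, 3)]:
    print(pair, v_star_by_orthogonal_design(*pair))
```

The entries with a `True` flag agree with the exact values in Table 1, and the entries with a `False` flag give the right endpoints of the intervals listed there, for example $45$ for $(n_{t},T)=(3,5)$.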

Note that the $\rho(n)$ function is monotonic in even values of $n$ (and is $1$ for $n$ odd), and $\rho(n)\to\infty$ along even $n$ . Therefore, for any number of transmit antennas $n_{t}$ , there is a large enough $T$ such that $n_{t}\leq\rho(T)$ , in which case an $n_{t}\times T$ full rate orthogonal design achieves the optimal $v^{*}(n_{t},T)$ .
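To see how an orthogonal design changes the quantity $\frac{n_{t}^{2}}{2P^{2}}\mathrm{Var}(\|X\|_{F}^{2})$ that is maximized in the definition of $v^{*}$, one can compare it for two jointly Gaussian inputs with $n_{t}=T=2$. The sketch below is illustrative and its names are hypothetical. It writes $X=\sum_{k}\xi_{k}M_{k}$ with $\xi_{k}$ i.i.d. $\mathcal{N}(0,1)$, so that $\|X\|_{F}^{2}=\xi^{T}G\xi$ with $G_{kl}=\langle M_{k},M_{l}\rangle$, and uses the standard Gaussian identity $\mathrm{Var}(\xi^{T}G\xi)=2\,\mathrm{tr}(G^{2})$ for symmetric $G$.

```python
from math import sqrt


def frob_inner(a, b):
    """Frobenius inner product of two matrices stored as lists of rows."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def normalized_norm_variance(basis, n_t, P):
    """n_t^2 / (2 P^2) * Var(||X||_F^2) for X = sum_k xi_k * basis[k]."""
    k = len(basis)
    gram = [[frob_inner(basis[r], basis[c]) for c in range(k)] for r in range(k)]
    trace_g2 = sum(gram[r][c] * gram[c][r] for r in range(k) for c in range(k))
    return n_t ** 2 / (2.0 * P ** 2) * 2.0 * trace_g2


def iid_basis(n_t, T, P):
    """Basis for the input with i.i.d. N(0, P/n_t) entries."""
    s = sqrt(P / n_t)
    basis = []
    for i in range(n_t):
        for j in range(T):
            m = [[0.0] * T for _ in range(n_t)]
            m[i][j] = s
            basis.append(m)
    return basis


def alamouti_basis(P):
    """Basis for X = a*V1 + b*V2 with a, b i.i.d. N(0, P/2)."""
    s = sqrt(P / 2.0)
    return [[[s, 0.0], [0.0, s]], [[0.0, s], [-s, 0.0]]]


P = 3.0
v_iid = normalized_norm_variance(iid_basis(2, 2, P), 2, P)
v_alamouti = normalized_norm_variance(alamouti_basis(P), 2, P)
print(v_iid, v_alamouti)
```

Up to rounding, the i.i.d. input gives $n_{t}T=4$ while the Alamouti-type input gives $n_{t}T\min(n_{t},T)=8$, the $(2,2)$ entry of Table 1. The orthogonal design ties the energy of the whole block to only two Gaussian variables, which is what makes $\|X\|_{F}^{2}$ more variable.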

III Preliminary results

This section gives some results that will be useful for the achievability and converse proofs (Theorem 16 and Theorem 19, respectively), along with generally aiding our understanding of the MIMO-BF channel at finite blocklength. The results in this section and where they are used are summarized as follows:

• Theorem 6 gives a characterization of the caids for the MIMO-BF channel. While all caids give the same capacity (by definition), when the channel matrix is rank 1, they do not all yield the same dispersion. This characterization is needed to reason about the minimizers in (8), especially in the rank 1 case.

• Proposition 8 computes the variance $V_{n}(x^{n})$ of the information density conditioned on the channel input $x^{n}$. A key characteristic of the fading channel is that $V_{n}(x^{n})$ varies as $x^{n}$ moves around the input space, which does not happen in DMCs or the AWGN channel. This variation in $V_{n}(x^{n})$ poses additional challenges in the converse proof, where we partition the codebook based on thresholding $V_{n}(x^{n})$ (see the proof of Theorem 19 for details). Knowledge of $V_{n}(x^{n})$ will also allow us to understand when the information density can be well approximated by a Gaussian (see Lemma 13).

• Propositions 11 and 12 explicitly give the expression for the dispersion found from the achievability and converse proofs for the $\mathop{\rm rank}(H)>1$ and $\mathop{\rm rank}(H)\leq 1$ case, respectively. These expressions show how the dispersion depends on $n_{t},n_{r},T,P$, and are the contents of Theorems 2 and 3 above.

III-A Known results: capacity and capacity achieving output distribution

First we review a few known results on the MIMO-BF channel. Since the channel is memoryless, the capacity is given by

[TABLE]

It was shown by Telatar [6] that whenever distribution of $H$ is isotropic, the input $X\in\mathbb{R}^{n_{t}\times T}$ with entry $i,j$ given by

$$X_{i,j}\stackrel{iid}{\sim}\mathcal{N}\left(0,\frac{P}{n_{t}}\right), \tag{26}$$

is a maximizer, resulting in the capacity formula (6). The distribution induced by a caid at the channel output $(Y,H)$ is called the capacity achieving output distribution (caod). A classical fact is that, while there may be many caids, the caod is unique, e.g. [25, Section 4.4]. Thus, from (26) we infer that the caod is given by

[TABLE]

$Y=[Y^{(1)},\ldots,Y^{(T)}]$ , where $Y^{(j)}$ is $j$ -th column of $Y$ , which, as we specified in (3), is a $n_{r}\times T$ matrix.

III-B Capacity achieving input distributions

A key feature of the MIMO-BF channel is that it has many caids, whereas many commonly studied channels (e.g. BSC, BEC, AWGN) have a unique caid. Understanding the set of distributions that achieve capacity is essential for reasoning about the minimizer of the conditional variance in (8). The following theorem characterizes the set of caids for the MIMO-BF channel. Somewhat surprisingly, for the case of rank-1 $H$ (e.g. for MISO) there are multiple non-trivial jointly Gaussian caids with different correlation structures. For example, space-time block codes can achieve the capacity in the rank 1 case, but do not achieve capacity when the rank is 2 or greater, e.g. [26].

Theorem 6.

1. Every caid $X$ satisfies

$$\forall a\in\mathbb{R}^{n_{t}},\ b\in\mathbb{R}^{T}:\quad\sum_{i=1}^{n_{t}}\sum_{j=1}^{T}a_{i}b_{j}X_{i,j}\sim\mathcal{N}\left(0,\frac{P}{n_{t}}\|a\|_{2}^{2}\|b\|_{2}^{2}\right). \tag{30}$$

If $\mathbb{P}[\mathop{\rm rank}H\leq 1]=1$ then condition (30) is also sufficient for $X$ to be caid.

2. Let $X=\begin{pmatrix}R_{1}\cr\cdots\cr R_{n_{t}}\end{pmatrix}$ be decomposed into rows $R_{i}$. If $X$ is a caid, then each $R_{i}\sim\mathcal{N}(0,{P\over n_{t}}I_{T})$ (i.i.d. Gaussian) and

[TABLE]

If $X$ is jointly zero-mean Gaussian and $\mathbb{P}[\mathop{\rm rank}H\leq 1]=1$, then (31)-(32) are sufficient for $X$ to be caid.

3. Let $X=(C_{1}\ldots C_{T})$ be decomposed into columns $C_{j}$. If $X$ is a caid, then each $C_{j}\sim\mathcal{N}(0,{P\over n_{t}}I_{n_{t}})$ (i.i.d. Gaussian) and

[TABLE]

If $X$ is jointly zero-mean Gaussian and $\mathbb{P}[\mathop{\rm rank}H\leq 1]=1$, then (33)-(34) are sufficient for $X$ to be caid.

4. When $\mathbb{P}[\mathop{\rm rank}H>1]>0$, any caid has pairwise independent rows:

[TABLE]

and in particular

[TABLE]

Therefore, among jointly Gaussian $X$ the i.i.d. $X_{i,j}$ is the unique caid.

5. There exist non-Gaussian caids if and only if $\mathbb{P}[\mathop{\rm rank}H\geq\min(n_{t},T)]=0$.
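For a jointly Gaussian zero-mean input, the condition that $\sum_{i,j}a_{i}b_{j}X_{i,j}\sim\mathcal{N}(0,\frac{P}{n_{t}}\|a\|_{2}^{2}\|b\|_{2}^{2})$ for all $a,b$ reduces to a statement about variances, because every linear form of $X$ is automatically zero-mean Gaussian. The sketch below tests that variance identity numerically at random $(a,b)$ for inputs of the form $X=\sum_{k}\xi_{k}M_{k}$ with $\xi_{k}$ i.i.d. $\mathcal{N}(0,1)$, in which case $\mathrm{Var}(a^{T}Xb)=\sum_{k}(a^{T}M_{k}b)^{2}$. It is an illustration with hypothetical names; a pass on random draws is numerical evidence for the identity, not a proof, and by the theorem the condition is sufficient for a caid only when $\mathop{\rm rank}H\leq 1$ almost surely.

```python
import random
from math import sqrt


def bilinear_variance(basis, a, b):
    """Var(a^T X b) for X = sum_k xi_k * basis[k] with xi_k i.i.d. N(0,1)."""
    total = 0.0
    for m in basis:
        coeff = sum(a[i] * m[i][j] * b[j]
                    for i in range(len(a)) for j in range(len(b)))
        total += coeff ** 2
    return total


def satisfies_variance_condition(basis, P, trials=200, seed=0, tol=1e-9):
    """Compare Var(a^T X b) with (P/n_t)*||a||^2*||b||^2 at random (a, b)."""
    n_t, T = len(basis[0]), len(basis[0][0])
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_t)]
        b = [rng.gauss(0.0, 1.0) for _ in range(T)]
        target = P / n_t * sum(v * v for v in a) * sum(v * v for v in b)
        if abs(bilinear_variance(basis, a, b) - target) > tol * max(1.0, target):
            return False
    return True


P = 2.0
s = sqrt(P / 2.0)
iid = [[[s, 0.0], [0.0, 0.0]], [[0.0, s], [0.0, 0.0]],
       [[0.0, 0.0], [s, 0.0]], [[0.0, 0.0], [0.0, s]]]
alamouti = [[[s, 0.0], [0.0, s]], [[0.0, s], [-s, 0.0]]]
repeated = [[[s, s], [0.0, 0.0]], [[0.0, 0.0], [s, s]]]  # each row repeats one symbol
print(satisfies_variance_condition(iid, P),
      satisfies_variance_condition(alamouti, P),
      satisfies_variance_condition(repeated, P))
```

The i.i.d. input and the Alamouti-type input both pass, while the input that simply repeats a symbol across the block fails, even though all three have entries distributed as $\mathcal{N}(0,P/n_{t})$.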

Remark 4.

(Special case of rank-1 $H$ ) In the MISO case when $n_{t}>1$ and $n_{r}=1$ (or more generally, $\mathop{\rm rank}H\leq 1$ a.s.), there is not only a multitude of caids, but in fact they can have non-trivial correlations between entries of $X$ (and this is ruled out by (36) for all other cases). As an example, for the $n_{t}=T=2$ case, any of the following random matrix-inputs $X$ (parameterized by $\rho\in[-1,1]$ ) is a Gaussian caid:

[TABLE]

where $\xi_{1},\xi_{2},\xi_{3},\xi_{4}\sim\mathcal{N}(0,1)$ are i.i.d. In particular, there are caids for which not all entries of $X$ are pairwise independent.

Remark 5.

Another way to state conditions (31)-(32) is: all elements in a row (resp. column) are pairwise independent $\sim\mathcal{N}(0,\frac{P}{n_{t}})$ and each $2\times 2$ minor has antipodal correlation for the two diagonals. In particular, if $X$ is a caid, then $X^{T}$ and any submatrix of $X$ are caids too (for different $n_{t}$ and $T$ ).

Proof.

We will rely repeatedly on the following observations:

1. If $A,B$ are two random vectors in $\mathbb{R}^{n}$ then for any $v\in\mathbb{R}^{n}$ we have

[TABLE]

This is easy to show by computing characteristic functions.

2. If $A,B$ are two random vectors in $\mathbb{R}^{n}$ independent of $Z\sim\mathcal{N}(0,I_{n})$, then

[TABLE]

This follows from the fact that the characteristic function of $Z$ is nowhere zero.

3. For two matrices $Q_{1},Q_{2}\in\mathbb{R}^{n\times n}$ we have

[TABLE]

This follows from the fact that a quadratic form that is zero everywhere on $\mathbb{R}^{n}$ must have all coefficients equal to zero.

Part 1 (necessity). Recall that the caod is unique and given by (27). Thus an input $X$ is a caid iff for $P_{H}$ -almost every $h_{0}\in\mathbb{R}^{n_{r}\times n_{t}}$ we have

[TABLE]

where $G$ is an $n_{t}\times T$ matrix with i.i.d. $\mathcal{N}(0,P/n_{t})$ entries (for sufficiency, just write $I(X;Y,H)=h(Y|H)-h(Z)$ with $h(\cdot)$ denoting differential entropy). We will argue next that (43) implies (under isotropy assumption on $P_{H}$ ) that

[TABLE]

From (40), (44) is equivalent to $\sum_{i,j}a_{i}b_{j}X_{i,j}\stackrel{{\scriptstyle d}}{{=}}\sum_{i,j}a_{i}b_{j}G_{i,j}$ for all $b\in\mathbb{R}^{T}$.

Let $E_{0}$ be a $P_{H}$-almost sure subset of $\mathbb{R}^{n_{r}\times n_{t}}$ for which (43) holds. Let $O(n)=\{U\in\mathbb{R}^{n\times n}:U^{T}U=UU^{T}=I_{n}\}$ denote the group of orthogonal matrices, with the topology inherited from $\mathbb{R}^{n\times n}$. Let $\{U_{k}\}$ and $\{V_{k}\}$ for $k\in\{1,2,\ldots\}$ be countable dense subsets of $O(n_{r})$ and $O(n_{t})$, respectively. (These exist since $\mathbb{R}^{n^{2}}$ is a second-countable topological space.) By isotropy of $P_{H}$ we have $P_{H}[U_{k}(E_{0})V_{l}]=1$ and therefore

[TABLE]

is also almost sure: $P_{H}[E]=1$, since $E$ is the intersection of countably many almost sure sets. Here, $U_{k}(E_{0})$ denotes the image of $E_{0}$ under $U_{k}$. By assumption (4), $E$ must contain a non-zero element $h_{0}$, for otherwise we would have $P_{H}[0]=1$, contradicting (4). Consequently, $h_{0}\in U_{k}(E_{0})V_{l}$ for all $k,l$, and so $U_{k}^{-1}h_{0}V_{l}^{-1}\in E_{0}$ for all $k,l$. Since for $U\in O(n)$, the map $U\mapsto U^{-1}$ is a bijective continuous transformation of $O(n)$, we have that $\{U_{k}^{-1}\}$ and $\{V_{l}^{-1}\}$ are also countable dense subsets of $O(n_{r})$ and $O(n_{t})$, respectively. From (41) and (43) along with the definition of $E_{0}$, we conclude that

[TABLE]

Arguing by continuity and using the density of $\{U_{k}^{-1}\}$ and $\{V_{l}^{-1}\}$ , this implies also

[TABLE]

In particular, for any $a\in\mathbb{R}^{n_{t}}$ there must exist a choice of $U,V$ such that $Uh_{0}V$ has the top row equal to $c_{0}a^{T}$ for some constant $c_{0}>0$ . Choosing these $U,V$ in (46) and comparing distributions of top rows, we conclude (44) after scaling by $1/c_{0}$ .

Part 1 (sufficiency). Suppose $\mathbb{P}[\mathop{\rm rank}H\leq 1]=1$ . Then our goal is to show that (44) implies that $X$ is a caid. To that end, it is sufficient to show $h_{0}X\stackrel{{\scriptstyle d}}{{=}}h_{0}G$ for all rank-1 $h_{0}$ . In the special case

[TABLE]

the claim follows directly from (44). Every other rank-1 $h_{0}^{\prime}$ can be decomposed as $h_{0}^{\prime}=Uh_{0}$ for some matrix $U$ , and thus again we get $Uh_{0}X\stackrel{{\scriptstyle d}}{{=}}Uh_{0}G$ , concluding the proof.

Parts 2 and 3 (necessity). From part 1 we have that for every $a,b$ we must have $a^{T}Xb\sim\mathcal{N}(0,\|a\|_{2}^{2}\|b\|_{2}^{2}{P\over n_{t}})$ . Computing expected square we get

[TABLE]

Thus, expressing the left-hand side in terms of rows $R_{i}$ as $a^{T}X=\sum_{i}a_{i}R_{i}$ we get

[TABLE]

and thus by (42) we conclude that for all $a$ :

[TABLE]

Each entry of the $T\times T$ matrices above is a quadratic form in $a$ and thus again by (42) we conclude (31)-(32). Part 3 is argued similarly with roles of $a$ and $b$ interchanged.

Parts 2 and 3 (sufficiency). When $H$ is (at most) rank-1, we have from part 1 that it is sufficient to show that $a^{T}Xb\sim\mathcal{N}(0,\|a\|_{2}^{2}\|b\|_{2}^{2}{P\over n_{t}})$ . When $X$ is jointly zero-mean Gaussian, we have $a^{T}Xb$ is zero-mean Gaussian and so we only need to check its second moment satisfies (47). But as we just argued, (47) is equivalent to either (31)-(32) or (33)-(34).

Part 4. As in Part 1, there must exist $h_{0}\in\mathbb{R}^{n_{r}\times n_{t}}$ such that (46) holds and $\mathop{\rm rank}h_{0}>1$ . Thus, by choosing $U,V$ we can diagonalize $h_{0}$ and thus we conclude any pair of rows $R_{i},R_{j}$ must be independent.

Part 5. This part is never used in subsequent parts of the paper, so we only sketch the argument and move the most technical part of the proof to Appendix A. Let $\ell=\max\{r:\mathbb{P}[\mathop{\rm rank}H\geq r]>0\}$ . Then arguing as for (46) we conclude that $X$ is a caid if and only if for any $h$ with $\mathop{\rm rank}h\leq\ell$ we have

[TABLE]

In other words, we have

[TABLE]

If $\ell=\min(n_{t},T)$ , then the rank condition on $a$ is not active, and hence we conclude by (40) that $X\stackrel{{\scriptstyle d}}{{=}}G$ . So assume $\ell<\min(n_{t},T)$ . Note that (48) is equivalent to the following condition on the characteristic function of $X$ :

[TABLE]

It is easy to find a polynomial (in $a_{i,j}$ ) that vanishes on all matrices of rank $\leq\ell$ (e.g. take the product of all $(\ell+1)\times(\ell+1)$ minors). Then Proposition 24 in Appendix A constructs a non-Gaussian $X$ satisfying (49) and hence (48). ∎

III-C Information density and its moments

In finite blocklength analysis, a key object of study is the information density, along with its first and second moments. In this section we derive expressions for these moments and show when the information density is asymptotically normal.

It will be convenient to assume that the matrix $H$ is represented as

[TABLE]

where $U,V$ are uniformly distributed on $O(n_{r})$ and $O(n_{t})$ , respectively (which follows from the isotropic assumption on $H$ ), and $\Lambda$ is the $n_{r}\times n_{t}$ diagonal matrix with diagonal entries $\{\Lambda_{i},i=1,\ldots,n_{\min}\}$ . (Recall that $O(m)=\{A\in\mathbb{R}^{m\times m}:AA^{T}=A^{T}A=I_{m}\}$ is the space of all orthogonal matrices; this space is compact in a natural topology and admits a Haar probability measure.) The joint distribution of $\{\Lambda_{i}\}$ depends on the fading model. It does not matter for our analysis whether the $\Lambda_{i}$ ’s are sorted in some way, or permutation-invariant.

For the MIMO-BF channel, let $P_{YH}^{*}$ denote the caod (27). To compute the information density with respect to $P_{YH}^{*}$ (for a single $T$ -block of symbols) as defined in (9), denote $y=hx+z$ and write a singular value decomposition (SVD) of the matrix $h$ as

[TABLE]

where $u\in O(n_{r})$ , $v\in O(n_{t})$ and $\lambda$ is an $n_{r}\times n_{t}$ matrix which is zero except for the diagonal entries, which are equal to $\lambda_{1},\ldots,\lambda_{n_{\min}}$ . Note that this representation is unique up to permutation of $\{\lambda_{j}\}$ , but the choice of this permutation will not affect any of the expressions below. With this decomposition we have:

[TABLE]

where we denoted by $v_{j}$ the $j$ -th column of $v$ , and have set $\tilde{z}=u^{T}z$ , with $\tilde{z}_{j}$ representing the $j$ -th row of $\tilde{z}$ . The definition naturally extends to blocks of length $nT$ additively:

[TABLE]

We compute the (conditional) mean of information density to get

[TABLE]

where we used the following simple fact:

Lemma 7.

Let $U\in\mathbb{R}^{1\times n_{t}}$ be uniformly distributed on the unit sphere, and $x\in\mathbb{R}^{n_{t}\times T}$ be a fixed matrix, then

[TABLE]

Proof.

Note that by additivity of $\|Ux\|^{2}$ across columns, it is sufficient to consider the case $T=1$ , for which the statement is clear from symmetry. ∎

Remark 6.

A simple consequence of Lemma 7 is $\mathbb{E}[\|Hx\|_{F}^{2}]=\mathbb{E}[\|H\|_{F}^{2}]\frac{\|x\|_{F}^{2}}{n_{t}}$ , which follows from considering the SVD of $H$ .
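Lemma 7 can be sanity-checked numerically. The sketch below is only an illustration: the helper names, the test matrix `x_example`, and the number of trials are all hypothetical choices, and the identity being checked is $\mathbb{E}[\|Ux\|^{2}]=\|x\|_{F}^{2}/n_{t}$ , which is how Lemma 7 is used in Remark 6. A uniform point on the sphere is drawn by normalizing a standard Gaussian vector.

```python
import math
import random

def random_unit_row(n_t, rng):
    """Draw a row vector uniformly from the unit sphere in R^{n_t}."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n_t)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def sq_norm_Ux(u, x):
    """||u x||^2 for a 1 x n_t row u and an n_t x T matrix x (list of rows)."""
    T = len(x[0])
    return sum(sum(u[i] * x[i][t] for i in range(len(u))) ** 2 for t in range(T))

def frobenius_sq(x):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(v * v for row in x for v in row)

def estimate_mean_sq_norm(x, trials=20000, seed=0):
    """Monte Carlo estimate of E[||U x||^2] with U uniform on the sphere."""
    rng = random.Random(seed)
    n_t = len(x)
    total = sum(sq_norm_Ux(random_unit_row(n_t, rng), x) for _ in range(trials))
    return total / trials

# Arbitrary fixed input with n_t = 3 transmit antennas and T = 2 (illustrative).
x_example = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
estimate = estimate_mean_sq_norm(x_example)
exact = frobenius_sq(x_example) / len(x_example)
print(estimate, exact)
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `trials` grows.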

Proposition 8.

Let $V_{n}(x^{n})\stackrel{{\scriptstyle\triangle}}{{=}}\frac{1}{nT}\mathrm{Var}(i(X^{n};Y^{n},H^{n})|X^{n}=x^{n})$ , then we have

[TABLE]

where the function $V_{1}:\mathbb{R}^{n_{t}\times T}\mapsto\mathbb{R}$ defined as $V_{1}(x)\triangleq\frac{1}{T}\mathrm{Var}(i(X;Y,H)|X=x)$ is given by

[TABLE]

where $c(\cdot)$ was defined in (13) and

[TABLE]

Remark 7.

Every term in the definition of $V_{1}(x)$ (except the one with $\eta_{5}$ ) is non-negative (for $\eta_{4}$ -term, see (90)). The $\eta_{5}$ -term will not be important because for inputs satisfying power-constraint with equality it vanishes. Note also that the first term in (65) can alternatively be given as

[TABLE]

Proof.

The form of the information density was derived earlier in this section. First note that the information density over $n$ channel uses decomposes into a sum of $n$ independent terms,

[TABLE]

As such, the variance conditioned on $x^{n}$ also decomposes as

[TABLE]

from which (56) follows. Because the variance decomposes as a sum in (67), we focus on only computing $\mathrm{Var}(i(x;Y,H))$ for a single coherent block. Define

[TABLE]

so that $i(x;y,h)=f(h)+g(x,h,z)$ in the notation of the information density expression derived earlier in this section. With this, the quantity of interest is

[TABLE]

where (71) follows from the identity

[TABLE]

Below we show that $T_{1}$ and $T_{3}$ correspond to (59), $T_{2}$ corresponds to (57), $T_{4}$ corresponds to (58), and $T_{3}$ corresponds to (60) and (61). We evaluate each term separately.

[TABLE]

where (75) follows from noting that

[TABLE]

Now, since $V_{k}$ is independent of $\Lambda_{k}$ by the rotational invariance assumption, we have that $f(H)$ is independent of $V_{k}$ , since $f(H)$ depends on $H$ only through its singular values, see (62). We are only concerned with the expectation over $g(x,H,Z)$ in (74), which reduces to

[TABLE]

giving (75).

Next, $T_{2}$ in (71) becomes

[TABLE]

For $T_{3}$ in (71), we obtain

[TABLE]

where

•

(82) follows from taking the variance over $\tilde{Z}$ (recall that $\tilde{Z}=U^{T}Z$ , as in the expression for the information density).

•

(83) follows from Lemma 7 applied to $\mathbb{E}[\|V_{k}^{T}x\|^{2}]$ , and adding and subtracting the term

[TABLE]

Continuing with $T_{3}$ from (71),

[TABLE]

where

•

(87) follows from taking the expectation over $\tilde{Z}$ ,

•

(88) follows from applying the variance identity (72) with respect to $V$ and $\Lambda_{1},\ldots,\Lambda_{n_{\min}}$ , as well as recalling (63).

We are left to show that the term (88) equals (61). To that end, define

[TABLE]

We will finish the proof by showing

[TABLE]

To that end, we first compute moments of $V$ drawn from the Haar measure on the orthogonal group.

Lemma 9.

Let $V$ be drawn from the Haar measure on $O(n)$ , then for $i,j,k,l=1,\ldots,n$ all distinct,

[TABLE]

Proof of this Lemma is given below.

First, note that the variance $\mathrm{Var}(\|V_{k}^{T}x\|^{2})$ does not depend on $k$ , since the marginal distribution of each $V_{k}$ is uniform on the unit sphere. Hence below we only consider $V_{1}$ . We obtain

[TABLE]

where $r_{j}$ denotes the $j$ -th row of $x$ . Now it is a matter of counting similar terms.

[TABLE]

where

•

(100) follows from collecting like terms from the summation in (99).

•

(101) uses Lemma 9 to compute each expectation.

•

(102) follows from realizing that

[TABLE]

Plugging this back into (97) yields the variance term,

[TABLE]

Now we compute the covariance term from (90) in a similar way. By symmetry of the columns of $V$ , we can consider only the covariance between $\|V_{1}^{T}x\|^{2}$ and $\|V_{2}^{T}x\|^{2}$ , i.e.

[TABLE]

Expanding the expectation, we get

[TABLE]

With this, we obtain from (106),

[TABLE]

where the steps follow just as in the variance computation (100)-(102).

Finally, returning to (90), using the variance (105) and covariance (112), we obtain

[TABLE]

Plugging this into (88) finishes the proof. ∎

Proof of Lemma 9.

We first note that all entries of $V$ have identical distribution, since permutations of rows and columns leave the distribution invariant. Because of this, we can WLOG only consider $V_{11},V_{12},V_{21},V_{22}$ to prove the lemma.

•

(91) follows immediately from $\sum_{i=1}^{n}V_{ij}^{2}=1$ a.s.

•

Let $V_{i},V_{j}$ be any two distinct columns of $V$ , then (92) follows from

[TABLE]

•

For (93) and (95), let $E_{1}=\mathbb{E}[V_{11}^{4}]$ and $E_{2}=\mathbb{E}[V_{11}^{2}V_{21}^{2}]$ . The following relations between $E_{1},E_{2}$ hold,

[TABLE]

Next, consider multiplying $V$ by the matrix

[TABLE]

where $I_{n}$ is the $n\times n$ identity matrix. This matrix is orthogonal, so the product is again Haar distributed, and we obtain the relation

[TABLE]

from which we obtain $E_{1}=3E_{2}$ . With this and (116), we obtain

[TABLE]

•

For (94), take

[TABLE]

Solving for $E_{3}$ yields (94).

•

For (96), let $V_{1},V_{2}$ denote the first and second column of $V$ respectively, and let $E_{4}=\mathbb{E}[V_{11}V_{12}V_{21}V_{22}]$ , then (96) follows from

[TABLE]

Using $E_{2}$ from (124) and solving for $E_{4}$ gives (96).

∎
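The values of $E_{1}$ and $E_{2}$ can be checked by simulation. The sketch below rests on two assumptions that are not quoted from the elided displays: that the first relation between $E_{1}$ and $E_{2}$ is the normalization identity $nE_{1}+n(n-1)E_{2}=1$ (obtained by squaring $\sum_{i}V_{i1}^{2}=1$ ), and that it is combined with $E_{1}=3E_{2}$ , which gives $E_{2}=\frac{1}{n(n+2)}$ and $E_{1}=\frac{3}{n(n+2)}$ . Since a single column of a Haar-distributed $V$ is uniform on the unit sphere, only one column needs to be sampled. All function names are hypothetical.

```python
import math
import random

def uniform_on_sphere(n, rng):
    """One column of a Haar orthogonal matrix: a uniform point on the unit sphere."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def column_moments(n, trials=100000, seed=1):
    """Monte Carlo estimates of E[V11^2], E[V11^4] and E[V11^2 V21^2]."""
    rng = random.Random(seed)
    m2 = m4 = m22 = 0.0
    for _ in range(trials):
        v = uniform_on_sphere(n, rng)
        m2 += v[0] ** 2
        m4 += v[0] ** 4
        m22 += v[0] ** 2 * v[1] ** 2
    return m2 / trials, m4 / trials, m22 / trials

n = 4
m2, e1, e2 = column_moments(n)
e2_closed = 1.0 / (n * (n + 2))   # assumed closed form for E2
e1_closed = 3.0 * e2_closed       # from E1 = 3 * E2
print(m2, 1.0 / n)
print(e1, e1_closed)
print(e2, e2_closed)
```

Each printed pair should match up to sampling noise; the mixed moments across two different columns, which involve $E_{3}$ and $E_{4}$ , would require sampling a full orthogonal matrix and are not covered here.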

The following proposition gives the value of the conditional variance of the information density when the input distribution has i.i.d. $\mathcal{N}(0,P/n_{t})$ entries. This will turn out to be the operational dispersion in the case where $\mathop{\rm rank}H>1$ .

Proposition 10.

Let $X^{n}=(X_{1},\ldots,X_{n})$ be i.i.d. with Telatar distribution (26) for each entry. Then

[TABLE]

where $V_{iid}(P)$ is the right-hand side of (12).

Proof.

To show this, we take the expectation of the expression given in Proposition 8 when $X^{n}$ has i.i.d. $\mathcal{N}(0,P/n_{t})$ entries. The terms (57) and (58) do not depend on $X^{n}$ , and these give us the first two terms in (12). (59) vanishes immediately, since $\mathbb{E}[\|X\|_{F}^{2}]=TP$ by the power constraint. It is left to compute the expectation over (60) and (61) from the expression in Proposition 8. Using identities for $\chi^{2}$ distributed random variables (namely, $\mathbb{E}\,[\chi^{2}_{k}]=k$ , $\mathrm{Var}(\chi^{2}_{k})=2k$ ), we get:

[TABLE]

Hence, the sum of terms in (60) + (61) after taking expectation over $X^{n}$ yields

[TABLE]

Introducing random variables $U_{i}=c(\Lambda_{i}^{2})$ , the expression in the square brackets equals

[TABLE]

At the same time, the third term in expression (12) is

[TABLE]

One easily checks that (135) and (136) are equal. ∎
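For reference, the two $\chi^{2}$ identities quoted in the proof translate into the following moments of $\|X\|_{F}^{2}$ for the i.i.d. Gaussian input. This is a short worked computation under the stated assumption on $X$ , not a reproduction of the elided displays.

```latex
% Assumption: X has n_t T i.i.d. N(0, P/n_t) entries, so that \|X\|_F^2
% is distributed as (P/n_t) times a chi-square with n_t T degrees of freedom.
\|X\|_{F}^{2} \stackrel{d}{=} \frac{P}{n_{t}}\,\chi^{2}_{n_{t}T}
\quad\Longrightarrow\quad
\mathbb{E}\bigl[\|X\|_{F}^{2}\bigr] = \frac{P}{n_{t}}\cdot n_{t}T = TP,
\qquad
\mathrm{Var}\bigl(\|X\|_{F}^{2}\bigr) = \frac{P^{2}}{n_{t}^{2}}\cdot 2n_{t}T = \frac{2TP^{2}}{n_{t}} .
```

The first identity is what makes the term (59) vanish; the second is the value of $\mathrm{Var}(\|X\|_{F}^{2})$ for this particular caid.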

The next proposition shows that, when the rank of $H$ is larger than $1$ , the conditional variance in (8) is constant over the set of caids. Thus we can compute the conditional variance for the i.i.d. $\mathcal{N}(0,P/n_{t})$ caid, and conclude that this expression is the minimizer in (8).

Proposition 11.

If $\mathbb{P}[\mathop{\rm rank}H>1]>0$ , then for any caid $X\sim P_{X}$ we have

[TABLE]

In particular, the $V(P)$ defined as infimum over all caids (8) satisfies $V(P)=V_{iid}(P)$ .

Proof.

For any caid the term (59) vanishes. Let $X^{*}$ be Telatar distributed. To analyze (60) we recall that from (36) we have

[TABLE]

For the term (61) we notice that

[TABLE]

where $R_{i}$ is the $i$ -th row of $X$ . By (35) from Theorem 6 we then also have

[TABLE]

To conclude, $\mathbb{E}\,[V_{1}(X)]=\mathbb{E}\,[V_{1}(X^{*})]=V_{iid}(P)$ . ∎

In the case where $\mathop{\rm rank}H\leq 1$ , it turns out that the conditional variance does vary over the set of caids. The following proposition gives the expression for the conditional variance in this case, as a function of the caid.

Proposition 12.

If $\mathbb{P}[\text{rank}(H)\leq 1]=1$ , then for any capacity achieving input $X$ we have

[TABLE]

where $\eta_{1},\eta_{2}$ are defined in (14)-(15).

Proof.

As in Prop. 10 we need to evaluate the expectation of terms in (59)-(61). Any caid $X$ should satisfy $\mathbb{E}\,[\|X\|_{F}^{2}]=TP$ and thus the term (59) is zero. The term (60) can be expressed in terms of $\mathrm{Var}(\|X\|_{F}^{2})$ , but the term (61) presents a non-trivial complication due to the presence of $\|XX^{T}\|_{F}^{2}$ , whose expectation is possible (but rather tedious) to compute by invoking properties of caids established in Theorem 6. Instead, we recall that the sum (60)+(61) equals (88). Evaluation of the latter can be simplified in this case due to the constraint on the rank of $H$ . Overall, we get

[TABLE]

where $c(\cdot)$ is from (13). The last term in (140) can be written as

[TABLE]

which follows from the identity $\mathrm{Var}(AB)=\mathbb{E}[A^{2}]\mathbb{E}[B^{2}]-\mathbb{E}^{2}[A]\mathbb{E}^{2}[B]$ for independent $A,B$ . The second term in (141) is easily handled since from Lemma 7 we have $\mathbb{E}[\|V_{1}^{T}X\|_{F}^{2}|X]=\|X\|_{F}^{2}/n_{t}$ . To compute the first term in (141) recall from Theorem 6 that for any fixed unit-norm $v$ and caid $X$ we must have $v^{T}X\sim\mathcal{N}(0,\frac{P}{n_{t}}I_{T})$ . Therefore, we have

[TABLE]

Putting everything together we get that (141) equals

[TABLE]

concluding the proof. ∎
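The identity $\mathrm{Var}(AB)=\mathbb{E}[A^{2}]\mathbb{E}[B^{2}]-\mathbb{E}^{2}[A]\mathbb{E}^{2}[B]$ for independent $A,B$ , used in the proof above, can be verified exactly on small finite distributions. The sketch below uses rational arithmetic so that no rounding is involved; the two distributions `A` and `B` are arbitrary illustrative choices and all helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def moment(dist, k):
    """k-th moment of a finite distribution given as {value: probability}."""
    return sum(p * v ** k for v, p in dist.items())

def var_of_product(dist_a, dist_b):
    """Var(AB) computed directly from the joint law of independent A and B."""
    pairs = list(product(dist_a.items(), dist_b.items()))
    mean = sum(pa * pb * a * b for (a, pa), (b, pb) in pairs)
    second = sum(pa * pb * (a * b) ** 2 for (a, pa), (b, pb) in pairs)
    return second - mean ** 2

def var_via_identity(dist_a, dist_b):
    """Right-hand side of the identity: E[A^2] E[B^2] - E[A]^2 E[B]^2."""
    return (moment(dist_a, 2) * moment(dist_b, 2)
            - moment(dist_a, 1) ** 2 * moment(dist_b, 1) ** 2)

# Two arbitrary finite distributions with exact rational probabilities.
A = {Fraction(0): Fraction(1, 4), Fraction(1): Fraction(1, 2), Fraction(3): Fraction(1, 4)}
B = {Fraction(-1): Fraction(1, 3), Fraction(2): Fraction(2, 3)}
direct = var_of_product(A, B)
identity = var_via_identity(A, B)
print(direct, identity)
```

Independence is essential here: the identity is simply $\mathbb{E}[A^{2}B^{2}]-\mathbb{E}^{2}[AB]$ with both expectations factored.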

The question at hand is: which input distribution $X$ that achieves capacity minimizes (137)? Proposition 12 reduces this problem to maximizing $\mathrm{Var}(\|X\|_{F}^{2})$ over the set of capacity achieving input distributions. This will be analyzed in Section VI.

Finally, the following lemma bounds the Berry-Esseen ratio. This is a technical result that will be needed for both the achievability and converse proofs.

Lemma 13.

Fix $x_{1},\ldots,x_{n}\in\mathbb{R}^{n_{t}\times T}$ and let $W_{j}=i(x_{j};Y_{j},H_{j})$ , where $Y_{j},H_{j}$ are distributed as the output of channel (3) with input $x_{j}$ . Define the Berry-Esseen ratio

[TABLE]

Then whenever $\sum_{j=1}^{n}\|x_{j}\|_{F}^{2}=nTP$ and $\max_{j}\|x_{j}\|_{F}\leq\delta n^{1\over 4}$ we have

[TABLE]

where $K_{1},K_{2},K_{3}>0$ are constants which only depend on channel parameters but not $x^{n}$ or $n$ .

The proof of Lemma 13 can be found in Appendix B.

III-D Hypothesis testing

Many finite blocklength results are derived by considering an optimal hypothesis test between appropriate distributions. We define $\beta_{\alpha}(P,Q)$ to be the minimum probability of error under $Q$ over all statistical tests $P_{Z|W}$ between distributions $P$ and $Q$ that choose $P$ with probability at least $\alpha$ when $P$ is correct. Formally:

[TABLE]

The classical Neyman-Pearson lemma shows that the optimal test achieves

[TABLE]

where $dP\over dQ$ denotes the Radon-Nikodym derivative of $P$ with respect to $Q$ , and $\gamma$ is chosen to satisfy

[TABLE]
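For distributions on a finite alphabet, the Neyman-Pearson construction can be carried out directly. The following is a minimal sketch, not the paper's procedure: the function name `beta_alpha` and the list representation of $P$ and $Q$ are hypothetical, and the alphabet is assumed finite. Symbols are included in the acceptance region in order of decreasing likelihood ratio, and the test randomizes on the boundary symbol so that the probability of choosing $P$ under $P$ equals exactly $\alpha$ .

```python
import math

def beta_alpha(alpha, P, Q):
    """Minimum Q-probability of deciding 'P' over randomized tests whose
    P-probability of deciding 'P' is at least alpha.
    P and Q are lists of probabilities on the same finite alphabet."""
    # Sort symbols by decreasing likelihood ratio P/Q (symbols with Q = 0 first).
    order = sorted(
        range(len(P)),
        key=lambda i: (P[i] / Q[i]) if Q[i] > 0 else math.inf,
        reverse=True,
    )
    need, beta = alpha, 0.0
    for i in order:
        if need <= 0:
            break
        if P[i] <= 0:
            continue
        frac = min(1.0, need / P[i])  # randomize on the boundary symbol
        beta += frac * Q[i]
        need -= frac * P[i]
    return beta

# Illustrative three-symbol example.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
beta = beta_alpha(0.6, P, Q)
print(beta)
```

In this example the test accepts the first symbol always and the second with probability $1/3$ , so the printed value is approximately $0.3$ . The threshold $\gamma$ of the likelihood-ratio test is implicit: it is the likelihood ratio of the boundary symbol.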

We recall a simple bound on $\beta_{\alpha}$ following from the data-processing inequality (see [1, (154)-(156)] or, in different notation, [27, (10.21)]):

[TABLE]

A more precise bound [1, (102)] is

[TABLE]

We will also need to define the performance of composite hypothesis tests. To this end, let $F\subset\mathcal{X}$ and $P_{Y|X}:\mathcal{X}\to\mathcal{Y}$ be a random transformation. We define

[TABLE]

We can lower bound the error in a composite hypothesis test $\kappa_{\tau}$ by the error in an appropriately chosen binary hypothesis test as follows:

Lemma 14.

For any $P_{\tilde{X}}$ on $\mathcal{X}$ we have

[TABLE]

Proof.

Let $P_{Z|Y}$ be any test satisfying conditions in the definition (149). We have the chain

[TABLE]

where (151) is from Fubini and (152) from constraints on the test. Thus $P_{Z|Y}$ is also a test satisfying conditions in the definition of $\beta_{\tau P_{\tilde{X}}[F]}$ . Optimizing over the tests completes the proof. ∎

IV Achievability

In this section, we prove the achievability side of the coding theorem for the MIMO-BF channel. We will rely on the $\kappa\beta$ bound [1, Theorem 25], quoted here:

Theorem 15 ( $\kappa\beta$ bound).

Given a channel $P_{Y|X}$ with input alphabet $\mathcal{A}$ and output alphabet $\mathcal{B}$ , for any distribution $Q_{Y}$ on $\mathcal{B}$ , any non-empty set $F\subset\mathcal{A}$ , and $\epsilon,\tau$ such that $0<\tau<\epsilon<1/2$ , there exists an $(M,\epsilon)$ -max code satisfying

[TABLE]

The art of applying this theorem is in choosing $F$ and $Q_{Y}$ appropriately. The intuition in choosing these is as follows: although we know the distributions in the collection $\{P_{Y|X=x}\}_{x\in F}$ , we do not know which $x$ is actually true in the composite, so if $Q_{Y}$ is in the “center” of the collection, then the two hypotheses can be difficult to distinguish, making the numerator large. However, for a given $x$ , $P_{Y|X=x}$ vs $Q_{Y}$ may still be easy to distinguish, making the denominator small. The main principle for applying the $\kappa\beta$ -bound is thus: Choose $F$ and $Q_{Y}$ such that $P_{Y|X=x}$ vs $Q_{Y}$ is easy to distinguish for any given $x$ , yet the composite hypothesis $Y\sim\{P_{Y|X=x}\}_{x\in F}$ is hard to distinguish from a simple one $Y\sim Q_{Y}$ .

The main theorem of this section gives achievable rates for the MIMO-BF channel, as follows:

Theorem 16.

Fix an arbitrary caid $P_{X}$ on $\mathbb{R}^{n_{t}\times T}$ and let

[TABLE]

where $V_{1}(x)$ is introduced in Proposition 8. Then we have

[TABLE]

with $C(P)$ given by (6).

Proof.

Let $\tau>0$ be a small constant (it will be taken to zero at the end). We apply the $\kappa\beta$ bound (153) with auxiliary distribution $Q_{Y}=(P_{Y,H}^{*})^{n}$ , where $P_{Y,H}^{*}$ is the caod (27), and the set $F_{n}$ is to be specified shortly. Recall notation $D_{n}(x^{n})$ , $V_{n}(x^{n})$ and $B_{n}(x^{n})$ from (53), (56) and (143). For any $x^{n}$ such that $B_{n}(x^{n})\leq\tau\sqrt{n}$ , we have from [28, Lemma 14],

[TABLE]

where $K^{\prime}$ is a constant that only depends on channel parameters. We mention that obtaining (156) from [28, Lemma 14] also requires that $V_{n}(x^{n})$ be bounded away from zero by a constant, which holds since in the expression for $V_{n}(x^{n})$ in Proposition 8, the term (58) is strictly positive, term (59) will vanish, and terms (60) and (61) are both non-negative.

Considering (156), our choice of the set $F_{n}$ should not be surprising:

[TABLE]

where $\delta=\delta(\tau)>0$ is chosen so that Lemma 13 implies $B_{n}(x^{n})\leq\tau\sqrt{n}$ for any $x^{n}\in F_{n}$ . Under this choice from (156), (54) and Lemma 13 we conclude

[TABLE]

where $K^{\prime\prime}=K^{\prime}+\log{1\over\tau}$ .

To lower bound the numerator $\kappa_{\tau}(F_{n},P_{Y,H}^{*n})$ we first state two auxiliary lemmas, whose proofs follow. The first, Lemma 17, shows that the output distribution induced by an input distribution that is uniform on the sphere is “similar” (in the sense of divergence) to the $n$ -fold product of the caod.

Lemma 17.

Fix an arbitrary caid $P_{X}$ and let $X^{n}$ have i.i.d. components $\sim P_{X}$ . Let

[TABLE]

where $\|X^{n}\|_{F}=\sqrt{\sum_{j=1}^{n}\|X_{j}\|_{F}^{2}}$ . Then

[TABLE]

where $P_{Y,H}^{*n}$ is the $n$ -fold product of the caod (27).

The second, Lemma 18, shows that a uniform distribution on the sphere has nearly all of its mass in $F_{n}$ as $n\to\infty$ .

Lemma 18.

With $\tilde{X}^{n}$ as in Lemma 17 and set $F_{n}$ defined as in (157) (with arbitrary $\tau>0$ and $\delta>0$ ) we have as $n\to\infty$ ,

[TABLE]

Denote the right-hand side of (160) by $K_{1}$ and consider the following chain:

[TABLE]

where (161) follows from Lemma 14 and (147) with $P_{\tilde{X}^{n}}$ as in Lemma 17, (162) is from Lemma 17, (163) is from Lemma 18, and in (164) we introduced a $\tau$ -dependent constant $K_{2}$ .

Putting (158) and (164) into the $\kappa\beta$ -bound we obtain

[TABLE]

Taking $n\to\infty$ and then $\tau\to 0$ completes the proof. ∎

Now we prove the two lemmas used in the Theorem.

Proof of Lemma 17.

In the case of no-fading ( $H_{j}=1$ ) and SISO, this Lemma follows from [29, Proposition 2]. Here we prove the general case. Let us introduce an auxiliary channel acting on $X_{j}$ as follows:

[TABLE]

With this notation, consider the following chain:

[TABLE]

where (166) is clear from (165), (167) follows since $P_{X}$ is a caid, (168)-(169) are standard identities for divergence, (170) follows since both $\tilde{Y}_{j}$ and $Y_{j}$ are unit-variance Gaussians and $D(\mathcal{N}(0,1)\|\mathcal{N}(a,1))={a^{2}\log e\over 2}$ , (171) is from Lemma 7 (see Remark 6) and (172) is just algebra along with the assumption that $\mathbb{E}\,[\|X^{n}\|_{F}^{2}]=nTP$ .

It remains to lower bound the expectation $\mathbb{E}\,[\|X^{n}\|_{F}]$ . Notice that for any uncorrelated random variables $B_{t}\geq 0$ with mean 1 and variance 2 we have

[TABLE]

which follows from $\sqrt{x}\geq{3x-x^{2}\over 2}$ for all $x\geq 0$ and simple computations. Next consider the chain:

[TABLE]

where in (176) we used the fact that for any caid, $\{(X_{t})_{i,j},t=1,\ldots,n\}\sim\mathcal{N}(0,P/n_{t})$ i.i.d. (from Theorem 6) and applied (173) with $B_{t}={(X_{t})_{i,j}^{2}n_{t}\over P}$ . Putting together (172) and (176) completes the proof. ∎
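The "simple computations" behind the bound for uncorrelated $B_{t}\geq 0$ with mean 1 and variance 2 can be spelled out as follows. This is a sketch: the final form $1-\frac{1}{n}$ is what the stated inequality $\sqrt{x}\geq{3x-x^{2}\over 2}$ yields, and is our reading of (173) rather than a quotation of it.

```latex
% Pointwise inequality: with s = \sqrt{x} \ge 0,
\sqrt{x} - \frac{3x - x^{2}}{2} = \frac{s\,(s-1)^{2}(s+2)}{2} \;\ge\; 0 .
% Apply it to S = \frac{1}{n}\sum_{t=1}^{n} B_t, where the B_t are uncorrelated
% with \mathbb{E}[B_t] = 1 and \mathrm{Var}(B_t) = 2:
\mathbb{E}[S] = 1, \qquad
\mathbb{E}[S^{2}] = 1 + \mathrm{Var}(S) = 1 + \frac{2}{n},
% Taking expectations in the pointwise inequality:
\mathbb{E}\bigl[\sqrt{S}\bigr] \;\ge\; \frac{3\,\mathbb{E}[S] - \mathbb{E}[S^{2}]}{2}
  = \frac{3 - 1 - 2/n}{2} = 1 - \frac{1}{n} .
```

Uncorrelatedness is used only to get $\mathrm{Var}(S)=\frac{2}{n}$ ; no independence is needed.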

Proof of Lemma 18.

Note that since $\|X^{n}\|_{F}^{2}$ is a sum of i.i.d. random variables, we have ${\|X^{n}\|_{F}\over\sqrt{nTP}}\to 1$ almost surely. In addition we have

[TABLE]

where we used the fact (Theorem 6) that $X_{1}$ ’s entries are Gaussian. Then we have from independence of $X_{j}$ ’s and Chebyshev’s inequality,

[TABLE]

as $n\to\infty$ . Consequently,

[TABLE]

as $n\to\infty$ .

Next we analyze the behavior of $V_{n}(\tilde{X}^{n})$ . From Proposition 8 we see that, due to $\|\tilde{X}^{n}\|_{F}^{2}=nTP$ , the term (59) vanishes, while (60) simplifies. Overall, we have

[TABLE]

where we replaced the terms that do not depend on $x^{n}$ with $K$ . Note that the first term in parentheses (premultiplying the sum) converges almost-surely to 1, by the strong law of large numbers. Similarly, the normalized sum converges to the expectation (also by the strong law of large numbers). Overall, applying the SLLN in the limit as $n\to\infty$ , we obtain:

[TABLE]

In particular, $\mathbb{P}[V_{n}(\tilde{X}^{n})\leq V^{\prime}+\tau]\to 1$ . This concludes the proof of $\mathbb{P}[\tilde{X}^{n}\in F_{n}]\to 1$ . ∎
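The two norm statements in this proof can be illustrated by simulation. The sketch below assumes that $\tilde{X}^{n}$ is obtained by rescaling $X^{n}$ onto the power sphere, i.e. $\tilde{X}^{n}=X^{n}\sqrt{nTP}/\|X^{n}\|_{F}$ (consistent with $\|\tilde{X}^{n}\|_{F}^{2}=nTP$ used above, but the definition itself is in an elided display), and uses the i.i.d. Gaussian caid. The parameter values and function names are hypothetical.

```python
import math
import random

def sample_block_sq_norm(n_t, T, P, rng):
    """||X_j||_F^2 for one n_t x T block with i.i.d. N(0, P/n_t) entries."""
    s = math.sqrt(P / n_t)
    return sum(rng.gauss(0.0, s) ** 2 for _ in range(n_t * T))

def norm_statistics(n, n_t=2, T=2, P=1.0, seed=3):
    """Return (||X^n||_F / sqrt(nTP), max_j ||X~_j||_F / n^(1/4))."""
    rng = random.Random(seed)
    sq = [sample_block_sq_norm(n_t, T, P, rng) for _ in range(n)]
    ratio = math.sqrt(sum(sq) / (n * T * P))
    # Assumed projection onto the power sphere: each block is divided by `ratio`.
    max_block = math.sqrt(max(sq)) / ratio
    return ratio, max_block / n ** 0.25

for n in (100, 1000, 10000):
    print(n, norm_statistics(n))
```

As $n$ grows, the first entry should approach 1 and the second should decay slowly, which is the behavior needed for $\max_{j}\|\tilde{X}_{j}\|_{F}\leq\delta n^{1\over 4}$ to hold with high probability for any fixed $\delta>0$ .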

V Converse

Here we state and prove the converse part of Theorem 1. There are two challenges in proving the converse relative to other finite blocklength proofs. First, the behavior of the information density (see Section III-C) varies widely as $x^{n}$ varies over the power-sphere

[TABLE]

Indeed, when $\max_{j}\|x_{j}\|_{F}\geq cn^{1\over 4}$ the distribution of information density ceases to be Gaussian. In contrast, the information density for the AWGN channel is constant over $S_{n}$ .

Second, assuming asymptotic normality, we have for any $x^{n}\in S_{n}$ :

[TABLE]

However, the problem is that $V_{n}(x^{n})$ is also non-constant. In fact there exist regions of $S_{n}$ where $V_{n}(x^{n})$ is abnormally small. Thus we need to also show that no capacity-achieving codebook can live on those abnormal sets.

The main theorem of the section is the following:

Theorem 19.

For any $\delta_{n}\to 0$ there exists $\delta^{\prime}_{n}\to 0$ such that any $(n,M,\epsilon)$ -max code with $\epsilon<1/2$ and codewords satisfying $\max_{1\leq j\leq n}\|x_{j}\|_{F}\leq\delta_{n}n^{1\over 4}$ has size bounded by

[TABLE]

where $C(P)$ and $V(P)$ are defined in (7) and (8), respectively.

Proof.

As usual, without loss of generality we may assume that all codewords belong to $S_{n}$ as defined in (180), see [1, Lemma 39]. The maximal probability of error code size is bounded by a meta-converse theorem [1, Theorem 31], which states that for any $(n,M,\epsilon)$ code and distribution $Q_{Y^{n}H^{n}}$ on the output space of the channel,

[TABLE]

where the infimum is taken over all codewords. The main problem is to select $Q_{Y^{n}H^{n}}$ appropriately. We do this separately for the two subcodes defined as follows. Fix an arbitrary $\delta>0$ (it will be taken to 0 at the end) and introduce:

[TABLE]

To bound the cardinality of $\mathcal{C}_{u}$ , we select $Q_{Y^{n}H^{n}}=(P_{Y,H}^{*})^{n}$ to be the $n$ -product of the caod (27), then apply the following estimate from [28, Lemma 14], quoted here: for any $\Delta>0$ we have

[TABLE]

where $D_{n}$ , $V_{n}$ and $B_{n}$ are given by (54), (56) and (143), respectively. We choose $\Delta=n^{1\over 4}$ and then from Lemma 13 (which relies on the assumption that $\|x_{j}\|_{F}\leq\delta n^{\frac{1}{4}}$ ) we get that for some constants $K_{1},K_{2}$ we have for all $x^{n}\in\mathcal{C}_{u}$ :

[TABLE]

From (182) and (185) we therefore obtain

[TABLE]

where $\delta^{\prime\prime}_{n}={K_{1}\delta_{n}^{2}+K_{2}n^{-{1\over 4}}}\to 0$ as $n\to\infty$ .

Next we proceed to bounding $|\mathcal{C}_{l}|$ . To that end, we first state two lemmas. Lemma 20 shows that, if in addition to the power constraint $\mathbb{E}[\|X\|_{F}^{2}]\leq TP$ , we also required $\mathbb{E}[V_{1}(X)]\leq V(P)-\delta$ , then the capacity of this variance-constrained channel is strictly less than without the latter constraint.

Lemma 20.

Consider the following constrained capacity:

[TABLE]

where $V(P)$ is from (8) and $V_{1}(x)$ is from (57). For any $\delta>0$ there exists $\tau=\tau(P,\delta)>0$ such that $\tilde{C}(P,\delta)<C(P)-\tau$ .

Remark 8.

Curiously, if we used constraint $\mathbb{E}\,[V_{1}(X)]>V(P)+\delta$ instead of $\mathbb{E}[V_{1}(X)]\leq V(P)-\delta$ in (187), then the resulting capacity equals $C(P)$ regardless of $\delta$ .

The following Lemma shows that, with the appropriate choice of an auxiliary distribution $Q_{Y^{n},H^{n}}$ , the expected size of the normalized log likelihood ratio is strictly smaller than capacity, while the variance of that same ratio is upper bounded by a constant (i.e. does not scale with $n$ ).

Lemma 21.

Define the auxiliary distribution

[TABLE]

where $A>1$ is a constant, $P^{*}_{Y|H}(y|h)$ is the caod for the MIMO-BF channel, and $\tilde{P}^{*}_{Y|H}(y|h)$ is the caod for the variance-constrained channel in (187). Let $Q_{Y,H}=P_{H}Q_{Y|H}$ , and $Q_{Y^{n},H^{n}}=\prod_{i=1}^{n}Q_{Y,H}$ . Then there exist constants $\tau,K>0$ such that for all $x^{n}\in\mathcal{C}_{l}$ ,

[TABLE]

where $Y_{i}=H_{i}x_{i}+Z_{i}$ , $i=1,\ldots,n$ is the joint distribution.

Remark 9.

The reason we let $Q_{Y|H}$ take on two distributions depending on the value of $H$ is that we do not know the form of $\tilde{P}^{*}_{Y|H}$ , hence we do not explicitly know how it depends on $H$ . This choice of $Q_{Y|H}$ ensures that expectations involving $\tilde{P}^{*}_{Y|H}$ are finite.

Choose $Q_{Y,H}$ as in Lemma 21, so that the bounds on $C_{n}$ , $V_{n}$ from (189), (190) respectively, hold. Applying [28, Lemma 15] with $\alpha=1-\epsilon$ (the statement of this lemma is the contents of (191)), we obtain

[TABLE]

Therefore, from (182) we conclude that for all $n\geq n_{0}(\delta)$ we have

[TABLE]

Overall, from (186) and (193) we get (due to arbitrariness of $\delta$ ) the statement (181). ∎

Proof of Lemma 20.

Introduce the following set of distributions:

[TABLE]

By Prokhorov’s criterion (e.g. [30, Theorem 5.1], tightness implies relative compactness), the norm constraint implies that this set is relatively compact in the topology of weak convergence. So there must exist a sequence of distributions $\tilde{P}_{n}\in\mathcal{P}^{\prime}$ s.t. $\tilde{P}_{n}\stackrel{{\scriptstyle w}}{{\to}}\tilde{P}$ and $I(\tilde{X}_{n};H\tilde{X}_{n}+Z|H)\to\tilde{C}(P,\delta)$ where $\tilde{X}_{n}\sim\tilde{P}_{n}$ . By Skorokhod representation [30, Theorem 6.7], we may assume $\tilde{X}_{n}\stackrel{{\scriptstyle a.s.}}{{\to}}\tilde{X}\sim\tilde{P}$ , i.e. there exists random variable $\tilde{X}$ that is the pointwise limit of the $\tilde{X}_{n}$ ’s. Notice that for any continuous bounded function $f(h,y)$ we have

[TABLE]

and therefore $P_{\tilde{Y}_{n},H}\stackrel{{\scriptstyle w}}{{\to}}P_{\tilde{Y},H}$ . Assume (to arrive at a contradiction) that $\tilde{C}(P,\delta)=C(P)$ , then by the golden formula, cf. [25, Theorem 3.3], we have

[TABLE]

where $D_{1}(x)$ is from (54). Therefore, we have

[TABLE]

From weak lower-semicontinuity of divergence [25, Theorem 3.6] we have $D(P_{\tilde{Y},H}\|P_{Y,H}^{*})=0$ . In particular, if we denote $X^{*}$ to have Telatar distribution (26), we must have

[TABLE]

From Lemma 7 (see Remark 6) we have

[TABLE]

and hence from the independence of $Z$ from $(H,X)$ we get

[TABLE]

and similarly for the right-hand side of (198). We conclude that

[TABLE]

Finally, plugging this fact into the expression for $D_{1}(x)$ in (54) and (196) we obtain

[TABLE]

That is, $\tilde{X}$ is a caid. But from Fatou’s lemma we have (recall that $V_{1}(x)\geq 0$ since it is a variance)

[TABLE]

where the last step follows from $\tilde{P}_{n}\in\mathcal{P}^{\prime}$ . A caid achieving conditional variance strictly less than $V(P)$ contradicts the definition of $V(P)$ , cf. (8), as the infimum of $\mathbb{E}\,[V_{1}(X)]$ over all caids. ∎

Proof of Lemma 21.

First we analyze $C_{n}$ from (189). Denote

[TABLE]

Here, $i(x;y,h)$ is the information density derived in Section III-C, while $\tilde{i}(x;y,h)$ instead has the caod for the variance-constrained channel (187) in the denominator. Since $Q_{Y|H}$ takes on one of two distributions based on the value of $H$ , conditioning on $H$ in two ways yields

[TABLE]

The $H_{j}$ ’s are i.i.d. according to $P_{H}$ , so we define $p\triangleq\mathbb{P}[\|H_{j}\|_{F}^{2}>A]$ . Using capacity saddle point, (203) is bounded by

[TABLE]

where $C(P_{H})$ denotes the capacity of the MIMO-BF channel with fading distribution $P_{H}$ , and $P_{H>A}$ denotes the distribution of $H$ conditioned on $\|H\|_{F}^{2}>A$ (similarly, $P_{H\leq A}$ will denote $H$ conditioned on $\|H\|_{F}^{2}\leq A$ ). Here (205) follows from the fact that the information density, i.e. $\log\frac{P_{Y|H,X}}{P^{*}_{Y|H}}(y|h,x)$ , is not a function of $P_{H}$ , hence changing the distribution $P_{H}$ does not affect the form of $i(x;y,h)$ . Similarly, using Lemma 20, (204) is bounded by

[TABLE]

where $\tau^{\prime}>0$ is a positive constant, and $\tilde{C}(P_{H})$ denotes the solution to the optimization problem (187) when the fading distribution is $P_{H}$ . Putting together (205) and (207), we obtain an upper bound on $C_{n}$ ,

[TABLE]

Note that $C(P_{H})=\mathbb{E}_{P_{H}}\left[\log\det(I_{n_{r}}+P/n_{t}HH^{T})\right]$ , so the capacity only depends on $P_{H}$ through the expectation – the expression inside is not a function of $P_{H}$ because the i.i.d. Gaussian caid achieves capacity for all isotropic $P_{H}$ ’s. Hence, by the law of total expectation, (208) simplifies to

[TABLE]

Finally, we can upper bound $p$ using Markov’s inequality as

[TABLE]

since $A>1$ . Applying this bound to (209), we obtain

[TABLE]

Defining $\tau\triangleq(1-1/A)\tau^{\prime}$ completes the proof of (189).

Next we analyze $V_{n}$ from (190). The strategy will be to decompose (190) into two terms depending on the value of $\|H\|_{F}^{2}$ , then show that each term is upper bounded by $A_{1}+A_{2}\sum_{j=1}^{n}\|x_{j}\|_{F}^{4}$ , where $A_{1},A_{2}$ are constants not depending on $x^{n}$ . Finally, we will show that $\sum_{j=1}^{n}\|x_{j}\|_{F}^{4}=O(n)$ when $x^{n}\in\mathcal{C}_{l}$ . To this end,

[TABLE]

where (214) follows from the independence of the terms, and (215) is from the bound $\mathrm{Var}(X)\leq\mathbb{E}[X^{2}]$ . Again we condition on $H$ in two ways,

[TABLE]

For the first term, (216), we know the expression for $i(x;y,h)$ from Section III-C, so we simply upper bound $i(x;y,h)^{2}$ . To this end,

[TABLE]

where $C_{1},C_{2}$ are non-negative constants, and $C_{3}(\tilde{z}_{j}),C_{4}(\tilde{z}_{j})$ are functions of only $\tilde{z}_{j}$ that have bounded moments. This follows from:

•

Bounding the first term via

[TABLE]

which can be derived from the basic inequality $\log(1+x)\leq\log(e)\sqrt{x}$ .

•

Noting that the second term is bounded in $h$ , since for all $\lambda\in\mathbb{R}$ ,

[TABLE]

•

Noting that all moments of $\|\tilde{z}_{j}\|^{2}$ are finite because this is the norm of a standard normal vector.

Therefore, after taking the expectation of (219) and summing over all $n$ , we obtain

[TABLE]

for some non-negative constants $C_{5},C_{6}$ .

To bound the second term, (217), first we split the logarithm as

[TABLE]

The first term in (225) is simple to handle, since its expression is given by the definition of the channel,

[TABLE]

i.e. we have a constant upper bound. For the second term in (225), notice that $\tilde{P}^{*}_{Y,H}$ is inducible through the channel, i.e. there exists an input distribution $P_{X}$ such that $\tilde{P}^{*}_{Y,H}(y,h)=\mathbb{E}[P_{Y,H|X}(y,h|X)]$ . Using this fact, we obtain the bound

[TABLE]

where (230) follows from Jensen’s inequality, (231) is from the definition of the channel, and (232) follows from applying the inequality $\|A+B\|_{F}^{2}\leq 2\|A\|_{F}^{2}+2\|B\|_{F}^{2}$ along with $\|hX\|_{F}^{2}\leq\|h\|_{F}^{2}\|X\|_{F}^{2}$ , then noting that $X$ satisfies $\mathbb{E}[\|X\|_{F}^{2}]=TP$ . Using this, we can bound the second term in (225) via

[TABLE]

where $K_{2},K_{3}$ are non-negative constants which do not depend on $x$ , (234) is from the above bound (232), and (236) follows from applying the bound

[TABLE]

Putting together (236) and (228), we obtain an upper bound on (217),

[TABLE]

Now, since $x^{n}\in\mathcal{C}_{l}$ by assumption, we can control the quantity $\sum_{i=1}^{n}\|x_{i}\|_{F}^{4}$ via

[TABLE]

where the first inequality follows from the non-negativity of the terms in $V_{1}(x)$ given in Proposition 8, and the second inequality is from the definition of $\mathcal{C}_{l}$ . Hence the sum of fourth powers of the $\|x_{i}\|_{F}$ ’s is $O(n)$ on $\mathcal{C}_{l}$ . All together, combining (240) and (223) yields the following bound on $V_{n}$ ,

[TABLE]

which completes the proof of (190). ∎

VI The rank 1 case

When $H$ is rank 1, for example in the MISO case, i.e. $n_{t}>n_{r}=1$ , the MIMO-BF channel has multiple input distributions that achieve capacity, as shown in Theorem 6. Theorem 1 proved that the dispersion in the general MIMO-BF channel is given by (8), where we minimize the conditional variance of the information density over the set of caids. In this section, we analyze those minimizers for the rank 1 case, which turns out to be non-trivial.

From Theorem 3, when $H$ is rank 1, the conditional variance takes the form

[TABLE]

where $K_{1},K_{2}>0$ are constants that depend on the channel parameters but not the input distribution. From (18), computing $v^{*}(n_{t},T)$ requires us to maximize the variance of the squared Frobenius norm of the input distribution over the set of caids. Intuitively, this says that minimizing the dispersion is equivalent to maximizing the amount of correlation amongst the entries of $X$ when $X$ is jointly Gaussian. In a sense, this asks for the capacity achieving input distribution having the least amount of randomness.

Here we characterize $v^{*}(n_{t},T)$ . The manifold of caids is not easy to optimize over, since one must account for all the independence constraints on the rows and columns, the covariance constraints on the $2\times 2$ minors, positive definite constraints, etc. as described in Theorem 6. Our strategy instead will be to give an upper bound on $v^{*}(n_{t},T)$ , then show that for certain pairs $(n_{t},T)$ , the upper bound is tight. Before stating the main theorem of the section, we review orthogonal designs, which will play a large role in the solution to this problem.

VI-A Orthogonal designs

Definition 1 (Orthogonal Design).

A real $n\times n$ orthogonal design of size $k$ is defined to be an $n\times n$ matrix $A$ with entries given by linear forms in $x_{1},\ldots,x_{k}$ and coefficients in $\mathbb{R}$ satisfying

$$A^{T}A=\Big(\sum_{i=1}^{k}x_{i}^{2}\Big)I_{n}.\tag{246}$$

In other words, all columns of $A$ have squared Euclidean norm $\sum_{i=1}^{k}x_{i}^{2}$ , and all columns are pairwise orthogonal. A common representation for an orthogonal design is the sum $A=\sum_{i=1}^{k}x_{i}V_{i}$ where $\{V_{1},\ldots,V_{k}\}$ is a collection of $n\times n$ real matrices satisfying Hurwitz-Radon conditions (19)-(20). Such a collection is called a Hurwitz-Radon family. Theorem 4 shows that the maximal cardinality of a Hurwitz-Radon family is the Hurwitz-Radon number $\rho(n)$ , cf. (21).
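As a concrete illustration of Definition 1, the following minimal sketch checks the defining property (all columns have squared norm $\sum_{i}x_{i}^{2}$ and are pairwise orthogonal) numerically for the $2\times 2$ Alamouti-type design and for one classical real $4\times 4$ design. The particular sign pattern and the helper names (`design_2x2`, `design_4x4`, `is_orthogonal_design`) are assumptions made for this example and may differ from the designs displayed in the paper. A check at one point is a sanity test, not a proof of the polynomial identity.

```python
import random

def design_2x2(x):
    """Real 2x2 (Alamouti-type) orthogonal design in x = (x1, x2)."""
    x1, x2 = x
    return [[x1, x2],
            [-x2, x1]]

def design_4x4(x):
    """One classical real 4x4 orthogonal design in x = (x1, x2, x3, x4)."""
    x1, x2, x3, x4 = x
    return [[x1, x2, x3, x4],
            [-x2, x1, -x4, x3],
            [-x3, x4, x1, -x2],
            [-x4, -x3, x2, x1]]

def gram(a):
    """Return A^T A for a matrix stored as a list of rows."""
    rows, cols = len(a), len(a[0])
    return [[sum(a[r][i] * a[r][j] for r in range(rows)) for j in range(cols)]
            for i in range(cols)]

def is_orthogonal_design(a, x, tol=1e-9):
    """Check A^T A == (sum_i x_i^2) * I up to a numerical tolerance."""
    s = sum(v * v for v in x)
    g = gram(a)
    n = len(g)
    return all(abs(g[i][j] - (s if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))

random.seed(0)
x2 = [random.gauss(0.0, 1.0) for _ in range(2)]
x4 = [random.gauss(0.0, 1.0) for _ in range(4)]
print(is_orthogonal_design(design_2x2(x2), x2))
print(is_orthogonal_design(design_4x4(x4), x4))
```

A generic matrix with entries linear in the indeterminates fails this test, which is why such designs only exist in the dimensions singled out by the Hurwitz-Radon theorem.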

The definition of orthogonal designs can be generalized to rectangular matrices [9], as follows:

Definition 2 (Generalized Orthogonal Design).

A generalized orthogonal design is a $p\times n$ matrix $A$ with $p\geq n$ with entries as linear forms of the indeterminates $\{x_{1},\ldots,x_{k}\}$ satisfying (246).

The quantity $R=k/p$ is often called the rate of the generalized orthogonal design. This term is justified by noticing that if $p$ represents the number of channel uses and $k$ represents the number of data symbols, then $R$ represents sending $k$ data symbols in $p$ channel uses. In this work, we are only interested in the case $R=1$ (i.e. $k=p$ ), called full-rate orthogonal designs. A full-rate orthogonal design can be constructed from a Hurwitz-Radon family $\{V_{1},\ldots,V_{n}\}$ , with each $V_{i}\in\mathbb{R}^{k\times k}$ , by forming the matrix $A$

[TABLE]

where $x=[x_{1},\ldots,x_{k}]^{T}$ is the vector of indeterminates. It follows immediately from this construction that (246) is satisfied. Theorem 4 allows us to conclude that a generalized full rate $n\times k$ orthogonal design exists if and only if $n\leq\rho(k)$ .
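The existence criterion $n\leq\rho(k)$ is easy to evaluate. The sketch below uses the standard closed form of the Hurwitz-Radon number, $\rho(n)=8c+2^{d}$ for $n=2^{a}b$ with $b$ odd and $a=4c+d$ , $0\leq d<4$ ; it is assumed here that (21) states the same formula. The function names are hypothetical.

```python
def hurwitz_radon_number(n):
    """rho(n) = 8c + 2^d, where n = 2^a * b, b odd, a = 4c + d, 0 <= d < 4."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    a = 0
    while n % 2 == 0:  # strip factors of two to find the exponent a
        n //= 2
        a += 1
    c, d = divmod(a, 4)
    return 8 * c + 2 ** d

def full_rate_design_exists(n, k):
    """Criterion from the text: a full-rate n x k design exists iff n <= rho(k)."""
    return n <= hurwitz_radon_number(k)

for k in (1, 2, 3, 4, 8, 16):
    print(k, hurwitz_radon_number(k))
```

Note that $\rho(k)=k$ only for $k\in\{1,2,4,8\}$ , so square full-rate real designs exist only in those sizes.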

The following proposition shows that full rate orthogonal designs correspond to caids in the MIMO-BF channel.

Proposition 22.

Take $n_{t}=\rho(T)$ and a maximal Hurwitz-Radon family $\{V_{i},i=1,\ldots,n_{t}\}$ of $T\times T$ matrices (cf. Theorem 4). Let $\xi\sim\mathcal{N}(0,P/n_{t}I_{T})$ be an i.i.d. row-vector. Then the input distribution

[TABLE]

achieves capacity for any MIMO-BF channel provided $\mathbb{P}[\mathop{\rm rank}H\leq 1]=1$ .

Proof.

Since $\{V_{1},\ldots,V_{n_{t}}\}$ is a Hurwitz-Radon family, they satisfy (19)-(20). Form $X$ as in (248). Then each row and column is jointly Gaussian, and applying the caid conditions (31) and (32) from Theorem 6 shows,

[TABLE]

Therefore $X$ satisfies the caid conditions, and hence achieves capacity. ∎
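To make the representation $R_{j}=\xi V_{j}$ concrete, the following sketch extracts the matrices $V_{j}$ from one classical $4\times 4$ design and checks them against the usual form of the Hurwitz-Radon conditions, $V_{i}^{T}V_{i}=I$ and $V_{i}^{T}V_{j}+V_{j}^{T}V_{i}=0$ for $i\neq j$ . It is an assumption of this example that (19)-(20) are stated in this form and that the chosen sign pattern is an admissible one; all helper names are hypothetical.

```python
def design_rows(xi):
    """Rows of one classical 4x4 orthogonal design evaluated at xi."""
    x1, x2, x3, x4 = xi
    return [[x1, x2, x3, x4],
            [-x2, x1, -x4, x3],
            [-x3, x4, x1, -x2],
            [-x4, -x3, x2, x1]]

def hurwitz_radon_family():
    """Matrices V_j with (row j of the design) = xi V_j, xi a row vector."""
    basis = [[1 if c == m else 0 for c in range(4)] for m in range(4)]
    # Row m of V_j is row j of the design evaluated at the m-th basis vector.
    return [[design_rows(e)[j] for e in basis] for j in range(4)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def satisfies_hurwitz_radon(family):
    """Check V_i^T V_i = I and V_i^T V_j + V_j^T V_i = 0 for all i != j."""
    n = len(family[0])
    identity = [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    for i, vi in enumerate(family):
        if matmul(transpose(vi), vi) != identity:
            return False
        for vj in family[i + 1:]:
            left = matmul(transpose(vi), vj)
            right = matmul(transpose(vj), vi)
            if any(left[r][c] + right[r][c] != 0
                   for r in range(n) for c in range(n)):
                return False
    return True

print(satisfies_hurwitz_radon(hurwitz_radon_family()))
```

Because the entries are integers, the check is exact rather than numerical.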

Remark 10.

The above argument implies that if $X\in\mathbb{R}^{n_{t}\times T}$ is constructed as above, then removing the last row of $X$ gives an $(n_{t}-1)\times T$ input distribution that also achieves capacity.

VI-B Proof of Theorem 5

Theorem 5 states that for dimensions where orthogonal designs exist, the conditional variance (8) is minimized if and only if the input is constructed from an orthogonal design as in Proposition 22. The approach is first to prove an upper bound on $v^{*}$ , then show that conditions for tightness of the upper bound correspond to conditions of the Hurwitz-Radon theorem.

We start with a simple lemma, which will be applied with $A,B$ equal to the rows of the capacity achieving input $X$ .

Lemma 23.

Let $A=(A_{1},\ldots,A_{n})$ and $B=(B_{1},\ldots,B_{n})$ each be i.i.d. random vectors from the same distribution with finite second moment $\mathbb{E}[A_{1}^{2}]=\sigma^{2}<\infty$ . While $A$ and $B$ are i.i.d. individually, they may have arbitrary correlation between them. Then

[TABLE]

with equality iff $\sum_{i=1}^{n}A_{i}=\sum_{i=1}^{n}B_{i}$ almost surely.

Proof.

Simply use the fact that covariance is a bilinear function, and apply the Cauchy-Schwarz inequality as follows:

[TABLE]

We have equality in Cauchy-Schwarz when $\sum_{i=1}^{n}A_{i}$ and $\sum_{i=1}^{n}B_{i}$ are proportional, and since these sums have the same distribution, the constant of proportionality must be equal to 1, so we have equality in (251) iff $\sum_{i=1}^{n}A_{i}=\sum_{i=1}^{n}B_{i}$ almost surely. ∎

Proof of Theorem 5.

First, we rewrite $v^{*}(n_{t},T)$ defined in (18) as

[TABLE]

From here, $v^{*}(n_{t},T)=v^{*}(T,n_{t})$ follows from the symmetry to transposition of the caid-conditions on $X$ (see Theorem 6) and symmetry to transposition of (256). From now on, without loss of generality we assume $n_{t}\leq T$ .

For the upper bound, since the rows and columns of $X$ are i.i.d., we can apply Lemma 23 with $A_{i}=X_{i,k}^{2}$ and $B_{j}=X_{j,l}^{2}$ (and hence $\sigma^{2}=2(P/n_{t})^{2}$ ) to get

[TABLE]

which together with (256) yields the upper bound (22) (recall that $n_{t}\leq T$ ).

Equation (257) implies that if $X$ achieves the bound (22), then removing the last row of $X$ achieves (22) as an $(n_{t}-1)\times T$ design. In other words, if (22) is tight for $n_{t}\times T$ then it is tight for all $n_{t}^{\prime}\leq n_{t}$ .

Notice that for any $X$ such that any pair $X_{i,k}$ , $X_{j,l}$ is jointly Gaussian, we have

[TABLE]

where

[TABLE]

Take $X\in\mathbb{R}^{n_{t}\times T}$ as constructed in (248). By Proposition 22, $X$ is capacity achieving and identity (258) clearly holds. In the representation (248), the matrix $V_{j}^{T}V_{i}$ contains the correlation coefficients between rows $i$ and $j$ of $X$ , since $\mathbb{E}[(\xi V_{j})^{T}(\xi V_{i})]=\frac{P}{n_{t}}V_{j}^{T}V_{i}$ , so

[TABLE]

Therefore we can represent the sum of squared correlation coefficients as

[TABLE]

Line (264) follows since the $V_{i}$ ’s are orthogonal by the Hurwitz-Radon condition, so each $V_{i}V_{i}^{T}=I_{T}$ in the summation in (263). Hence the $X$ constructed in (248) achieves the upper bound in (257) and (22).

Next we prove (24). Suppose $X$ is a jointly-Gaussian caid saturating the bound (257). From Lemma 23, the condition for equality in (251) implies that for all $j\in\{1,\ldots,n_{t}\}$ ,

[TABLE]

where $R_{j}$ is the $j$ -th row of $X$ for $j=1,\ldots,n_{t}$ . In particular, this means that every $R_{j}$ is a linear function of $R_{1}$ . Consequently, we may represent $X$ in terms of a row-vector $\xi\sim\mathcal{N}(0,P/n_{t}I)$ as in (248), that is $R_{j}=\xi V_{j}$ for some $T\times T$ matrices $V_{j},j\in[n_{t}]$ . We clearly have

[TABLE]

But then the caid constraints (31)-(32) imply that the matrix $A$ in (247) constructed using indeterminates $\{x_{1},\ldots,x_{n_{t}}\}$ and family $\{V_{1},\ldots,V_{n_{t}}\}$ satisfies Definition 2. Therefore, from Theorem 4 (see also [31, Proposition 4]), we must have $n_{t}\leq\rho(T)$ . ∎

Remark 11.

In the case $n_{t}=T=2$ it is easy to show that for any non-jointly-Gaussian caid, there exists a jointly-Gaussian caid achieving the same $\mathrm{Var}(\|X\|_{F}^{2})$ . Indeed, consider (39) with $\rho={\mathop{\rm cov}(X_{1,1}^{2},X_{2,2}^{2})+\mathop{\rm cov}(X_{1,2}^{2},X_{2,1}^{2})\over 8(P/n_{t})^{2}}$ . If this phenomenon held in general, we would conclude that (23) holds if and only if $n_{t}\leq\rho(T)$ or $T\leq\rho(n_{t})$ . As a step towards the proof of the latter, we notice that any caid $X$ achieving equality in (257) satisfies

[TABLE]

which is equivalent to saying $R_{i}R_{j}^{\prime}=0$ for $i\neq j$ . The latter follows from applying (265) to rows of $UX$ , where $U$ is an arbitrary orthogonal matrix. Identity (266) could be informally stated as “any caid saturating (257) is a random full-rate orthogonal design”.

In summary, the full-rate orthogonal designs (when those exist) achieve the optimal channel dispersion $V(P)$ . Some examples ( $\xi_{j}$ are i.i.d. $\mathcal{N}(0,1)$ ) for $n_{t}=T=4$ and $n_{t}=4,T=3$ , respectively, are as follows:

[TABLE]

VI-C Beyond full-rate orthogonal designs

For pairs $(n_{t},T)$ where $n_{t}>\rho(T)$ , full-rate orthogonal designs do not exist. For example $\rho(3)=1$ , so no full-rate orthogonal design exists for $n_{t}=2$ , $T=3$ . Which caids are minimizers for (8) in this case? In general, we do not know the answer and do not even know whether one can restrict the search to jointly-Gaussian caids. But one thing is certain: it is definitely not an i.i.d. Gaussian (Telatar) caid. To show this claim, we will give a method for constructing improved caids.

To that end, suppose that $X$ consists of entries $\pm\xi_{j}$ , $j=1,\ldots,d$ , where $\xi_{j}\stackrel{i.i.d.}{\sim}\mathcal{N}(0,P/n_{t})$ . Then we have:

[TABLE]

where $\ell_{t}$ is the number of times $\pm\xi_{t}$ appears in the description of $X$ . By this observation and the remark after Theorem 6 (any submatrix of a caid $X$ is also a caid), we can obtain lower bounds on $v^{*}(n_{t},T)$ for $n_{t}>\rho(T)$ via the following truncation construction:

1. Take $T^{\prime}>T$ such that $\rho(T^{\prime})\geq n_{t}$ and let $X^{\prime}$ be a corresponding $\rho(T^{\prime})\times T^{\prime}$ full-rate orthogonal design with entries $\pm\xi_{1},\ldots,\pm\xi_{T^{\prime}}$ .

2. Choose an $n_{t}\times T$ submatrix of $X^{\prime}$ maximizing the sum of squares of the number of occurrences of each of $\xi_{j}$ , cf. (277).
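The two steps above can be sketched as a brute-force search. Since $\mathrm{Var}(\xi^{2})=2\sigma^{4}$ for $\xi\sim\mathcal{N}(0,\sigma^{2})$ and the $\xi_{t}$ are independent, an input with entries $\pm\xi_{t}$ has $\mathrm{Var}(\|X\|_{F}^{2})=2(P/n_{t})^{2}\sum_{t}\ell_{t}^{2}$ , so only the occurrence counts $\ell_{t}$ matter and signs can be ignored. The symbol layout below corresponds to one classical $4\times 4$ design and is an assumption of this example, as are the function names; the submatrices found need not coincide with those displayed in the paper.

```python
from collections import Counter
from itertools import combinations

# Index of the symbol xi_t at each position of a 4x4 orthogonal design
# (signs dropped, since they do not affect the occurrence counts).
SYMBOLS_4X4 = [
    [1, 2, 3, 4],
    [2, 1, 4, 3],
    [3, 4, 1, 2],
    [4, 3, 2, 1],
]

def sum_sq_occurrences(sub):
    """Sum over symbols of (number of occurrences)^2 in a submatrix."""
    counts = Counter(s for row in sub for s in row)
    return sum(c * c for c in counts.values())

def best_truncation(symbols, n_t, T):
    """Exhaustively pick the n_t x T submatrix maximizing sum_t l_t^2."""
    best = None
    for rows in combinations(range(len(symbols)), n_t):
        for cols in combinations(range(len(symbols[0])), T):
            sub = [[symbols[r][c] for c in cols] for r in rows]
            score = sum_sq_occurrences(sub)
            if best is None or score > best[0]:
                best = (score, sub)
    return best

score, sub = best_truncation(SYMBOLS_4X4, 2, 3)
print(score, sub)
```

For comparison, an i.i.d. $n_{t}\times T$ input uses every symbol once, giving $\sum_{t}\ell_{t}^{2}=n_{t}T$ , e.g. 6 for the $2\times 3$ case, against 10 for the best truncation of this $4\times 4$ layout.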

As an example of this method, by truncating a $4\times 4$ design (271) we obtain the following $2\times 3$ and $3\times 3$ submatrices:

[TABLE]

By independent methods we were able to show that designs (283) are dispersion-optimal out of all jointly Gaussian caids. Note that in these cases (23) does not hold, illustrating (24).

Our current knowledge about $v^{*}$ is summarized in Table I. The lower bounds for cases not handled by Theorem 5 were computed by truncating the $8\times 8$ orthogonal design [9, (5)]. Based on the evidence from $2\times T$ and $3\times 3$ we conjecture this construction to be optimal.

From the proof of Theorem 5 it is clear that Telatar’s i.i.d. Gaussian is never dispersion optimal, unless $n_{t}=1$ or $T=1$ . Indeed, for Telatar’s input $\rho_{ikjl}=0$ unless $(i,k)=(j,l)$ . Thus embedding even a single $2\times 2$ Alamouti block into an otherwise i.i.d. $n_{t}\times T$ matrix $X$ strictly improves the sum (256). We note that the value of ${V\over C^{2}}$ entering (2) can be quite sensitive to the suboptimal choice of the design. For example, for $n_{t}=T=8$ and $\mathrm{SNR}=20$ dB, estimate (2) shows that one needs

•

around 600 channel inputs (that is 600/8 blocks) for the optimal $8\times 8$ orthogonal design, or

•

around 850 channel inputs for Telatar’s i.i.d. Gaussian design

in order to achieve 90% of capacity. This translates into a 40% longer delay or battery spent in running the decoder.
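The arithmetic behind this comparison can be sketched as follows. Estimate (2) is not reproduced in this section; the code assumes it has the usual normal-approximation form $n\gtrsim\left({Q^{-1}(\epsilon)\over 1-\eta}\right)^{2}{V\over C^{2}}$ , where $\eta$ is the target fraction of capacity and $\epsilon$ the error probability. The function names and the value of $\epsilon$ in the usage line are illustrative assumptions, not values taken from the paper.

```python
from statistics import NormalDist

def q_inverse(eps):
    """Inverse of the Gaussian tail function Q(x) = P[N(0,1) > x]."""
    return NormalDist().inv_cdf(1.0 - eps)

def blocklength_estimate(v_over_c2, eps, eta):
    """Assumed form of estimate (2): n ~ (Q^{-1}(eps) / (1 - eta))^2 * V / C^2."""
    return (q_inverse(eps) / (1.0 - eta)) ** 2 * v_over_c2

def relative_delay(n_reference, n_other):
    """Fractional increase in blocklength of n_other over n_reference."""
    return n_other / n_reference - 1.0

# Blocklengths quoted in the text for the 8x8 design and the i.i.d. input.
print(round(100 * relative_delay(600, 850)))  # 42, i.e. roughly 40% longer
# Scale factor multiplying V/C^2 for an illustrative eps = 1e-3, eta = 0.9.
print(blocklength_estimate(1.0, 1e-3, 0.9))
```

Under this form of (2), for fixed $\epsilon$ and $\eta$ the ratio of required blocklengths equals the ratio of the normalized dispersions $V/C^{2}$ of the two designs.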

Thus, curiously even in cases where pure multiplexing (that is maximizing transmission rate) is needed – as is often the case in modern cellular networks – transmit diversity enters the picture by enhancing the finite blocklength fundamental limits. Remember, however, that our discussion pertains only to cases when the transmitter (base-station) is equipped with more antennas than the receiver (user equipment), or when the channel does not have more than one diversity branch.

In cases when full-rate designs do not exist, there have been various suggestions as to what could be the best solution, e.g. [31]. Thus for non full-rate designs the property of minimizing dispersion (such as (283)) could be used for selecting the best design for cases $n_{t}>\rho(T)$ .

VII Discussion

Figure 1 plots the capacity, normal approximation, and $\beta\beta$ achievability bound for the MIMO channel with $n_{t}=n_{r}=T=4$ for the complex case. The details of this computation are given in [19]. The $\beta\beta$ bound was developed by Yang et al. [19] and is often more computationally friendly than the $\kappa\beta$ bound. This figure illustrates the gap between achievability and the normal approximation, as well as the gap to capacity. For example, at blocklength 400, we can achieve about 88% of capacity, and at blocklength 1000 we can achieve about 92% of capacity, given $P=0$ dB and tolerating an error probability of $10^{-3}$ .

Figure 2 shows the dependence of the rate on the coherence time $T$ for the $4\times 4$ MIMO channel. The normal approximation for $T=1,20,80$ is plotted. From (6) and (12), we know the capacity does not depend on $T$ , but the dispersion depends on $T$ in an affine relationship. Hence, from the dispersion we see that a larger coherence time reduces the maximum transmission rate when the other channel parameters are held fixed. Intuitively, when the coherence time is lower, we are able to average over independent realizations of the fading coefficients in fewer channel uses. Note that the CSIR assumption implies that we know the channel coefficient perfectly, which may be unrealistic at short coherence times for a practical channel.

We now ask: how does the dispersion depend on the number of transmit and receive antennas? Figures 3 and 4 depict the normalized dispersion $V/C^{2}$ , cf. (2), as a function of the number of antennas. The fading process is chosen to be i.i.d. $\mathcal{N}(0,1)$ . Each plot has two curves: one curve with $n_{r}$ fixed and $n_{t}$ growing, and the other curve with $n_{t}$ fixed and $n_{r}$ growing. In both plots, coherence time is $T=16$ . The difference is that in Fig. 3 the received power $P_{r}$ is held fixed (at 20 dB, i.e. $P$ is chosen so that $P_{r}=100$ ), whereas in Fig. 4 it is the transmit power $P$ that is held fixed (also at 20 dB, i.e. $P=100$ ). The relation between $P_{r}$ and $P$ is as follows:

[TABLE]

These figures also display the asymptotic limiting values of ${V\over C^{2}}$ computed via random-matrix theory:

1. When $n_{r}$ is fixed and $n_{t}\to\infty$ under fixed received power $P_{r}$ we have

[TABLE]

2. When $n_{t}$ is fixed and $n_{r}\to\infty$ under fixed received power $P_{r}$ we have

[TABLE]

3. When $n_{r}$ is fixed and $n_{t}\to\infty$ under fixed transmit power $P$ we have

[TABLE]

4. When $n_{t}$ is fixed and $n_{r}\to\infty$ under fixed transmit power $P$ we have

[TABLE]

Note that when the received power is fixed, reciprocity holds: the capacity of the $n_{t}\times n_{r}$ channel is the same as the capacity of the $n_{r}\times n_{t}$ one. Having information about dispersion, we may ask the more refined question: although capacities of the channels are the same, which one has better dispersion (i.e. causes smaller coding latency)?

From approximations (286) and (288), we can see that the channel dispersion is not symmetric in $n_{t},n_{r}$ . For example, in the setting of Fig. 3 we see that the delay penalty in the $n_{t}\ll n_{r}$ regime is $58\%$ of the penalty in the $n_{r}\ll n_{t}$ regime. Hence, in a two user channel, if user 1 has $n_{1}$ antennas and user 2 has $n_{2}>n_{1}$ antennas, then the asymptotic analysis suggests that the channel from user 1 to user 2 can support higher rates than the channel from user 2 to user 1 at finite blocklength.

Figure 4 shows the scenario where the transmit power is fixed. In this case, the capacity approaches a finite limit when $n_{r}$ is held fixed and $n_{t}\to\infty$ , but grows logarithmically when $n_{t}$ is fixed and $n_{r}\to\infty$ , as shown in equations (289) and (291). In this setting, the normalized dispersion approaches a finite limit when $n_{r}$ is fixed and $n_{t}\to\infty$ , yet it vanishes when $n_{t}$ is fixed and $n_{r}\to\infty$ . Consequently in this regime, we can always choose the number of receive antennas $n_{r}$ large enough so that our system can achieve a given fraction of capacity $\eta$ using blocklength $n$ . The normalized dispersion in this case is proportional to $1/\log^{2}(n_{r})$ .

Appendix A Existence of non-Gaussian caids

Proposition 24.

Let $S\subset\mathbb{R}^{n}$ be such that a) $0\in S$ and b) there exists a non-zero polynomial in $n$ variables with real coefficients vanishing on $S$ . Then there exists a random variable $X$ taking values in $\mathbb{R}^{n}$ with the property that its characteristic function $\Psi(t)\stackrel{\triangle}{=}\mathbb{E}\,[e^{i\sum_{j=1}^{n}t_{j}X_{j}}],t\in\mathbb{R}^{n}$ satisfies

[TABLE]

but there exists a $t_{0}\in\mathbb{R}^{n}$ such that $\Psi(t_{0})\neq e^{-{\|t_{0}\|_{2}^{2}\over 2}}$ (i.e. $X\not\sim\mathcal{N}(0,I_{n})$ ).

Remark 12.

The simplest application of this proposition is the following. Suppose that three random vectors in $\mathbb{R}^{3}$ have the property that projection onto any (2-dimensional) plane has the joint distribution $\mathcal{N}(0,I_{2})\times\mathcal{N}(0,I_{2})\times\mathcal{N}(0,I_{2})$ . Does it imply that their joint distribution is $\mathcal{N}(0,I_{3})\times\mathcal{N}(0,I_{3})\times\mathcal{N}(0,I_{3})$ ? Note that it is easy to argue that the joint distribution of any pair of them is indeed $\mathcal{N}(0,I_{3})\times\mathcal{N}(0,I_{3})$ and thus the only jointly Gaussian distribution that satisfies the requirements is indeed the i.i.d. triplet. However, the above proposition shows that the general answer is still negative. Here $S$ is the subset of $\mathbb{R}^{3\times 3}$ consisting of all matrices with determinant zero.

Proof.

We will slightly extend the argument of [32]. We will assume familiarity with basic commutative algebra on the level of [33]. Consider an identity expressing the well-known computation of the Gaussian characteristic function:

[TABLE]

Setting $\beta={1\over\alpha^{2}}$ , changing sign of $t$ we get

[TABLE]

Differentiating this in $\beta$ and setting $\beta={1\over 2}$ we get

[TABLE]

where $p_{2k}(t)$ is some polynomial of degree $2k$ with real coefficients (and involving only even powers of $t$ ). For later convenience, we also interchange $t$ and $x$ to get

[TABLE]

(Identity (293) also follows from the fact that Hermite polynomials times Gaussian density are eigenfunctions of the Fourier transform.)

Next, suppose that there is a polynomial $q(t_{1},\ldots,t_{n})$ such that $q$ vanishes on $S$ and each monomial $t_{1}^{k_{1}}\cdots t_{n}^{k_{n}}$ in $q$ has all $k_{1},\ldots,k_{n}$ even. Then, define the characteristic function

[TABLE]

We will argue that for $\epsilon$ sufficiently small, $\Psi$ is a characteristic function of some (obviously non-Gaussian) probability density function $f$ on $\mathbb{R}^{n}$ . By taking the inverse Fourier transform we get that

[TABLE]

where $e^{-{\|x\|_{2}^{2}\over 2}}g(x)$ is the inverse Fourier transform of the second term in (294). Since $\Psi(t)$ is even in each $t_{j}$ , we conclude that $f(x)$ is real. Since $q(0)=0$ (recall that $0\in S$ ) we have $\Psi(0)=1$ , and thus $\int_{\mathbb{R}^{n}}f=1$ . So to prove that $f$ is a valid density function for small $\epsilon$ we only need to show that

[TABLE]

To that end, notice that applying (293) to each monomial $\prod t_{j}^{2k_{j}}$ we get

[TABLE]

Multiplying the right-hand side by $e^{{\|x\|_{2}^{2}\over 2}}$ we conclude that the contribution of each monomial of $q$ to $\sup_{x}|g(x)|$ is bounded by

[TABLE]

Since there are finitely many monomials in $q$ , the proof of (295) and of validity of $\Psi(t)$ is done.

We are left to argue that there must necessarily exist a polynomial $q$ with the required properties. By assumption there exists some other polynomial $q_{0}$ vanishing on $S$ . Consider an inclusion of rings

[TABLE]

where $\mathbb{R}[x_{1},\ldots,x_{n}]$ denotes the ring of polynomials with variables $x_{1},\ldots,x_{n}$ and coefficients in $\mathbb{R}$ , and $\hookrightarrow$ denotes an inclusion map. This morphism of rings is obviously finite. Consider the ideal $(q_{0})$ of $\mathbb{R}[x_{1},\ldots,x_{n}]$ and denote as usual by $(q_{0})^{c}\stackrel{\triangle}{=}(q_{0})\cap T$ its contraction. We argue that $(q_{0})^{c}\neq(0)$ . Assume otherwise; then we have $(q_{0})^{c}=(0)$ and $\sqrt{(q_{0})}^{c}=(0)$ (since $\sqrt{(0)}=(0)$ as $T$ is an integral domain). Take all minimal primes of $(q_{0})$ , call these $\{\mathfrak{p}_{j}\}$ ; then the radical of $(q_{0})$ is the intersection of all prime ideals that contain it, i.e. $\sqrt{(q_{0})}=\cap_{j}\mathfrak{p}_{j}$ . Then, denoting $\mathfrak{q}_{j}\stackrel{\triangle}{=}\mathfrak{p}_{j}^{c}$ we get that $\cap_{j}\mathfrak{q}_{j}=(0)$ in $T$ . By “prime-avoidance”, cf. [33, Prop. 1.11], we know that $\cap_{j}\mathfrak{q}_{j}\subset(0)$ , with $(0)$ prime in $T$ , implies that $\mathfrak{q}_{j}\subset(0)$ for some $j$ , hence $\mathfrak{q}_{j}$ is the zero ideal for some $j$ . This contradicts the “going-up theorem”, cf. [33, Corollary 5.9], so we must have $(q_{0})^{c}\neq(0)$ , and hence we may take $q$ as an arbitrary non-zero element of $(q_{0})^{c}$ . ∎
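As a complement to the existence argument, a polynomial $q$ with the required properties can also be written down explicitly. The following is a sketch of an alternative route by symmetrization over sign flips, not the argument used in the proof above:

```latex
q(t_{1},\ldots,t_{n}) \;\triangleq\; \prod_{s\in\{-1,+1\}^{n}} q_{0}(s_{1}t_{1},\ldots,s_{n}t_{n}),
\qquad\text{e.g. } n=2,\; q_{0}=t_{1}-t_{2}\;\Longrightarrow\; q=(t_{1}^{2}-t_{2}^{2})^{2}.
```

The factor with $s=(1,\ldots,1)$ is $q_{0}$ itself, so $q$ vanishes on $S$ . Each factor is a non-zero polynomial and the polynomial ring is an integral domain, so $q\neq 0$ . Finally, flipping the sign of any single $t_{j}$ only permutes the factors, so $q$ is even in every variable, which forces every monomial of $q$ to have all exponents even.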

Appendix B Analysis of the Berry-Esseen constant

Proof of Lemma 13.

We begin with upper bounding the numerator in (143), i.e.

[TABLE]

The information density is given by

[TABLE]

where

[TABLE]

Define $W=i(x;Y,H)$ under the distribution $Y=Hx+Z$ . (298) reduces to

[TABLE]

where the scalar random variable

[TABLE]

is the sum of all the terms that do not depend on $x$ . Note that

[TABLE]

Therefore, the “centered” information density is

[TABLE]

where

[TABLE]

Hence we can upper bound the centered third moment as

[TABLE]

We now proceed to upper bound each term individually. First $S_{2}$ ,

[TABLE]

where

•

(313) follows since $H^{T}\Sigma^{-1}H$ is PSD, and $\mathbb{E}[H^{T}\Sigma^{-1}H]$ is also PSD as a non-negative combination of PSD matrices, so that both $x^{T}H^{T}\Sigma^{-1}Hx$ and $x^{T}\mathbb{E}[H^{T}\Sigma^{-1}H]x$ are non-negative

•

(314) follows since $H^{T}\Sigma^{-1}H=VDV^{T}$ where

[TABLE]

and $D\leq\frac{n_{t}}{P}I_{n_{t}}$ in the PSD ordering, so

[TABLE]

and

[TABLE]

Now we bound $S_{3}$ from (310),

[TABLE]

where

•

In (321), define $\tilde{x}=V^{T}x$ and expand the trace.

•

(322) follows from the triangle inequality, along with $|\sum_{i=1}^{n}a_{i}|^{3}\leq n^{2}\sum_{i=1}^{n}|a_{i}|^{3}$ .

•

In (323), we have used $\mathbb{E}[|Z|^{3}]\leq 2$ for $Z\sim\mathcal{N}(0,1)$ along with the bound

[TABLE]

Now notice that

[TABLE]

which can be viewed as the norm inequality $\|a\|_{3}\leq\|a\|_{2}$ for $a\in\mathbb{R}^{d}$ . Finally, we use $\|V^{T}x\|_{F}^{2}=\|x\|_{F}^{2}$ for any orthogonal matrix $V$ .
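The elementary facts used in bounding $S_{3}$ can be sanity-checked numerically. The sketch below evaluates the closed form $\mathbb{E}|Z|^{3}=2\sqrt{2/\pi}$ for a standard normal $Z$ , and tests $|\sum_{i}a_{i}|^{3}\leq n^{2}\sum_{i}|a_{i}|^{3}$ and $\|a\|_{3}\leq\|a\|_{2}$ on random vectors. The function names and the small relative tolerance are choices made for this example; random testing illustrates the inequalities but does not prove them.

```python
import math
import random

def abs_third_moment_std_normal():
    """E|Z|^3 for Z ~ N(0,1), via the closed form 2*sqrt(2/pi)."""
    return 2.0 * math.sqrt(2.0 / math.pi)

def p_norm(a, p):
    """l_p norm of a real vector given as a list."""
    return sum(abs(v) ** p for v in a) ** (1.0 / p)

def cube_of_sum_bound_holds(a):
    """Check |sum_i a_i|^3 <= n^2 * sum_i |a_i|^3 (with a tiny tolerance)."""
    lhs = abs(sum(a)) ** 3
    rhs = len(a) ** 2 * sum(abs(v) ** 3 for v in a)
    return lhs <= rhs * (1.0 + 1e-12)

def norm_bound_holds(a):
    """Check ||a||_3 <= ||a||_2 (with a tiny tolerance)."""
    return p_norm(a, 3) <= p_norm(a, 2) * (1.0 + 1e-12)

random.seed(1)
trials = [[random.gauss(0.0, 1.0) for _ in range(random.randint(1, 8))]
          for _ in range(1000)]
print(abs_third_moment_std_normal() <= 2.0)
print(all(cube_of_sum_bound_holds(a) for a in trials))
print(all(norm_bound_holds(a) for a in trials))
```

The first inequality is tight when all $a_{i}$ are equal, which is why a tolerance is used in the comparison.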

For the denominator in (143), the expression for $\frac{1}{T}\mathrm{Var}(W_{j})$ is given in (57)-(61). Note that the final term (61) is non-negative, so we have the lower bound

[TABLE]

where

[TABLE]

Hence $K_{1}^{\prime}>0$ whenever $P>0$ . Note that we use the assumption $\|x^{n}\|_{F}^{2}=nTP$ freely here, as stated before. Using the lower bound on the variance (327), we obtain the upper bound

[TABLE]

where all constants are non-negative. There are two cases based on which term achieves the max in the denominator. First, suppose

[TABLE]

Expanding the square yields

[TABLE]

Thus the terms in the numerator are bounded by

[TABLE]

where (333) uses the assumption $\|x_{j}\|_{F}\leq\delta n^{\frac{1}{4}}$ . Applying this to $B_{n}$ in (330), we see that in this case,

[TABLE]

where $C_{1},C_{2},C_{3}$ are non-negative constants.

Now take the case when

[TABLE]

Note that since $K_{1}^{\prime}>0$ , in this case we must also have $K_{2}^{\prime}>0$ for the above inequality to hold. Let $a$ be defined as follows

[TABLE]

Here $a<1$ since $K_{1}^{\prime}/K_{2}^{\prime}>0$ . Applying (336) yields

[TABLE]

With this, from (330) we obtain the following upper bound

[TABLE]

where (341) uses (339). Now, we can upper bound each term in (341) as

[TABLE]

where in (344) we have used $\sum_{i=1}^{n}a_{i}^{3}\leq n^{1/4}\left(\sum_{i=1}^{n}a_{i}^{4}\right)^{3/4}$ (easily obtained from p-norm inequalities), and both (342) and (346) use the assumption $\|x_{j}\|_{F}\leq\delta n^{\frac{1}{4}}$ . Using these bounds in (341), we obtain

[TABLE]

where $C_{1}^{\prime},C_{2}^{\prime},C_{3}^{\prime}$ are non-negative constants.
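The p-norm inequality used in (344) can be obtained, for instance, from Hölder's inequality with conjugate exponents $4/3$ and $4$ . The sketch below assumes $a_{i}\geq 0$ ; the constant $K$ in the last step is a hypothetical name introduced only for illustration:

```latex
\sum_{i=1}^{n} a_{i}^{3}\cdot 1
\;\leq\; \Big(\sum_{i=1}^{n}\big(a_{i}^{3}\big)^{4/3}\Big)^{3/4}\Big(\sum_{i=1}^{n}1^{4}\Big)^{1/4}
\;=\; n^{1/4}\Big(\sum_{i=1}^{n}a_{i}^{4}\Big)^{3/4},
\qquad\text{so}\quad \sum_{i=1}^{n}a_{i}^{4}\leq Kn \;\Longrightarrow\; \sum_{i=1}^{n}a_{i}^{3}\leq K^{3/4}\,n .
```

In other words, a linear-in- $n$ bound on the sum of fourth powers carries over to a linear-in- $n$ bound on the sum of third powers.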

From (335) and (347), we conclude that

[TABLE]

∎

Bibliography

[1] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[2] R. L. Dobrushin, “Mathematical problems in the Shannon theory of optimal coding of information,” in Proc. 4th Berkeley Symp. Mathematics, Statistics, and Probability, vol. 1, Berkeley, CA, USA, 1961, pp. 211–252.
[3] V. Strassen, “Asymptotische Abschätzungen in Shannon’s Informationstheorie,” in Trans. 3d Prague Conf. Inf. Theory, Prague, 1962, pp. 689–723.
[4] Y. Polyanskiy and S. Verdú, “Finite blocklength methods in information theory (tutorial),” in 2013 IEEE Int. Symp. Inf. Theory (ISIT), Istanbul, Turkey, Jul. 2013. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/ISIT 13_tutorial.pdf
[5] V. Y. Tan et al., “Asymptotic estimates in information theory with non-vanishing error probabilities,” Foundations and Trends® in Communications and Information Theory, vol. 11, no. 1-2, pp. 1–184, 2014.
[6] E. Telatar, “Capacity of multi-antenna Gaussian channels,” European Trans. Telecom., vol. 10, no. 6, pp. 585–595, 1999.
[7] G. J. Foschini and M. J. Gans, “On limits of wireless communications in a fading environment when using multiple antennas,” Wireless Personal Communications, vol. 6, no. 3, pp. 311–335, 1998.
[8] S. M. Alamouti, “A simple transmit diversity technique for wireless communications,” IEEE Journal on Selected Areas in Communications, vol. 16, no. 8, pp. 1451–1458, 1998.
