Intrinsic Capacity

Shengtian Yang; Rui Xu; Jun Chen; Jian-Kang Zhang

arXiv:1706.06858·cs.IT·April 28, 2020

TL;DR

This paper studies the maximum and minimum capacities of a channel when the realization of its intrinsic state is causally available at the encoder and/or the decoder, and obtains conclusive results for binary-input and binary-output channels.

Contribution

It introduces a framework for analyzing channel capacities with intrinsic states and causal information, including generalizations of key theorems and conditions for the usefulness of state information.

Findings

01 The maximum and minimum intrinsic capacities are characterized for binary-input and binary-output channels.

02 A generalization of the Birkhoff-von Neumann theorem is presented.

03 A condition under which causal state information at the encoder is useless is identified.

Abstract

Every channel can be expressed as a convex combination of deterministic channels with each deterministic channel corresponding to one particular intrinsic state. Such convex combinations are in general not unique, each giving rise to a specific intrinsic-state distribution. In this paper we study the maximum and the minimum capacities of a channel when the realization of its intrinsic state is causally available at the encoder and/or the decoder. Several conclusive results are obtained for binary-input channels and binary-output channels. Byproducts of our investigation include a generalization of the Birkhoff-von Neumann theorem and a condition on the uselessness of causal state information at the encoder.

Equations

$$F(x)=x\oplus N\quad\text{and}\quad G(x)=N,$$

$$\operatorname{dec}(W):=\Bigl\{\lambda\in\mathcal{P}_{\mathcal{D}}\colon W=\sum_{D\in\mathcal{D}}\lambda_{D}D\Bigr\},$$

$$\operatorname{C}_{10}(\lambda),\qquad\operatorname{C}_{01}(\lambda),\qquad\operatorname{C}_{11}(\lambda)$$

$$\operatorname{I}(\mu,W):=\sum_{x}\mu_{x}\operatorname{D}(W_{x,*}\,\|\,\mu W)$$

$$\operatorname{C}(W)=\operatorname{C}_{00}(\lambda):=\max_{\mu\in\mathcal{P}_{\mathcal{X}}}\operatorname{I}\Bigl(\mu,\sum_{D\in\mathcal{D}}\lambda_{D}D\Bigr)=\max_{\mu\in\mathcal{P}_{\mathcal{X}}}\operatorname{I}(\mu,W).$$

$$\operatorname{IC}_{f}(W):=\{\operatorname{C}_{f}(\lambda)\colon\lambda\in\operatorname{dec}(W)\}.$$

$$\operatorname{\underline{\operatorname{IC}}}_{f}(W):=\inf\operatorname{IC}_{f}(W)$$

$$\operatorname{\overline{\operatorname{IC}}}_{f}(W):=\sup\operatorname{IC}_{f}(W),$$

$$\operatorname{\Gamma}_{\lambda}(r):=\lambda\{D\in\mathcal{D}\colon\operatorname{rank}(D)=r\}$$

$$\operatorname{\underline{\operatorname{\Gamma}}}_{W}(r):=\min_{\lambda\in\operatorname{dec}(W)}\operatorname{\Gamma}_{\lambda}(r)\quad\text{and}\quad\operatorname{\overline{\operatorname{\Gamma}}}_{W}(r):=\max_{\lambda\in\operatorname{dec}(W)}\operatorname{\Gamma}_{\lambda}(r),$$

$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)\leq\begin{cases}(1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1))\log\gamma&\text{if }\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)<1,\\ 0&\text{otherwise,}\end{cases}$$

$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)\geq 1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1),$$

$$\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)=\sum_{j=1}^{n}\alpha_{j},\qquad\alpha=\Bigl(\min_{1\leq i\leq m}W_{i,j}\Bigr)_{j\in[\![1,n]\!]},$$

$$\gamma=(m+\operatorname{wt}(a)-a\bm{1}^{T})\wedge n,$$

$$a={\left\lfloor\bm{1}W^{\prime}\right\rfloor},\qquad W^{\prime}=\frac{W-\sum_{j=1}^{n}\alpha_{j}\mathrm{U}_{j}}{1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)},$$

$$\operatorname{\overline{\operatorname{IC}}}_{11}(W)=1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1);$$

$$\log\gamma\leq\operatorname{\overline{\operatorname{IC}}}_{11}(W)\leq\log(o-1)+\operatorname{\overline{\operatorname{\Gamma}}}_{W}(o)\log\frac{o}{o-1},$$

$$\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)=(g-m+1)_{+},\qquad g=\max_{1\leq j\leq n}(\bm{1}W)_{j},$$

$$\gamma=\operatorname{wt}(a)+\Bigl(m-\sum_{j\in\operatorname{supp}(a)}b_{j}\Bigr)_{+},\qquad o=m\wedge n,$$

$$a={\left\lfloor\bm{1}W\right\rfloor},\qquad b={\left\lceil\bm{1}W\right\rceil}.$$

$$\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)$$

$$\operatorname{\overline{\operatorname{IC}}}_{10}(W)=\operatorname{C}\left(\begin{pmatrix}1&0\\ \operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)&1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)\end{pmatrix}\right).$$

$$W=\begin{pmatrix}1-\epsilon_{1}&\epsilon_{1}\\ \epsilon_{2}&1-\epsilon_{2}\end{pmatrix},$$

$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)=\operatorname{\underline{\operatorname{IC}}}_{01}(W)=|1-\epsilon_{1}-\epsilon_{2}|,$$

$$\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)=\begin{cases}\log\left(2^{\frac{\epsilon_{2}h(\epsilon_{1})-(1-\epsilon_{1})h(\epsilon_{2})}{1-\epsilon_{1}-\epsilon_{2}}}+2^{\frac{\epsilon_{1}h(\epsilon_{2})-(1-\epsilon_{2})h(\epsilon_{1})}{1-\epsilon_{1}-\epsilon_{2}}}\right)&\text{if }\epsilon_{1}+\epsilon_{2}\neq 1,\\ 0&\text{if }\epsilon_{1}+\epsilon_{2}=1,\end{cases}$$

$$\operatorname{\overline{\operatorname{IC}}}_{11}(W)=\operatorname{\overline{\operatorname{IC}}}_{01}(W)=1-|\epsilon_{1}-\epsilon_{2}|,$$

$$\operatorname{\overline{\operatorname{IC}}}_{10}(W)=\begin{cases}1&\text{if }\epsilon_{1}=\epsilon_{2},\\ \log\left(1+(1-|\epsilon_{1}-\epsilon_{2}|)|\epsilon_{1}-\epsilon_{2}|^{\frac{|\epsilon_{1}-\epsilon_{2}|}{1-|\epsilon_{1}-\epsilon_{2}|}}\right)&\text{if }|\epsilon_{1}-\epsilon_{2}|\in(0,1),\\ 0&\text{if }|\epsilon_{1}-\epsilon_{2}|=1,\end{cases}$$

$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)=\operatorname{\underline{\operatorname{IC}}}_{01}(W)=|1-2\epsilon|,$$

$$\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)=1-h(\epsilon),$$

Full text

Shengtian Yang, Rui Xu, Jun Chen, Jian-Kang Zhang

This work was supported in part by the National Natural Science Foundation of China under Grant 61571398 and in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada under a Discovery Grant. This paper is to be presented in part at the 2017 IEEE International Symposium on Information Theory.

S. Yang is with the School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China, and was also with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: [email protected]).

R. Xu, J. Chen, and J.-K. Zhang are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada (e-mail: [email protected]; [email protected]; [email protected]).

Index Terms — Birkhoff-von Neumann theorem, channel capacity, deterministic channel, state information.

1 Introduction

A discrete channel is commonly viewed as a black box with the input-output relation characterized by a stochastic matrix. In practice, it is often possible to obtain some additional information (known as the state information) by probing the channel. The knowledge of the state information might be useful in increasing the channel capacity. Note that, given each state, the channel can again be viewed as a black box and can potentially be further probed. One may continue this process until the black box is fully opened, i.e., the channel becomes deterministic given the acquired state information. This line of thought suggests that every channel has its own intrinsic state, which fully captures the randomness of the channel, and any state information acquired via channel probing is a degenerate version of this intrinsic state. As such, the intrinsic capacity, defined as the capacity of a channel when its intrinsic state is revealed, determines the ultimate capacity gain one can hope for by probing the channel.

It turns out that the intrinsic capacity of a channel is not necessarily uniquely defined. Consider a binary symmetric channel with crossover probability $0.5$: $W=(W_{x,y})_{x\in\{0,1\},y\in\{0,1\}}=(\begin{smallmatrix}0.5&0.5\\ 0.5&0.5\end{smallmatrix}),$ where each entry $W_{x,y}$ denotes the conditional probability $W(y\mid x)$ of output $y$ given input $x$. The capacity of $W$ is clearly zero. For this channel, we consider the following two models:

$$F(x)=x\oplus N\quad\text{and}\quad G(x)=N,$$

where $\oplus$ denotes the modulo- $2$ addition and $N$ is uniformly distributed over $\{0,1\}$ . It is easy to verify that they both have the conditional probability distribution $W$ . If the actual model of $W$ is $F$ , then for every realization of $N$ , $W$ becomes a deterministic perfect channel, $(\begin{smallmatrix}1&0\\ 0&1\end{smallmatrix})$ or $(\begin{smallmatrix}0&1\\ 1&0\end{smallmatrix}),$ so that the capacity of $W$ with $N$ available at the encoder and/or the decoder increases to one. On the other hand, if the actual model of $W$ is $G$ , then for every realization of $N$ , $W$ becomes a deterministic useless channel, $(\begin{smallmatrix}1&0\\ 1&0\end{smallmatrix})$ or $(\begin{smallmatrix}0&1\\ 0&1\end{smallmatrix}),$ and hence, even with $N$ known at both sides, the capacity of $W$ is still zero. In fact, it will be seen that, for every number $r\in[0,1]$ , one can find a model for $W$ such that the resulting intrinsic capacity is $r$ .
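The two models can be checked numerically. The following is a minimal sketch, not the authors' code: it represents each realization of $N$ as a zero-one matrix, confirms that both models mix to the same $W$, and evaluates the capacity with the state known at both ends under the assumption that this capacity equals the average of $\log\operatorname{rank}(D)$ over the states. The function `interpolated_model` is a hypothetical construction that puts weight $r$ on model $F$ and $1-r$ on model $G$, which is one way to reach the value $r$ for $f=11$.

```python
from math import log2

# Deterministic channels {0,1} -> {0,1} as 0/1 matrices (rows = inputs, columns = outputs).
IDENTITY = [[1, 0], [0, 1]]  # model F, N = 0: y = x
FLIP = [[0, 1], [1, 0]]      # model F, N = 1: y = x xor 1
CONST0 = [[1, 0], [1, 0]]    # model G, N = 0: y = 0
CONST1 = [[0, 1], [0, 1]]    # model G, N = 1: y = 1


def mix(weights, channels):
    """Convex combination sum_D lambda_D * D of equally sized matrices."""
    rows, cols = len(channels[0]), len(channels[0][0])
    return [[sum(w * D[i][j] for w, D in zip(weights, channels))
             for j in range(cols)] for i in range(rows)]


def rank(D):
    """Rank of a deterministic channel: the number of outputs it can produce."""
    return sum(1 for j in range(len(D[0])) if any(row[j] for row in D))


def c11(weights, channels):
    """Assumed formula for state known at both ends: sum_D lambda_D * log2 rank(D)."""
    return sum(w * log2(rank(D)) for w, D in zip(weights, channels))


def interpolated_model(r):
    """Hypothetical mixture: weight r on model F and weight 1 - r on model G."""
    weights = [r / 2, r / 2, (1 - r) / 2, (1 - r) / 2]
    return weights, [IDENTITY, FLIP, CONST0, CONST1]


for r in (0.0, 0.3, 1.0):
    weights, channels = interpolated_model(r)
    print(r, mix(weights, channels), c11(weights, channels))
```

Setting `r = 1` recovers model $F$ and `r = 0` recovers model $G$; every intermediate `r` yields the same channel matrix with a different value of `c11`.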

This example indicates that a channel may admit different decompositions into deterministic channels. All these decompositions are mathematically legitimate, though the actual way the deterministic channels are mixed to produce the given channel depends on the underlying physical mechanism. In this work we study the minimum and the maximum intrinsic capacities of a channel over all admissible decompositions. They will be referred to as the lower intrinsic capacity and the upper intrinsic capacity, respectively. For the aforementioned channel $W$, the lower and upper intrinsic capacities are 0 and 1, respectively. Since the causal state information may be available at the encoder, the decoder, or both, there are in total three different notions of lower and upper intrinsic capacities of a channel $W$, denoted by $\operatorname{\underline{\operatorname{IC}}}_{f}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$, for $f=10,01,11$, where the two bits indicate whether the state information is available at the encoder and the decoder, respectively.

The main contributions of this work are:

We study the structure of the convex polytope $\operatorname{dec}(W)$ , which consists of all convex combinations of deterministic channels for channel $W$ , with a particular focus on its vertices. It is shown that $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$ for all $f\in\{10,01,11\}$ and $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ are attained at certain vertices of $\operatorname{dec}(W)$ (Theorem A.1).

We prove a generalization of the Birkhoff-von Neumann theorem for a family $\mathcal{W}[a,b]$ of channel matrices with integer-valued column-sum vector constraints $a$ and $b$ from below and above, respectively (Theorem 4.7). It is shown that $\mathcal{W}[a,b]$ is convex and its vertices are exactly all deterministic channels in $\mathcal{W}[a,b]$ . Using this fundamental result, we determine the exact values of $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ when the input or the output is binary. General lower and upper bounds are further provided for the nonbinary cases (Theorems 3.3 and 3.4), and in some cases, the exact value of $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ is also determined.

We obtain the exact values of $\operatorname{\underline{\operatorname{IC}}}_{10}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{10}(W)$ when $W$ is a binary-output channel (Theorem 3.5), and obtain the exact values of $\operatorname{\underline{\operatorname{IC}}}_{01}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)$ (Proposition 3.6) when $W$ is a binary-input channel. An interesting phenomenon observed is that $\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)$ for binary-output $W$ , where $\operatorname{C}(W)$ denotes the capacity of $W$ . In other words, every binary-output channel can be generated through a certain mechanism such that the capacity remains the same if the source of randomness is causally revealed to the encoder. We further prove that the causal state information at the encoder is useless for a broad class of channels (Theorem 4.12). Finally, by providing some counterexamples, we show that the results such as $\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)=\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ are specific to binary-input or binary-output channels, and do not hold in general (Example E.1 and Proposition F.1).

The rest of this paper is organized as follows. Section 2 lists some common notations used throughout this paper. Section 3 provides the definitions of various notions of (lower/upper) intrinsic capacity and a summary of the main results of this paper. The proofs and some other relevant findings are presented in Section 4 and the appendices.

2 Notations

Although most notations will be defined at their first occurrences, some common ones are listed here for easy reference.

$[\![x,y]\!]$

The set of integers in the interval $[x,y]$ .

$B^{A}$

The set of all maps $f\colon A\to B$ , or equivalently, the set of all indexed families $x=(x_{i}\in B)_{i\in A}$ (a generalized form of sequences). If $A=[\![1,n]\!]$ , then $B^{A}$ degenerates to the Cartesian product $B^{n}$ . In this paper, a vector (for example, in $\mathbf{R}^{n}$ ) will be regarded as a row vector, and an all- $c$ vector is usually denoted by $\bm{c}$ .

$x\wedge y$

The minimum of $x$ and $y$ .

$x\vee y$

The maximum of $x$ and $y$ .

$\operatorname{supp}(x)$

The support set ${\left\{i\in I\colon x_{i}\neq 0\right\}}$ of $x=(x_{i})_{i\in I}$ .

$\operatorname{wt}(x)$

The weight ${\left|\operatorname{supp}(x)\right|}$ of $x=(x_{i})_{i\in I}$ .

${\left\lfloor x\right\rfloor}$

The largest integer $\leq x$ . If the argument is a sequence $x=(x_{i})_{i\in I}$ , then ${\left\lfloor x\right\rfloor}:=({\left\lfloor x_{i}\right\rfloor})_{i\in I}$ . The same convention also applies to other functions such as $|x|$ , ${\left\lceil x\right\rceil}$ , $(x)_{+}$ , and $(x)_{-}$ .

${\left\lceil x\right\rceil}$

The smallest integer $\geq x$ .

$(x)_{+}$

$x\vee 0$ .

$(x)_{-}$

$x\wedge 0$ .

$\log x$

$\log_{2}x$ .

3 Definitions and Main Results

Let $\mathcal{X}$ and $\mathcal{Y}$ be two finite sets. A channel $W\colon\mathcal{X}\to\mathcal{Y}$ is a stochastic matrix with each entry $W_{x,y}$, or conventionally $W(y\mid x)$, denoting the probability of output $y\in\mathcal{Y}$ given input $x\in\mathcal{X}$. A deterministic channel $D\colon\mathcal{X}\to\mathcal{Y}$ is a special channel whose stochastic matrix is a zero-one matrix; as such, it uniquely identifies a map of $\mathcal{X}$ into $\mathcal{Y}$. In the sequel, deterministic channels and maps will be regarded as equivalent objects and denoted using the same notation.

It is clear that the set of all channels forms a convex polytope in $\mathbf{R}^{\mathcal{X}\times\mathcal{Y}}$ . We denote this polytope by $\mathcal{W}_{\mathcal{X},\mathcal{Y}}$ , or succinctly, $\mathcal{W}$ . The deterministic channels are exactly the vertices of $\mathcal{W}$ , and every channel can be expressed as a convex combination of them. This simple observation suggests that, for any channel, one can define a random state variable (referred to as the intrinsic state) given which the channel becomes deterministic. We are interested in characterizing the capacity of a channel when its intrinsic state is available at the encoder and/or the decoder. Such capacity results are of fundamental importance since they delineate the potential gain that can be achieved by probing the channel.

For a given channel, there are often multiple ways to write it as a convex combination of deterministic channels; as a consequence, the distribution of its intrinsic state is in general not uniquely defined. Let $\mathcal{D}_{\mathcal{X},\mathcal{Y}}$ (or simply $\mathcal{D}$ ) denote the set of all deterministic channels $\mathcal{X}\to\mathcal{Y}$ . Then the set of all possible convex decompositions of a channel $W$ is given by

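$$\operatorname{dec}(W):=\Bigl\{\lambda\in\mathcal{P}_{\mathcal{D}}\colon W=\sum_{D\in\mathcal{D}}\lambda_{D}D\Bigr\},$$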

where $\mathcal{P}_{\mathcal{D}}$ is the set of all probability distributions over $\mathcal{D}$ and can be regarded as the set $\mathcal{W}_{\{\emptyset\},\mathcal{D}}$ of matrices or vectors. For each intrinsic-state distribution $\lambda\in\mathcal{P}_{\mathcal{D}}$ , we define the resulting capacities when the intrinsic state is causally available at the encoder, the decoder, or both, by

[TABLE]

respectively (see [1, Chapter 7]), where

$$\operatorname{I}(\mu,W):=\sum_{x}\mu_{x}\operatorname{D}(W_{x,*}\,\|\,\mu W)$$

and the flag $f\in\{10,01,11\}$ indicates the availability of the intrinsic state at the encoder and the decoder. For example, $10$ means that the intrinsic state is available at the encoder but not at the decoder. For completeness, we also define the capacity with no encoder and decoder side information:

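$$\operatorname{C}(W)=\operatorname{C}_{00}(\lambda):=\max_{\mu\in\mathcal{P}_{\mathcal{X}}}\operatorname{I}\Bigl(\mu,\sum_{D\in\mathcal{D}}\lambda_{D}D\Bigr)=\max_{\mu\in\mathcal{P}_{\mathcal{X}}}\operatorname{I}(\mu,W).$$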

Then, given a channel $W$ , we can define its intrinsic-capacity set by

$$\operatorname{IC}_{f}(W):=\{\operatorname{C}_{f}(\lambda)\colon\lambda\in\operatorname{dec}(W)\}.$$

Furthermore, we define the lower intrinsic capacity and the upper intrinsic capacity of $W$ for $f\in\{10,01,11\}$ by

$$\operatorname{\underline{\operatorname{IC}}}_{f}(W):=\inf\operatorname{IC}_{f}(W)$$

and

$$\operatorname{\overline{\operatorname{IC}}}_{f}(W):=\sup\operatorname{IC}_{f}(W),$$

respectively.
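As a numerical companion to these definitions, the sketch below evaluates the mutual information $\operatorname{I}(\mu,W)=\sum_{x}\mu_{x}\operatorname{D}(W_{x,*}\,\|\,\mu W)$ and approximates the capacity with no side information, $\operatorname{C}(W)=\max_{\mu}\operatorname{I}(\mu,W)$. The maximization uses the Blahut-Arimoto iteration, a standard technique that is swapped in here for illustration and is not taken from the paper. All function names are hypothetical, and the iteration count is an arbitrary choice.

```python
from math import log2


def output_dist(mu, W):
    """Output distribution mu W induced by input distribution mu."""
    return [sum(mu[x] * W[x][y] for x in range(len(W))) for y in range(len(W[0]))]


def divergence(p, q):
    """Relative entropy D(p || q) in bits, with the convention 0 log 0 = 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(mu, W):
    """I(mu, W) = sum_x mu_x D(W_x || mu W), in bits."""
    out = output_dist(mu, W)
    return sum(mu[x] * divergence(W[x], out) for x in range(len(W)) if mu[x] > 0)


def capacity(W, iterations=2000):
    """Approximate C(W) by Blahut-Arimoto, starting from the uniform input.

    Sketch only: it assumes no input probability underflows to zero.
    """
    m = len(W)
    mu = [1.0 / m] * m
    for _ in range(iterations):
        out = output_dist(mu, W)
        # Multiplicative update: mu_x <- mu_x * 2^{D(W_x || mu W)}, then normalize.
        scores = [mu[x] * 2.0 ** divergence(W[x], out) for x in range(m)]
        total = sum(scores)
        mu = [s / total for s in scores]
    return mutual_information(mu, W)


print(capacity([[0.9, 0.1], [0.1, 0.9]]))  # binary symmetric channel
print(capacity([[0.5, 0.5], [0.5, 0.5]]))  # the channel from the introduction
```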

Remark 3.1.

Using the functional representation lemma [1, p. 626][2, Lemma 1], it can be easily shown that $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$ provides an upper bound on the capacity of $W$ with any form of state information whose availability at the encoder and the decoder is specified by $f$ . On the other hand, from the minimax theorem [3], Proposition B.1, and [1, Theorems 7.1 and 7.2, Eqs. (7.2) and (7.3), and Remark 7.6], it follows that $\operatorname{\underline{\operatorname{IC}}}_{f}(W)$ is exactly the capacity of the compound channel $(S)_{p_{S}\in\operatorname{dec}(W)}$ with the availability of $S$ at the encoder and the decoder specified by $f$ , where $S$ is $\mathcal{D}$ -valued, i.e., a random deterministic channel, and $p_{S}$ is selected arbitrarily from $\operatorname{dec}(W)$ .

The main results of this paper are given as follows. With no loss of generality, we assume from now on that the channel $W$ is from $[\![1,m]\!]$ to $[\![1,n]\!]$ , where $m,n\geq 2$ .

Definition 3.2.

Let

$$\operatorname{\Gamma}_{\lambda}(r):=\lambda\{D\in\mathcal{D}\colon\operatorname{rank}(D)=r\}$$

be the rank probability function over $\mathcal{D}$ induced by $\lambda\in\operatorname{dec}(W)$ . The lower and the upper rank- $r$ probabilities of $W$ are then defined by

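$$\operatorname{\underline{\operatorname{\Gamma}}}_{W}(r):=\min_{\lambda\in\operatorname{dec}(W)}\operatorname{\Gamma}_{\lambda}(r)\quad\text{and}\quad\operatorname{\overline{\operatorname{\Gamma}}}_{W}(r):=\max_{\lambda\in\operatorname{dec}(W)}\operatorname{\Gamma}_{\lambda}(r),$$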

respectively.

Bounds for $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(r)$ and $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(r)$ when $r=1$ and $r=m\wedge n$ are given by Propositions 4.8 and 4.10, respectively. Most of our results will be expressed in terms of these quantities.

Theorem 3.3.

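$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)\leq\begin{cases}(1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1))\log\gamma&\text{if }\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)<1,\\ 0&\text{otherwise,}\end{cases}$$

$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)\geq 1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1),$$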

where

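$$\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)=\sum_{j=1}^{n}\alpha_{j},\qquad\alpha=\Bigl(\min_{1\leq i\leq m}W_{i,j}\Bigr)_{j\in[\![1,n]\!]},$$

$$\gamma=(m+\operatorname{wt}(a)-a\bm{1}^{T})\wedge n,$$

$$a={\left\lfloor\bm{1}W^{\prime}\right\rfloor},\qquad W^{\prime}=\frac{W-\sum_{j=1}^{n}\alpha_{j}\mathrm{U}_{j}}{1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)},$$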

and $\mathrm{U}_{j}$ is the deterministic useless channel with its $j$ -th column being all one.

If $m=2$ or $n=2$ , then $\operatorname{\underline{\operatorname{IC}}}_{11}(W)=1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)$ .

Theorem 3.4.

If $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)>0$ or $m=2$ or $n=2$ , then

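$$\operatorname{\overline{\operatorname{IC}}}_{11}(W)=1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1);$$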

otherwise,

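$$\log\gamma\leq\operatorname{\overline{\operatorname{IC}}}_{11}(W)\leq\log(o-1)+\operatorname{\overline{\operatorname{\Gamma}}}_{W}(o)\log\frac{o}{o-1},$$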

where

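$$\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)=(g-m+1)_{+},\qquad g=\max_{1\leq j\leq n}(\bm{1}W)_{j},$$

$$\gamma=\operatorname{wt}(a)+\Bigl(m-\sum_{j\in\operatorname{supp}(a)}b_{j}\Bigr)_{+},\qquad o=m\wedge n,$$

$$a={\left\lfloor\bm{1}W\right\rfloor},\qquad b={\left\lceil\bm{1}W\right\rceil}.$$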

If $m\leq n$ and $\bm{1}W\leq\bm{1}$ , then $\operatorname{\overline{\operatorname{IC}}}_{11}(W)=\log m$ .

If $m\geq n$ and $\bm{1}W\geq\bm{1}$ , then $\operatorname{\overline{\operatorname{IC}}}_{11}(W)=\log n$ .
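The rank-1 probabilities that drive Theorems 3.3 and 3.4 are simple functions of the channel matrix: the upper one is the sum over columns of the column minimum, $\sum_{j}\min_{i}W_{i,j}$, and the lower one is $(g-m+1)_{+}$ with $g$ the largest column sum. The following sketch, with hypothetical function names, computes both and, for channels with $m=2$ or $n=2$, returns the pair $(1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1),\,1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1))$ that the two theorems give for the lower and upper values of $\operatorname{IC}_{11}$.

```python
def column_sums(W):
    """Column-sum vector 1 W of an m x n channel matrix."""
    return [sum(row[j] for row in W) for j in range(len(W[0]))]


def upper_rank1_prob(W):
    """Upper rank-1 probability: sum over columns of the column minimum."""
    return sum(min(row[j] for row in W) for j in range(len(W[0])))


def lower_rank1_prob(W):
    """Lower rank-1 probability: (g - m + 1)_+ with g the largest column sum."""
    m = len(W)
    g = max(column_sums(W))
    return max(g - m + 1, 0.0)


def ic11_range_binary(W):
    """(lower, upper) IC_11 for binary-input or binary-output channels."""
    m, n = len(W), len(W[0])
    if m != 2 and n != 2:
        raise ValueError("exact values are stated only for m = 2 or n = 2")
    return 1 - upper_rank1_prob(W), 1 - lower_rank1_prob(W)


print(ic11_range_binary([[0.5, 0.5], [0.5, 0.5]]))
print(ic11_range_binary([[0.9, 0.1], [0.2, 0.8]]))
```

For the uniform channel of the introduction this reproduces the extreme values 0 and 1 discussed there.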

Theorem 3.5.

If $n=2$ , then

$$\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)$$

and

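$$\operatorname{\overline{\operatorname{IC}}}_{10}(W)=\operatorname{C}\left(\begin{pmatrix}1&0\\ \operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)&1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)\end{pmatrix}\right).$$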

Proposition 3.6.

If $m=2$ , then for every $\lambda\in\operatorname{dec}(W)$ , $\operatorname{C}_{01}(\lambda)=\operatorname{C}_{11}(\lambda)$ , so that $\operatorname{\underline{\operatorname{IC}}}_{01}(W)=1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)=1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ .

The above results enable us to obtain explicit characterizations of all lower and upper intrinsic capacities for binary-input binary-output channels. The relevant expressions are collected in the following example.

Example 3.7.

If $m=n=2$ and

$$W=\begin{pmatrix}1-\epsilon_{1}&\epsilon_{1}\\ \epsilon_{2}&1-\epsilon_{2}\end{pmatrix},$$

then

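$$\operatorname{\underline{\operatorname{IC}}}_{11}(W)=\operatorname{\underline{\operatorname{IC}}}_{01}(W)=|1-\epsilon_{1}-\epsilon_{2}|,$$

$$\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)=\begin{cases}\log\left(2^{\frac{\epsilon_{2}h(\epsilon_{1})-(1-\epsilon_{1})h(\epsilon_{2})}{1-\epsilon_{1}-\epsilon_{2}}}+2^{\frac{\epsilon_{1}h(\epsilon_{2})-(1-\epsilon_{2})h(\epsilon_{1})}{1-\epsilon_{1}-\epsilon_{2}}}\right)&\text{if }\epsilon_{1}+\epsilon_{2}\neq 1,\\ 0&\text{if }\epsilon_{1}+\epsilon_{2}=1,\end{cases}$$

$$\operatorname{\overline{\operatorname{IC}}}_{11}(W)=\operatorname{\overline{\operatorname{IC}}}_{01}(W)=1-|\epsilon_{1}-\epsilon_{2}|,$$

$$\operatorname{\overline{\operatorname{IC}}}_{10}(W)=\begin{cases}1&\text{if }\epsilon_{1}=\epsilon_{2},\\ \log\left(1+(1-|\epsilon_{1}-\epsilon_{2}|)|\epsilon_{1}-\epsilon_{2}|^{\frac{|\epsilon_{1}-\epsilon_{2}|}{1-|\epsilon_{1}-\epsilon_{2}|}}\right)&\text{if }|\epsilon_{1}-\epsilon_{2}|\in(0,1),\\ 0&\text{if }|\epsilon_{1}-\epsilon_{2}|=1,\end{cases}$$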

where $h(\epsilon):=-\epsilon\log\epsilon-(1-\epsilon)\log(1-\epsilon)$ is the binary entropy function. If $W$ is a binary symmetric channel with crossover probability $\epsilon$ (i.e., $\epsilon_{1}=\epsilon_{2}=\epsilon$ ), then

[TABLE]

If $W$ is a Z-channel with crossover probability $\theta$ (i.e., $\epsilon_{1}=0$ and $\epsilon_{2}=\theta$ ), then

[TABLE]

The case of Z-channel is special, because in this case $W$ admits a unique convex decomposition into deterministic channels:

[TABLE]

The lower and the upper intrinsic capacities of these two special channels are plotted in Figs. 1 and 2.
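The closed-form expressions of Example 3.7 are easy to tabulate. The sketch below is an illustrative transcription with hypothetical function names, not the authors' code. It evaluates, for a binary-input binary-output channel with parameters $\epsilon_{1},\epsilon_{2}$, the values $|1-\epsilon_{1}-\epsilon_{2}|$ (lower, $f=11$ and $f=01$), $\operatorname{C}(W)$ (lower, $f=10$), $1-|\epsilon_{1}-\epsilon_{2}|$ (upper, $f=11$ and $f=01$), and the Z-channel capacity with crossover $|\epsilon_{1}-\epsilon_{2}|$ (upper, $f=10$). The tolerance used to detect $\epsilon_{1}+\epsilon_{2}=1$ is an implementation choice.

```python
from math import log2


def h(p):
    """Binary entropy function in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def capacity_binary(e1, e2):
    """Closed-form C(W) for W = [[1 - e1, e1], [e2, 1 - e2]]."""
    s = 1 - e1 - e2
    if abs(s) < 1e-12:  # e1 + e2 = 1: both rows coincide, capacity is zero
        return 0.0
    t1 = (e2 * h(e1) - (1 - e1) * h(e2)) / s
    t2 = (e1 * h(e2) - (1 - e2) * h(e1)) / s
    return log2(2 ** t1 + 2 ** t2)


def z_capacity(theta):
    """Capacity of a Z-channel with crossover probability theta."""
    if theta <= 0.0:
        return 1.0
    if theta >= 1.0:
        return 0.0
    return log2(1 + (1 - theta) * theta ** (theta / (1 - theta)))


def intrinsic_capacities(e1, e2):
    """Lower and upper intrinsic capacities for the flags 11, 01 and 10."""
    d = abs(e1 - e2)
    return {
        "lower_11": abs(1 - e1 - e2),
        "lower_01": abs(1 - e1 - e2),
        "lower_10": capacity_binary(e1, e2),
        "upper_11": 1 - d,
        "upper_01": 1 - d,
        "upper_10": z_capacity(d),
    }


print(intrinsic_capacities(0.1, 0.1))  # binary symmetric channel
print(intrinsic_capacities(0.0, 0.5))  # Z-channel
```

For a Z-channel the lower and upper values coincide for each flag, which is consistent with the uniqueness of its convex decomposition.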

4 Proofs of Main Results

It is clear that $\operatorname{dec}(W)$ is bounded, closed, and convex, so it can be easily shown that $\operatorname{IC}_{f}(W)$ is a closed interval and that $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$ for all $f\in\{10,01,11\}$ and $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ are attained at certain vertices of $\operatorname{dec}(W)$ (Theorem A.1). As such, it is of great importance to study the structure of $\operatorname{dec}(W)$ . A series of results on the vertices of $\operatorname{dec}(W)$ is provided in Appendix A. Although these results shed useful light on the structure of $\operatorname{dec}(W)$ , the characterizations are still too coarse for our purpose. It will be seen that additional insights can be gained by taking the objective functions into consideration.

4.1 $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$

We first provide a complete characterization of $\lambda$ that achieves $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ or $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ .

Proposition 4.1.

Let

[TABLE]

and

[TABLE]

For $\lambda\in\operatorname{dec}(W)$ , $\operatorname{C}_{11}(\lambda)=\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ iff there is no $U\in\mathfrak{U}_{+}$ such that $U\subseteq\operatorname{supp}(\lambda)$ ; $\operatorname{C}_{11}(\lambda)=\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ iff there is no $U\in\mathfrak{U}_{-}$ such that $U\subseteq\operatorname{supp}(\lambda)$ .

Proof.

It suffices to prove the first part, because the second part can be proved in the same vein.

(Sufficiency) If there exists some $\beta\in\operatorname{dec}(W)$ such that $\operatorname{C}_{11}(\beta)<\operatorname{C}_{11}(\lambda)$ , then

[TABLE]

and

[TABLE]

so that $U=\operatorname{supp}((\lambda-\beta)_{+})\in\mathfrak{U}_{+}$ and $U\subseteq\operatorname{supp}(\lambda)$ , a contradiction.

(Necessity) For $U\in\mathfrak{U}_{+}$ , if $U\subseteq\operatorname{supp}(\lambda)$ , then there is a vector $\alpha\in\mathbf{R}^{\mathcal{D}}$ such that $\operatorname{supp}((\alpha)_{+})\subseteq\operatorname{supp}(\lambda)$ , $\sum_{D}\alpha_{D}D=0$ , and $\sum_{D}\alpha_{D}\log\operatorname{rank}(D)>0$ . Let $\beta=\lambda-t\alpha$ . For sufficiently small $t>0$ , it can be verified that $\beta\in\operatorname{dec}(W)$ and $\operatorname{C}_{11}(\beta)<\operatorname{C}_{11}(\lambda)=\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ , which is absurd. ∎
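The perturbation $\beta=\lambda-t\alpha$ used in the necessity part can be made concrete on the uniform binary channel. The sketch below is a hypothetical worked instance: $\lambda$ is uniform over the four deterministic channels on $\{0,1\}$, and $\alpha$ puts $+1$ on the two perfect channels and $-1$ on the two useless ones, so that $\sum_{D}\alpha_{D}D=0$ and $\sum_{D}\alpha_{D}\log\operatorname{rank}(D)>0$. It assumes, consistently with the use of $\log\operatorname{rank}(D)$ in the proof, that $\operatorname{C}_{11}(\lambda)=\sum_{D}\lambda_{D}\log\operatorname{rank}(D)$.

```python
from math import log2

# The four deterministic channels {0,1} -> {0,1}: two perfect, two useless.
CHANNELS = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
    [[1, 0], [1, 0]],
    [[0, 1], [0, 1]],
]


def rank(D):
    """Number of outputs a deterministic channel can produce."""
    return sum(1 for j in range(len(D[0])) if any(row[j] for row in D))


def combine(weights):
    """Signed or convex combination sum_D weights_D * D."""
    return [[sum(w * D[i][j] for w, D in zip(weights, CHANNELS))
             for j in range(2)] for i in range(2)]


def c11(weights):
    """Assumed formula: sum_D lambda_D * log2 rank(D)."""
    return sum(w * log2(rank(D)) for w, D in zip(weights, CHANNELS))


lam = [0.25, 0.25, 0.25, 0.25]    # a decomposition of the uniform channel
alpha = [1.0, 1.0, -1.0, -1.0]    # direction that leaves the channel unchanged
t = 0.1                           # small step keeping every weight nonnegative
beta = [l - t * a for l, a in zip(lam, alpha)]

print(combine(alpha))             # the zero matrix
print(combine(lam), combine(beta))
print(c11(lam), c11(beta))
```

Both `lam` and `beta` decompose the same channel, while `c11(beta)` is strictly smaller, so `lam` cannot attain the lower intrinsic capacity.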

Definition 4.2.

A subset $S\subseteq\mathcal{D}$ is said to be $\operatorname{IC}_{11}$ -minimized, or succinctly, $\operatorname{IC}$ -minimized (resp., $\operatorname{IC}$ -maximized) if there is a $\lambda\in\mathcal{P}_{\mathcal{D}}$ such that $\operatorname{supp}(\lambda)=S$ and $\operatorname{C}_{11}(\lambda)=\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ (resp., $\operatorname{C}_{11}(\lambda)=\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ ), where $W=\sum_{D\in\mathcal{D}}\lambda_{D}D$ .

A simple consequence of Proposition 4.1 is:

Proposition 4.3.

If $S\subseteq\mathcal{D}$ is $\operatorname{IC}$ -minimized (resp., $\operatorname{IC}$ -maximized), then any $\lambda\in\mathcal{P}_{\mathcal{D}}$ supported on $S$ achieves $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ (resp., $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ ), where $W=\sum_{D\in\mathcal{D}}\lambda_{D}D$ . As a consequence, any nonempty subset of $S$ is also $\operatorname{IC}$ -minimized (resp., $\operatorname{IC}$ -maximized).

By Proposition 4.3, it is important to identify patterns of sets that are not $\operatorname{IC}$ -minimized or $\operatorname{IC}$ -maximized. Some simple patterns that are not $\operatorname{IC}$ -minimized or $\operatorname{IC}$ -maximized are given as follows and their proofs are relegated to Appendix C.

Proposition 4.4.

If $m\leq n$ , then any deterministic perfect channels $P_{1}$ , …, $P_{\ell}$ such that at least one column of $P_{1}+\cdots+P_{\ell}$ has a weight greater than one are not $\operatorname{IC}$ -minimized.

Proposition 4.5.

If $m\geq n$ , then any deterministic perfect channels $P_{1}$ , …, $P_{\ell}$ such that at least one column of $P_{1}+\cdots+P_{\ell}$ has no entry equal to $\ell$ are not $\operatorname{IC}$ -minimized.

Proposition 4.6.

For $D\in\mathcal{D}$ , if $\operatorname{wt}(D_{*,j})\leq m-2$ , then $\{D,\mathrm{U}_{j}\}$ is not $\operatorname{IC}$ -maximized.

The next result is a generalization of the Birkhoff-von Neumann theorem, which plays a crucial role in proving Theorems 3.3 and 3.4. Our proof hinges on an extension of the ideas in [4, 5].

Theorem 4.7.

Let $a$ and $b$ be two $n$ -dimensional integer-valued vectors such that $a\leq b$ , namely, $a_{j}\leq b_{j}$ for $1\leq j\leq n$ . Let

[TABLE]

and $\mathcal{D}[a,b]:=\mathcal{W}[a,b]\cap\mathcal{D}$ , where $\bm{1}$ denotes the $m$ -dimensional all-one row vector. If $\mathcal{W}[a,b]$ is not empty, then $\mathcal{W}[a,b]$ is convex and the vertices of $\mathcal{W}[a,b]$ are exactly the matrices in $\mathcal{D}[a,b]$ .

Proof.

It is clear that $\mathcal{W}[a,b]$ , if nonempty, is a convex set. We will show that any matrix $W\in\mathcal{W}[a,b]$ with non-integer entries cannot be a vertex of $\mathcal{W}[a,b]$ . There are two cases:

Case (a): There is a non-integer entry in a non-boundary column.

Case (b): All non-integer entries are in the boundary columns.

Here, a column is called a boundary column if its sum is either $a_{j}$ or $b_{j}$ , where $j$ is the index of the column.

In either case, we can pick a non-integer entry, say the $(i_{0},j_{0})$ entry, which in Case (a) must be a non-integer entry in a non-boundary column. By the following argument, we will find a chain or loop of non-integer entries of the matrix, which will be used to prove that the matrix is not extremal.

Because the $(i_{0},j_{0})$ entry is not an integer, there exists at least another entry in the same row that is also not an integer, say the $(i_{0},j_{1})$ entry. If the $j_{1}$ -th column is not on the boundary, then we are done. If however the $j_{1}$ -th column is on the boundary, then there exists at least another non-integer entry in the same column, say $(i_{1},j_{1})$ . In general, after $t$ steps, we have visited $t+1$ columns, with the chain

[TABLE]

Except for the $j_{0}$-th column, every column has exactly one inbound entry $(i_{s-1},j_{s})$ and one outbound entry $(i_{s},j_{s})$, where $1\leq s\leq t$. Now in the $(t+1)$-th step, by the same argument, we find the $(i_{t},j_{t+1})$ entry in the $j_{t+1}$-th column. If this column has already been visited, then $j_{t+1}=j_{s}$ for some $0\leq s\leq t-1$ and we are done. If this column is new but not on the boundary, we are also done. If however this new column is on the boundary, then we can further find an outbound entry in this column, say $(i_{t+1},j_{t+1})$, and proceed to the $(t+2)$-th step. Because there are finitely many columns, we will always end up with a chain

[TABLE]

which only happens in Case (a), or a loop

[TABLE]

for some $0\leq\ell<k-1$ .

Then we can construct a matrix $N$ by setting all outbound entries (in the chain or the loop) $N_{i_{s},j_{s}}=1$ , all inbound entries $N_{i_{s-1},j_{s}}=-1$ , and all other entries to be zero. It is clear that

[TABLE]

in the former case and

[TABLE]

in the latter case, where $\mathrm{e}_{k}=(1\{j=k\})_{j\in[\![1,n]\!]}$ .

Let $U=W+\epsilon N$ and $V=W-\epsilon N$ . It is clear that $U,V\in\mathcal{W}[a,b]$ for sufficiently small $\epsilon>0$ . It is also clear that $W=\frac{1}{2}U+\frac{1}{2}V$ and $U\neq V$ , that is, $W$ is not a vertex of $\mathcal{W}[a,b]$ .

Therefore, we have $\mathcal{V}\subseteq\mathcal{D}[a,b]$ , where $\mathcal{V}$ denotes the set of all vertices of $\mathcal{W}[a,b]$ . It remains to show that $\mathcal{D}[a,b]\subseteq\mathcal{V}$ . For any $W\in\mathcal{D}[a,b]$ , if $W=\alpha U+(1-\alpha)V$ with $U,V\in\mathcal{W}[a,b]$ and $\alpha\in(0,1)$ , then for every $1\leq i\leq m$ ,

[TABLE]

which however implies that $\mathrm{e}_{i}U=\mathrm{e}_{i}V$ for every $1\leq i\leq m$ , or $U=V$ . ∎
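The chain argument in the first half of the proof can be traced on a small example. The sketch below assumes that $\mathcal{W}[a,b]$ is the set of channel matrices whose column-sum vector lies between $a$ and $b$ entrywise (inferred from the description of the theorem, since the displayed definition is not reproduced here). The matrix, the bounds, and the chain are hypothetical illustration data with 0-based indices: columns 0 and 2 are non-boundary, column 1 is a boundary column, and the chain alternates outbound entries ($+1$) and inbound entries ($-1$).

```python
def in_w_ab(M, a, b, tol=1e-12):
    """Membership test for the assumed set W[a, b]: a <= column sums <= b."""
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in M)
    nonneg = all(x >= -tol for row in M for x in row)
    cols = [sum(row[j] for row in M) for j in range(len(M[0]))]
    cols_ok = all(a[j] - tol <= cols[j] <= b[j] + tol for j in range(len(cols)))
    return rows_ok and nonneg and cols_ok


def perturbation(shape, path):
    """Matrix N with +1 on outbound and -1 on inbound entries of the path."""
    rows, cols = shape
    N = [[0.0] * cols for _ in range(rows)]
    for step, (i, j) in enumerate(path):
        N[i][j] = 1.0 if step % 2 == 0 else -1.0
    return N


def shift(M, N, eps):
    """Return M + eps * N."""
    return [[M[i][j] + eps * N[i][j] for j in range(len(M[0]))]
            for i in range(len(M))]


W = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5]]
a, b = [0, 1, 0], [1, 1, 1]
# Chain (i0,j0) -> (i0,j1) -> (i1,j1) -> (i1,j2) through the boundary column 1.
chain = [(0, 0), (0, 1), (1, 1), (1, 2)]

N = perturbation((2, 3), chain)
U, V = shift(W, N, 0.25), shift(W, N, -0.25)
print(U, in_w_ab(U, a, b))
print(V, in_w_ab(V, a, b))
```

Since `U` and `V` are distinct members of the set and `W` is their midpoint, `W` is not a vertex, which is the conclusion the proof draws for any matrix with a non-integer entry.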

Equipped with Theorem 4.7, we proceed to derive bounds for the lower and the upper rank probabilities (Definition 3.2). These bounds are useful in estimating the lower and the upper intrinsic capacities.

Proposition 4.8.

[TABLE]

where

[TABLE]

Proof.

By Theorem 4.7, $W$ can be expressed as a convex combination of deterministic channels of rank $\geq 2$ if $g\leq m-1$ , in which case, $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)=0$ . Otherwise, let $\ell$ be the index of the column with the sum $g>m-1$ . Consider the convex combination

[TABLE]

It is clear that $W^{\prime}$ cannot be a convex combination of deterministic channels of rank $\geq 2$ unless the sum of its $\ell$ -th column is $\leq m-1$ . To this end, we set $t=g-m+1$ , which is the minimum value required, and we have

[TABLE]

and

[TABLE]

for $j\neq\ell$ , so that $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)=g-m+1$ .

If $W$ has the following convex decomposition

[TABLE]

then $W^{\prime}$ is a valid stochastic matrix iff $s_{j}\leq\alpha_{j}$ for all $j$ . Therefore, $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)=\sum_{j=1}^{n}\alpha_{j}$ . ∎

Proposition 4.9.

If $\lambda\in\operatorname{dec}(W)$ achieves $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ , then $\operatorname{\Gamma}_{\lambda}(1)=\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ . In particular, if $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)>0$ , then $\lambda_{\mathrm{U}_{\ell}}=\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ and $\operatorname{\Gamma}_{\lambda}(2)=1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ , where $\ell=\arg\max_{1\leq j\leq n}(\bm{1}W)_{j}$ .

Proof.

If $\lambda$ is zero on all deterministic useless channels, then $\operatorname{\Gamma}_{\lambda}(1)=\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)=0$ .

If $\lambda_{\mathrm{U}_{j}}>0$ for some $j$ , then $\lambda$ must be zero on all deterministic channels whose $j$ -th column weight is less than $m-1$ (Propositions 4.3 and 4.6). Therefore, we must have $\lambda_{\mathrm{U}_{j}}=\operatorname{\Gamma}_{\lambda}(1)=\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ (Proposition 4.8) and $\operatorname{\Gamma}_{\lambda}(2)=1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ . ∎

Proposition 4.10.

If $m\leq n$ , then

[TABLE]

where

[TABLE]

and

[TABLE]

Furthermore, if $\beta=0$ , then $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(m)=1$ .

If $m\geq n$ , then

[TABLE]

where

[TABLE]

If $h=1$ , then $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(n)=1$ .

Proof.

If $m\leq n$ , then the sum of every column of a deterministic channel of rank $m$ is at most 1, and for every $1\leq j\leq n$ , $W$ admits a convex decomposition into deterministic channels with the $j$ -th column sum at most $\operatorname{wt}(W_{*,j})$ . Thus for every $\lambda\in\operatorname{dec}(W)$ and every $j$ ,

[TABLE]

so that

[TABLE]

for $\operatorname{wt}(W_{*,j})>1$ and hence $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(m)\leq 1-\beta$ . If $\beta=0$ , which implies that $(\bm{1}W)_{j}\leq 1$ for all $1\leq j\leq n$ , then $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(m)=1$ (Theorem 4.7).

If $m\geq n$ , then the sum of every column of a deterministic channel of rank $n$ is at least $1$ , so that, for every $\lambda\in\operatorname{dec}(W)$ and every $1\leq j\leq n$ ,

[TABLE]

and hence $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(n)\leq h$ . If $h=1$ , which implies $(\bm{1}W)_{j}\geq 1$ for all $1\leq j\leq n$ , then $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(n)=1$ (Theorem 4.7). ∎

We are now ready to prove Theorems 3.3 and 3.4.

Proof of Theorem 3.3.

To find an upper bound of $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ , we need to find a convex decomposition of $W$ as “bad” as possible. To this end, we can first extract from $W$ a collection of useless channels with the total probability $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)$ (Proposition 4.8), that is,

[TABLE]

If $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)=1$ , then $\operatorname{\underline{\operatorname{IC}}}_{11}(W)=0$ ; otherwise,

[TABLE]

It is clear that $W^{\prime}\in\mathcal{W}[a,\bm{m}]$ , where $\bm{m}$ denotes the all- $m$ row vector. The best deterministic channels in $\mathcal{W}[a,\bm{m}]$ are those with the maximum number of nonzero columns. The rank of those matrices is

[TABLE]

so $\operatorname{\underline{\operatorname{IC}}}_{11}(W^{\prime})\leq\log((m+\operatorname{wt}(a)-a\bm{1}^{\mathsf{T}})\wedge n)$ (Theorem 4.7).

Let $\lambda$ be a vertex of $\operatorname{dec}(W)$ that attains $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ . Then

[TABLE]

Finally, the special case of $m=2$ or $n=2$ can be easily verified. ∎

Proof of Theorem 3.4.

Let $\lambda$ be a vertex of $\operatorname{dec}(W)$ that attains $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ .

If $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)>0$ or $m=2$ or $n=2$ , then $\operatorname{\Gamma}_{\lambda}(r)=0$ for all $r>2$ (Proposition 4.9), so that $\operatorname{\overline{\operatorname{IC}}}_{11}(W)=1-\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ (Proposition 4.8). The remaining case is then $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)=0$ .

To find a lower bound of $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ , we need to find a convex decomposition of $W$ as “good” as possible. It is clear that $W\in\mathcal{W}[a,b]$ , so $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ is bounded below by the capacity of the worst deterministic channels in $\mathcal{W}[a,b]$ (Theorem 4.7), namely those with the minimum number of nonzero columns. The capacity of such a channel is $\log\gamma$ , so that $\operatorname{\overline{\operatorname{IC}}}_{11}(W)\geq\log\gamma$ .

On the other hand,

[TABLE]

where $o=m\wedge n$ . The remaining part of the proof is straightforward. ∎

The bounds given by Theorems 3.3 and 3.4 can be improved in various ways. In Theorem 3.3, if $\gamma=m\wedge n$ , then the upper bound for $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(m\wedge n)$ in Proposition 4.10 can be used to improve the upper bound for $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ ; if $\gamma=m=n$ , the upper bound for $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ can be improved by Proposition 4.4 (see Example C.2). The lower bound for $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ can also be improved by $(1-\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1))\vee\operatorname{C}(W)$ because $\operatorname{C}(W)\leq\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ . However, all these improvements are somewhat ad hoc. The fundamental problem to be solved is how to choose $\lambda$ in order to approach or achieve the lower or the upper intrinsic capacity. In particular, based on Theorem 3.3, we have the following conjecture:

Conjecture 4.11.

For $\lambda\in\operatorname{dec}(W)$ , if $\operatorname{C}_{11}(\lambda)=\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ , then $\operatorname{\Gamma}_{\lambda}(1)=\operatorname{\overline{\operatorname{\Gamma}}}_{W}(1)$ .

4.2 $\operatorname{\underline{\operatorname{IC}}}_{10}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{10}(W)$

Although it is difficult to compute $\operatorname{\underline{\operatorname{IC}}}_{10}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{10}(W)$ in general, their exact values can be determined in the binary-output case, as is shown by Theorem 3.5.

Proof of Theorem 3.5.

Since $n=2$ , we only need to choose two maps from all the $m^{|\mathcal{D}|}=m^{2^{m}}$ maps of $\mathcal{D}$ into $[\![1,m]\!]$ for constructing the capacity-achieving distributions. We denote these two maps by $u$ and $v$ . The optimal strategy for choosing $u,v$ is to maximize $W^{\prime}_{u,1}$ and minimize $W^{\prime}_{v,1}$ , where $W^{\prime}_{u,y}=\sum_{D\in\mathcal{D}}\lambda_{D}D_{u(D),y}$ . There are only two classes of deterministic channels in $\mathcal{D}$ , rank $1$ and rank $2$ . For $D$ of rank $1$ , it does not matter how we choose the values of $u(D)$ and $v(D)$ . For $D$ of rank $2$ , however, we choose $u(D)=i_{1}$ such that $D_{i_{1},1}=1$ and choose $v(D)=i_{2}$ such that $D_{i_{2},1}=0$ . Then we have

[TABLE]

and

[TABLE]

By Proposition 4.8, the maximum of $\operatorname{\Gamma}_{\lambda}(1)=\lambda_{\mathrm{U}_{1}}+\lambda_{\mathrm{U}_{2}}$ is $\alpha_{1}+\alpha_{2}$ with each $\alpha_{j}$ being the maximum of feasible values of $\lambda_{\mathrm{U}_{j}}$ , so that

[TABLE]

Observing that these two rows are exactly those of $W$ , we further have $\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)$ . Again by Proposition 4.8, the minimum $\operatorname{\underline{\operatorname{\Gamma}}}_{W}(1)$ of $\operatorname{\Gamma}_{\lambda}(1)$ is $(g-m+1)_{+}$ . With no loss of generality, we suppose $g=(\bm{1}W)_{1}$ . Then the minima of feasible values of $\lambda_{\mathrm{U}_{1}}$ and $\lambda_{\mathrm{U}_{2}}$ are $(g-m+1)_{+}$ and $0$ , respectively, so that

[TABLE]

The fact that $\operatorname{\underline{\operatorname{IC}}}_{10}(W)=\operatorname{C}(W)$ for binary-output channels is quite intriguing (although it is not true in general when the output is non-binary (Example E.1)). It implies that every binary-output channel can be simulated in a certain way that the capacity cannot be increased even when the encoder has causal access to the source of randomness, i.e., the intrinsic state. The following result shows that, in fact for a fairly broad class of channels, the causal state information at the encoder is useless as far as the capacity is concerned.

Theorem 4.12.

Let $W=W^{\prime}W^{\prime\prime}$ , where $W^{\prime}$ is a channel with binary output and $W^{\prime\prime}$ is a channel with binary input and $W^{\prime\prime}_{1,*}\neq W^{\prime\prime}_{2,*}$ . Suppose

[TABLE]

where $S$ denotes the channel state and $p_{S}$ is its distribution. The capacity of $W$ cannot be increased by the causal state information $S$ at the encoder iff all $K^{(s)}$ with $p_{S}(s)>0$ are $(i_{1},i_{2})$ -ended for some fixed $i_{1}$ and $i_{2}$ , where a binary output channel $K$ is said to be $(i_{1},i_{2})$ -ended if $K_{i_{1},1}=\min_{i}K_{i,1}$ and $K_{i_{2},1}=\max_{i}K_{i,1}$ . In other words, all row vectors of $K$ are contained in the line segment from endpoint $K_{i_{1},*}$ to endpoint $K_{i_{2},*}$ .

Proof.

(Sufficiency) By [1, Theorem 7.2 and Remark 7.6], we consider the channel $V\colon[\![1,m]\!]^{\mathcal{S}}\to[\![1,n]\!]$ given by $V=V^{\prime}W^{\prime\prime}$ and

[TABLE]

Because every channel $K^{(s)}$ is $(i_{1},i_{2})$ -ended, it is easy to show that $V^{\prime}$ is also $(i_{1},i_{2})$ -ended, where $i_{1}$ and $i_{2}$ are regarded as two constant maps from $\mathcal{S}$ to $[\![1,m]\!]$ . Then every row vector of $V$ is contained in the line segment between $V^{\prime}_{i_{1},*}W^{\prime\prime}$ and $V^{\prime}_{i_{2},*}W^{\prime\prime}$ , which implies that $V$ has a capacity-achieving input probability distribution supported on $\{i_{1},i_{2}\}$ (Proposition D.1), and consequently the capacity of $W$ cannot be increased by the causal state information at the encoder.

(Necessity) If the capacity of $W$ cannot be increased by its causal state information at the encoder, then a capacity-achieving input probability distribution of $V$ must have a support, say $\{i_{1},i_{2}\}$ , so that for every map $u\colon\mathcal{S}\to[\![1,m]\!]$ , the vector

[TABLE]

is contained in the line segment between $V_{i_{1},*}$ and $V_{i_{2},*}$ (Proposition D.2), where $i_{1}$ and $i_{2}$ are understood as two constant maps from $\mathcal{S}$ to $[\![1,m]\!]$ . With no loss of generality, we assume $V^{\prime}_{i_{1},1}\leq V^{\prime}_{i_{2},1}$ . For any $t\in\mathcal{S}$ and any $i_{0}\in[\![1,m]\!]$ , we can take $u(t)=i_{0}$ and $u(s)=i_{1}$ for $s\neq t$ , then we get $V^{\prime}_{u,1}\geq V^{\prime}_{i_{1},1}$ , so that $K^{(t)}_{i_{0},1}\geq K^{(t)}_{i_{1},1}$ . Similarly, we have $K^{(t)}_{i_{0},1}\leq K^{(t)}_{i_{2},1}$ . Therefore, every $K^{(s)}$ is $(i_{1},i_{2})$ -ended. ∎
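The condition in Theorem 4.12 is easy to test mechanically. The sketch below, with the hypothetical helper name `common_ends`, takes the binary-output channels $K^{(s)}$ with $p_{S}(s)>0$ as row-stochastic lists and looks for a pair $(i_{1},i_{2})$ for which every one of them is $(i_{1},i_{2})$ -ended; indices are 0-based in the code, and exact comparison of entries is assumed to be adequate for the inputs.

```python
def common_ends(channels):
    """Return (i1, i2) such that every channel is (i1, i2)-ended, or None.

    channels: list of m x 2 row-stochastic matrices (lists of rows).
    A channel K is (i1, i2)-ended if K[i1][0] is the minimum and
    K[i2][0] is the maximum of its first column.
    """
    m = len(channels[0])
    lows, highs = set(range(m)), set(range(m))
    for K in channels:
        col = [row[0] for row in K]
        lows &= {i for i in range(m) if col[i] == min(col)}
        highs &= {i for i in range(m) if col[i] == max(col)}
    if lows and highs:
        return min(lows), min(highs)
    return None

K1 = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
K2 = [[0.0, 1.0], [0.3, 0.7], [0.3, 0.7]]
K3 = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
print(common_ends([K1, K2]))  # a common pair exists
print(common_ends([K1, K3]))  # the extreme rows are swapped, so none exists
```

The two index sets can be intersected independently because the requirement on $i_{1}$ involves only column minima and the requirement on $i_{2}$ only column maxima.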

It can be shown via a perturbation and continuity argument that the uselessness of the causal state information at the encoder is not restricted to the channels covered by Theorem 4.12. However, we have not been able to identify a simple explicit condition under which the sufficiency part of Theorem 4.12 can be extended. For example, consider a seemingly natural condition postulated by the following conjecture.

Conjecture 4.13.

Let $W$ be a channel from $[\![1,2]\!]$ to $[\![1,n]\!]$ . Suppose

[TABLE]

where $S$ denotes the state of channel. If for every $1\leq j\leq n$ , $K^{(s)}_{1,j}$ and $K^{(s)}_{2,j}$ have an order (either $\leq$ or $\geq$ ) independent of $s$ , then the capacity of $W$ cannot be increased by the causal state information available at the encoder.

This conjecture is obviously true for $n=2$ . Numerical results indicate that it also holds in many cases when $n>2$ . However, it turns out to be false in general, as shown by Example E.2.

Theorem 4.12 imposes no restriction on the distribution of the channel state. This universal property motivates us to introduce the following definition.

Definition 4.14.

The state information $S$ of a channel $W(y\mid x,s)$ is said to be universally useless at the encoder if for any $p_{S}$ , the capacity of $W$ with $S$ causally available at the encoder is equal to the capacity of $W^{\prime}(y\mid x)=\sum_{s}p_{S}(s)W(y\mid x,s)$ .
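To make Definition 4.14 concrete, the following sketch builds the two channels whose capacities are being compared. It assumes the standard Shannon-strategy construction, in which a causally informed encoder chooses a map $u\colon\mathcal{S}\to\mathcal{X}$ and sees the channel $V_{u,y}=\sum_{s}p_{S}(s)W(y\mid u(s),s)$ ; the function names and the 0-based indexing are illustrative choices.

```python
from fractions import Fraction
from itertools import product

def strategy_channel(K, p):
    """K[s][x][y] = W(y | x, s), p[s] = p_S(s). Returns {u: output row} over all maps u."""
    num_states, m, n = len(K), len(K[0]), len(K[0][0])
    return {
        u: [sum(p[s] * K[s][u[s]][y] for s in range(num_states)) for y in range(n)]
        for u in product(range(m), repeat=num_states)
    }

def averaged_channel(K, p):
    """W'(y | x) = sum_s p_S(s) W(y | x, s), i.e. the channel without state information."""
    m, n = len(K[0]), len(K[0][0])
    return [[sum(p[s] * K[s][x][y] for s in range(len(K))) for y in range(n)]
            for x in range(m)]

# State 0 leaves the input bit unchanged, state 1 flips it; both are equally likely.
K = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
p = [Fraction(1, 2), Fraction(1, 2)]
V = strategy_channel(K, p)
print(averaged_channel(K, p))  # every row is uniform
print(V[(0, 1)], V[(1, 0)])    # two strategies with deterministic, distinct outputs
```

In this example the state information is clearly not useless: the averaged channel has identical rows, whereas the strategy channel contains two rows that are distinct unit vectors. Universal uselessness requires the two capacities to agree for every $p_{S}$ .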

This definition is not void in view of Theorem 4.12 (in fact, according to our numerical results, many channels not covered by Theorem 4.12 also satisfy this definition). Now consider the channel model shown in Fig. 3, where the channel state $S$ is distributed according to $p_{S}$ , and (noisy) state observations $S_{\mathrm{E}}$ and $S_{\mathrm{D}}$ generated by $S$ through $p_{S_{\mathrm{E}},S_{\mathrm{D}}|S}$ are causally available at the encoder and the decoder, respectively. Let $\operatorname{C}(W,S_{\mathrm{E}},S_{\mathrm{D}},p_{S})$ denote the capacity of this channel model.

It is instructive to study the following example (see also Fig. 4) where

[TABLE]

For this example, we assume that $p_{S_{\mathrm{E}}|S}$ is a binary symmetric channel with crossover probability $p\in[0,\frac{1}{2}]$ , and $p_{S_{\mathrm{D}}|S}$ is a binary symmetric channel with crossover probability $q=0.25$ ; furthermore, we assume that $p_{S_{\mathrm{E}}|S}$ is physically degraded with respect to $p_{S_{\mathrm{D}}|S}$ when $p\geq q=0.25$ , and the other way around when $p\leq q=0.25$ . To gain a better understanding, we plot $\operatorname{C}(W,S_{\mathrm{E}},S_{\mathrm{D}},p_{S})$ against $p$ for $p\in[0,\frac{1}{2}]$ in Fig. 5. It turns out that, somewhat counterintuitively, $\operatorname{C}(W,S_{\mathrm{E}},S_{\mathrm{D}},p_{S})$ is maximized when the encoder side information coincides with the decoder side information (i.e., $p=0.25$ ) rather than when the encoder has access to the perfect state information $S$ (i.e., $p=0$ ). As shown by the following theorem, this is in fact a general phenomenon for any channel whose state information is universally useless at the encoder.

Theorem 4.15.

If the state information of $W$ is universally useless at the encoder, then $\operatorname{C}(W,S_{\mathrm{E}},S_{\mathrm{D}},p_{S})$ is maximized when $S_{\mathrm{E}}=S_{\mathrm{D}}$ almost surely (assuming $p_{S,S_{\mathrm{D}}}$ is fixed but $p_{S_{\mathrm{E}}|S,S_{\mathrm{D}}}$ can be arbitrary).

Proof.

It is clear that among all possible forms of encoder side information $S_{\mathrm{E}}$ , $\operatorname{C}(W,S_{\mathrm{E}},S_{\mathrm{D}},p_{S})$ is maximized when $S_{\mathrm{E}}=(S,S_{\mathrm{D}})$ (since any other form of $S_{\mathrm{E}}$ can be viewed as a degraded version of it), i.e.,

[TABLE]

Note that

[TABLE]

where (a) follows from the universal-uselessness property of the state information of $W$ , and the constant $\emptyset$ means no information. This completes the proof. ∎

Roughly speaking, Theorem 4.15 implies that, for the class of channels satisfying Definition 4.14, what the encoder really needs to know is not the state information, but the decoder’s knowledge of the state information; in other words, for such channels, it is important to maintain consensus between the encoder and the decoder. It is also worth noting that Theorem 4.15 reduces to Definition 4.14 when there is no decoder side information.

Another surprising phenomenon revealed by Fig. 5 is that, as $p$ moves away from $0.25$ , the capacity not only decreases but actually drops to the value corresponding to the case of no encoder side information once $p$ passes certain thresholds. Again, such a phenomenon is not confined to that specific example. An investigation of this phenomenon in the context where the encoder side information is a degraded version of the decoder side information can be found in [6].

Similar to Theorem 3.5, we can also determine the exact values of $\operatorname{\underline{\operatorname{IC}}}_{01}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)$ when the input is binary. In this case, we have $\operatorname{C}_{01}(\lambda)=\operatorname{C}_{11}(\lambda)$ for all $\lambda\in\operatorname{dec}(W)$ , so that $\operatorname{\underline{\operatorname{IC}}}_{01}(W)=\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)=\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ (see Proposition 3.6 and Appendix F). The general case of $\operatorname{\underline{\operatorname{IC}}}_{01}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)$ is however quite difficult. Currently, we only know that $\operatorname{\overline{\operatorname{IC}}}_{01}(W)=\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ does not hold in general (Proposition F.1).

5 Conclusion

We have studied the lower and the upper intrinsic capacities of a channel $W$ , denoted by $\operatorname{\underline{\operatorname{IC}}}_{f}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$ , for three different scenarios ( $f=10,01,11$ ) in terms of the availability of the causal state information at the encoder and/or the decoder. Their values are determined in almost all cases when the input or the output is binary, with only two exceptions (the binary-input nonbinary-output channels for $f=10$ and the nonbinary-input binary-output channels for $f=01$ ). A deeper understanding of the relevant optimization problems (especially the structure of $\operatorname{dec}(W)$ ) is needed for further progress.

The lower and the upper intrinsic capacities are inherent properties of a channel with clear operational meanings. In particular, they characterize the potential capacity gains that can be achieved with a direct access to the generator of channel randomness by the encoder and/or the decoder. More generally, the notion of intrinsic capacity provides a useful perspective for studying the values of encoder and decoder side information. For example, our analysis of $\operatorname{\underline{\operatorname{IC}}}_{10}(W)$ reveals that for a broad class of channels, the capacity is not necessarily maximized when the encoder has access to the perfect state information. We believe that this surprising finding is just the tip of the iceberg, and this line of research can be fruitfully pursued to uncover many previously unknown phenomena.

Appendix A The Structure of $\operatorname{dec}(W)$

Theorem A.1.

The set $\operatorname{dec}(W)$ is a bounded, closed convex polytope. For each $f\in\{10,01,11\}$ , $\operatorname{IC}_{f}(W)$ is a closed interval and $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$ can be attained at some vertex of $\operatorname{dec}(W)$ . Furthermore, $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ can also be attained at some vertex of $\operatorname{dec}(W)$ .

Proof.

By definition, it is clear that $\operatorname{dec}(W)$ is a bounded, closed convex polytope, so that $\operatorname{IC}_{f}(W)$ is a closed interval (Proposition B.2). It is also easy to see that $\operatorname{C}_{f}(\lambda)$ attains its maximum $\operatorname{\overline{\operatorname{IC}}}_{f}(W)$ at some vertex of $\operatorname{dec}(W)$ and that $\operatorname{C}_{11}(\lambda)$ attains its minimum $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ at some vertex of $\operatorname{dec}(W)$ (Proposition B.2 and [7, Proposition 3.4.1]). ∎

In light of Theorem A.1, we proceed to study the structure of $\operatorname{dec}(W)$ with a focus on its vertices. Our approach is analogous to [4].

Proposition A.2.

Let

[TABLE]

or

[TABLE]

where

[TABLE]

is called the incidence matrix. A probability distribution $\lambda\in\operatorname{dec}(W)$ is a vertex iff for $S\in\mathfrak{S}$ , $S\subseteq\operatorname{supp}(\lambda)$ implies $S=\emptyset$ , or in other words, iff $\operatorname{rank}(I_{S,*})={\left|S\right|}$ .

Proof.

Note that for every $i\in[\![1,m]\!]$ ,

[TABLE]

(Sufficiency) If $\lambda=t\beta+(1-t)\gamma$ for some $\beta,\gamma\in\operatorname{dec}(W)$ and some $0<t<1$ , then $\beta-\gamma=(\lambda-\gamma)/t$ and $\operatorname{supp}(\gamma)\subseteq\operatorname{supp}(\lambda)$ , so that $\operatorname{supp}(\beta-\gamma)\in\mathfrak{S}$ and $\operatorname{supp}(\beta-\gamma)\subseteq\operatorname{supp}(\lambda)$ , hence $\operatorname{supp}(\beta-\gamma)=\emptyset$ , and therefore $\lambda=\beta=\gamma$ is a vertex.

(Necessity) For every nonempty $S\in\mathfrak{S}$ , there is a vector $\alpha\in\mathbf{R}^{\mathcal{D}}$ such that $\operatorname{supp}(\alpha)=S$ and $\sum_{D}\alpha_{D}D=0$ . Let $\beta=\lambda+t\alpha$ and $\gamma=\lambda-t\alpha$ with $t\neq 0$ , so that $\lambda=(\beta+\gamma)/2$ with $\beta\neq\gamma$ . Since $\lambda$ is a vertex, $\beta$ and $\gamma$ must not be elements of $\operatorname{dec}(W)$ for all $t\neq 0$ , or equivalently, $S\not\subseteq\operatorname{supp}(\lambda)$ . ∎
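The rank criterion of Proposition A.2 can be checked directly on small examples. The sketch below assumes that the incidence matrix has entries $I_{D,(i,j)}=D_{i,j}$ (as suggested by its use in the proof sketch of Algorithm A.6) and represents a deterministic channel $D$ as a tuple whose $i$ -th entry is the output assigned to input $i$ , with 0-based indices. The helper names `rank` and `is_vertex` are illustrative.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a rational matrix by Gauss-Jordan elimination (exact arithmetic)."""
    rows = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    for c in range(len(rows[0]) if rows else 0):
        piv = next((i for i in range(rk, len(rows)) if rows[i][c] != 0), None)
        if piv is None:
            continue
        rows[rk], rows[piv] = rows[piv], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][c] != 0:
                f = rows[i][c] / rows[rk][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def is_vertex(lam, n):
    """lam maps each deterministic channel D (a tuple) to its weight; n = output size."""
    support = [D for D, w in lam.items() if w > 0]
    # One incidence row per D in the support, flattened over the pairs (i, j).
    incidence = [[1 if D[i] == j else 0 for i in range(len(D)) for j in range(n)]
                 for D in support]
    return rank(incidence) == len(support)

half, quarter = Fraction(1, 2), Fraction(1, 4)
print(is_vertex({(0, 0): half, (1, 1): half}, 2))
print(is_vertex({D: quarter for D in [(0, 0), (0, 1), (1, 0), (1, 1)]}, 2))
```

All three distributions used here and in the accompanying checks decompose the uniform $2\times 2$ channel. The two with two-point support pass the rank test, while the uniform distribution over all four deterministic channels does not, since its four incidence rows satisfy one linear relation.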

Below are several easy consequences of Proposition A.2.

Proposition A.3.

Let

[TABLE]

A probability distribution $\lambda\in\operatorname{dec}(W)$ is a vertex iff $\operatorname{supp}(\lambda)$ is minimal in $\mathfrak{T}$ , where a minimal pattern in $\mathfrak{T}$ is a set $T\subseteq\mathcal{D}$ such that $T=\operatorname{supp}(\alpha)$ for some $\alpha\in\operatorname{dec}(W)$ and for every $\beta\in\operatorname{dec}(W)$ , $\operatorname{supp}(\beta)\subseteq T$ implies $\beta=\alpha$ .

Proposition A.4.

If $\lambda\in\operatorname{dec}(W)$ is a vertex, then

[TABLE]

Sketch of Proof.

Because of (11), the equations $\alpha I=0$ have at most $m(n-1)+1$ linearly independent equations. This number can be further reduced to $\operatorname{wt}(W)-m+1$ by utilizing the information of $W$ , because all the variables $\alpha_{D}$ with $D_{i,j}=1$ must be zero if the equation $\sum_{D\in\mathcal{D}}\alpha_{D}D_{i,j}=W_{i,j}=0$ . The remaining part of the proof is then straightforward. ∎

Proposition A.4 provides an upper bound for the support size of a vertex in $\operatorname{dec}(W)$ . On the other hand, the following result provides a lower bound for the support size of points in $\operatorname{dec}(W)$ , including all the vertices of $\operatorname{dec}(W)$ .

Proposition A.5.

For any $\lambda\in\operatorname{dec}(W)$ ,

[TABLE]

where $s={\left|\{W_{i,j}\}_{i\in[\![1,m]\!],j\in[\![1,n]\!]}\right|}$ .

Proof.

By the definition of $\operatorname{dec}(W)$ , we have

[TABLE]

Since $D_{i,j}$ is either $0$ or $1$ , the right-hand side can yield at most $2^{\operatorname{wt}(\lambda)}$ different values, so that

[TABLE]

or $\operatorname{wt}(\lambda)\geq{\left\lceil\log_{2}s\right\rceil}$ .

On the other hand, every equation

[TABLE]

must have at least one positive $\lambda_{D}$ for some

[TABLE]

Since for every $i$ , the sets $\mathfrak{D}_{i,1}$ , $\mathfrak{D}_{i,2}$ , …, $\mathfrak{D}_{i,n}$ are mutually disjoint, we conclude that $\operatorname{wt}(\lambda)\geq\operatorname{wt}(W_{i,*})$ . ∎
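The two support-size bounds can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the lower bound is taken as the larger of ${\left\lceil\log_{2}s\right\rceil}$ and $\max_{i}\operatorname{wt}(W_{i,*})$ , which is what the proof above establishes, and the upper bound for vertices is taken as $\operatorname{wt}(W)-m+1$ , which is read off the proof sketch of Proposition A.4 rather than from its displayed statement.

```python
from fractions import Fraction

def support_lower_bound(W):
    """Lower bound on wt(lambda) for any lambda in dec(W), following Proposition A.5."""
    s = len({x for row in W for x in row})        # number of distinct entry values
    by_values = (s - 1).bit_length()              # exact ceil(log2(s)) for s >= 1
    by_rows = max(sum(1 for x in row if x != 0) for row in W)
    return max(by_values, by_rows)

def vertex_support_upper_bound(W):
    """Assumed form wt(W) - m + 1 of the vertex bound in Proposition A.4."""
    wt = sum(1 for row in W for x in row if x != 0)
    return wt - len(W) + 1

W = [[Fraction(3, 4), Fraction(1, 4)], [Fraction(1), Fraction(0)]]
print(support_lower_bound(W), vertex_support_upper_bound(W))
```

For this $W$ both bounds agree, so every vertex of $\operatorname{dec}(W)$ has exactly that many deterministic channels in its support.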

Algorithm A.6.

Let $f$ be an arbitrary one-to-one map of $[\![1,n^{m}]\!]$ onto $\mathcal{D}$ . The following algorithm with $W$ and $f$ as arguments yields a vertex of $\operatorname{dec}(W)$ .

function vertex( $W,f$ )

$\lambda\leftarrow 0$ , $K\leftarrow W$ , $i\leftarrow 1$

while $K\neq 0$ and $1\leq i\leq n^{m}$ do

$D\leftarrow f(i)$

$\lambda_{D}\leftarrow\min_{1\leq r\leq m}K_{r,D(r)}$

$K\leftarrow K-\lambda_{D}D$

$i\leftarrow i+1$

end while

return $\lambda$

end function

Sketch of Proof.

Let $\lambda$ be the vertex output by the algorithm. Let $S=\operatorname{supp}(\lambda)$ . Then by checking Algorithm A.6, it is easy to verify that for every $D\in S$ , there exists an $i\in[\![1,m]\!]$ such that $I_{D,(i,D(i))}=1$ and $I_{D^{\prime},(i,D(i))}=0$ for all $D^{\prime}\in S$ with $f^{-1}(D^{\prime})>f^{-1}(D)$ , so that $\operatorname{rank}(I_{S,*})={\left|S\right|}$ . ∎
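Algorithm A.6 translates almost line by line into Python. In the sketch below a deterministic channel $D$ is a tuple with `D[r]` equal to the output assigned to input `r` (0-based), the enumeration $f$ is passed as an iterable `order` (defaulting to lexicographic order, which is a choice made here and not prescribed by the paper), and exact rational arithmetic is used so that the test $K\neq 0$ is reliable.

```python
from fractions import Fraction
from itertools import product

def vertex(W, order=None):
    """Greedy decomposition of the stochastic matrix W following Algorithm A.6.

    Returns a dict mapping each deterministic channel in the support to its weight.
    """
    m, n = len(W), len(W[0])
    K = [[Fraction(x) for x in row] for row in W]    # residual matrix, K <- W
    if order is None:
        order = product(range(n), repeat=m)          # all n**m deterministic channels
    lam = {}
    for D in order:
        if all(x == 0 for row in K for x in row):    # stop as soon as K = 0
            break
        w = min(K[r][D[r]] for r in range(m))        # lambda_D <- min_r K[r, D(r)]
        if w > 0:
            lam[D] = w
            for r in range(m):                       # K <- K - lambda_D * D
                K[r][D[r]] -= w
    return lam

half = Fraction(1, 2)
print(vertex([[half, half], [half, half]]))
print(vertex([[half, half], [half, half]], order=[(0, 1), (1, 0), (0, 0), (1, 1)]))
```

The two calls illustrate how the choice of $f$ steers the output: the default order extracts the two constant channels first, while the second order extracts the two permutations, and both outputs are vertices of $\operatorname{dec}(W)$ for the uniform $2\times 2$ channel.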

Remark A.7.

We can replace the map $f$ in Algorithm A.6 with some one-to-one map $f^{\prime}\colon[\![1,\ell]\!]\to\mathcal{D}$ , where $1\leq\ell<n^{m}$ . Then we have a modified algorithm returning a pair $(\lambda^{\prime},K)$ such that

[TABLE]

Suppose the nontrivial case $K\neq 0$ , so that $\alpha=\sum_{D\in\mathcal{D}}\lambda^{\prime}_{D}<1$ . Let $W^{\prime}=K/(1-\alpha)$ . If we have another algorithm to find a vertex of $\operatorname{dec}(W^{\prime})$ , say $\lambda^{\prime\prime}$ , then it is easy to show that $\lambda=\lambda^{\prime}+(1-\alpha)\lambda^{\prime\prime}$ is a vertex of $\operatorname{dec}(W)$ .

Appendix B Properties of $J_{f}$ and $C_{f}$

This section provides some basic results on the analytic properties of $J_{f}$ and $C_{f}$ defined in Section 3. For any $p,p^{\prime}\in\mathcal{P}_{A}$ ,

[TABLE]

is called the statistical distance on $\mathcal{P}_{A}$ . Given the product space $(A,\operatorname{d}_{A})\times(B,\operatorname{d}_{B})$ , we define its product metric by

[TABLE]

which induces the usual product topology. Thus for any channels $W,W^{\prime}\in\mathcal{W}_{A,B}$ , we have the channel distance

[TABLE]

Proposition B.1.

(a) $J_{10}(\lambda,\mu)$ is uniformly continuous, and it is convex in $\lambda$ for fixed $\mu$ and is concave in $\mu$ for fixed $\lambda$ .

(b) $J_{01}(\lambda,\mu)$ is uniformly continuous, and it is linear in $\lambda$ for fixed $\mu$ and is concave in $\mu$ for fixed $\lambda$ .

Proof.

(a) The function $J_{10}(\lambda,\mu)$ can be rewritten as $I(\mu,g(\lambda))$ where

[TABLE]

with $V(u)=(D_{u(D),y})_{D\in\mathcal{D},y\in\mathcal{Y}}$ . By Proposition B.4, for $\lambda,\lambda^{\prime}\in\mathcal{P}_{\mathcal{D}}$ ,

[TABLE]

so that $g$ is uniformly continuous, and hence $J_{10}(\lambda,\mu)$ is uniformly continuous (Proposition B.6). It is also clear that $g$ is linear, so that $J_{10}(\lambda,\mu)$ is convex for fixed $\mu$ and is concave for fixed $\lambda$ ([8, Theorem 2.7.4]).

(b) The function $J_{01}(\lambda,\mu)$ can be written as $\lambda(g(\mu))^{\mathsf{T}}$ where $g(\mu)=(I(\mu,D))_{D\in\mathcal{D}}$ . By Propositions B.3 and B.4, $I(\mu,D)$ is uniformly continuous on $\mathcal{P}_{\mathcal{X}}$ and is bounded by $\log(|\mathcal{X}|\wedge|\mathcal{Y}|)$ . Then for $\lambda,\lambda^{\prime}\in\mathcal{P}_{\mathcal{D}}$ and $\mu,\mu^{\prime}\in\mathcal{P}_{\mathcal{X}}$ , we have

[TABLE]

which implies that $J_{01}$ is uniformly continuous. The remaining part is straightforward ([8, Theorem 2.7.4]). ∎

Proposition B.2.

For $f\in\{10,01,11\}$ , $\operatorname{C}_{f}(\lambda)$ is uniformly continuous and convex (and in fact linear for $f=11$ ).

Sketch of Proof.

Use Propositions B.1 and B.7 for $f=10$ or $01$ . The case of $f=11$ is trivial because $\operatorname{C}_{11}(\lambda)$ is a linear function of $\lambda$ . ∎

Proposition B.3 ([9, Theorem 2]).

For $\mu,\mu^{\prime}\in\mathcal{P}_{A}$ and $W,W^{\prime}\in\mathcal{W}_{A,B}$ ,

[TABLE]

where $\delta=\operatorname{d}(\operatorname{diag}(\mu)W,\operatorname{diag}(\mu^{\prime})W^{\prime})$ .

Proposition B.4 (cf. [10, Lemma 3]).

For $\mu,\mu^{\prime}\in\mathcal{P}_{A}$ and $W\in\mathcal{W}_{A,B}$ ,

[TABLE]

and

[TABLE]

Proposition B.5 (cf. [10, Lemma 3]).

For $\mu,\mu^{\prime}\in\mathcal{P}_{A}$ and $W,W^{\prime}\in\mathcal{W}_{A,B}$ ,

[TABLE]

so that $I(\mu,W)$ is uniformly continuous on $(\mathcal{P}_{A}\times\mathcal{W}_{A,B},\operatorname{d_{\vee}})$ .

Sketch of Proof.

Use the triangle inequality and Propositions B.3, B.4 and B.8. ∎

Proposition B.6.

Let $g$ be a map from $\mathcal{P}_{C}$ to $\mathcal{W}_{A,B}$ . If $g$ is uniformly continuous, then $I(\mu,g(\lambda))$ is uniformly continuous on $(\mathcal{P}_{A}\times\mathcal{P}_{C},\operatorname{d_{\vee}})$ , where $\mu\in\mathcal{P}_{A}$ and $\lambda\in\mathcal{P}_{C}$ .

Sketch of Proof.

Use Propositions B.3 and B.5 and the observation that $I(\mu,g(\lambda))$ is a composition of uniformly continuous maps. ∎

Proposition B.7.

If $g\colon A\times B\to\mathbf{R}$ is uniformly continuous on $(A\times B,\operatorname{d_{\vee}})$ , then $f(x)=\sup_{b\in B}g(x,b)$ is uniformly continuous.

Proof.

Since $g$ is uniformly continuous, for any $\epsilon>0$ , there is a $\delta>0$ such that for any $a,a^{\prime}\in A$ and any $b\in B$ , $\operatorname{d_{\vee}}((a,b),(a^{\prime},b))<\delta$ implies $|g(a,b)-g(a^{\prime},b)|<\epsilon$ . In other words, for any $b\in B$ , $\operatorname{d}_{A}(a,a^{\prime})<\delta$ implies $|g(a,b)-g(a^{\prime},b)|<\epsilon$ . Then

[TABLE]

and similarly, $\sup_{b\in B}g(a^{\prime},b)-\sup_{b\in B}g(a,b)<\epsilon$ , so that $f(x)$ is uniformly continuous. ∎

Proposition B.8.

For $\mu\in\mathcal{P}_{A}$ and $W,W^{\prime}\in\mathcal{W}_{A,B}$ ,

[TABLE]

Proof.

[TABLE]

∎

Appendix C Proofs and Examples of Section 4.1

Proof of Proposition 4.4.

Let $W=(P_{1}+\cdots+P_{\ell})/\ell$ and $j$ be the column such that $\operatorname{wt}(W_{*,j})>1$ . It is clear that $W=xD+(1-x)W^{\prime}$ for some $x\in(0,1)$ and some $D\in\mathcal{D}$ such that $\operatorname{wt}(D_{*,j})>1$ , so that $\operatorname{\underline{\operatorname{IC}}}_{11}(W)<\log m$ , and hence $P_{1}$ , …, $P_{\ell}$ are not $\operatorname{IC}$ -minimized. ∎

Proof of Proposition 4.5.

Let $W=(P_{1}+\cdots+P_{\ell})/\ell$ and $j$ be the column of which all entries are less than $1$ . It is clear that $W=xD+(1-x)W^{\prime}$ for some $x\in(0,1)$ and some $D\in\mathcal{D}$ such that $\operatorname{wt}(D_{*,j})=0$ , so that $\operatorname{\underline{\operatorname{IC}}}_{11}(W)<\log n$ , and hence $P_{1}$ , …, $P_{\ell}$ are not $\operatorname{IC}$ -minimized. ∎

Proof of Proposition 4.6.

With no loss of generality, we assume that $D_{m,j}=0$ . It is then clear that

[TABLE]

where

[TABLE]

and

[TABLE]

It is clear that $\operatorname{rank}(D^{\prime})\geq(\operatorname{rank}(D)-1)\vee 2$ and $\operatorname{rank}(D^{\prime\prime})=2$ , so that

[TABLE]

and therefore $\{D,\mathrm{U}_{j}\}$ is not $\operatorname{IC}$ -maximized. ∎

Example C.1.

If

[TABLE]

which is the probability transition matrix seen in the well-known random binning scheme, then $\operatorname{\underline{\operatorname{IC}}}_{11}(W)=0$ and $\operatorname{\overline{\operatorname{IC}}}_{11}(W)=\log(m\wedge n)$ (Theorems 3.3 and 3.4).
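The two extremes in Example C.1 can be reproduced numerically for a small case. The sketch assumes that $W$ is the uniform matrix with all entries $1/n$ (here $m=n=3$ ) and that $\operatorname{C}_{11}(\lambda)=\sum_{D}\lambda_{D}\log\operatorname{rank}(D)$ with logarithms to base 2, where the rank of a deterministic channel is its number of distinct outputs; the helper names are illustrative.

```python
from fractions import Fraction
from math import log2

def mix(lam, m, n):
    """Convex combination sum_D lam[D] * D as an m x n matrix of Fractions."""
    W = [[Fraction(0)] * n for _ in range(m)]
    for D, w in lam.items():
        for i in range(m):
            W[i][D[i]] += w
    return W

def c11(lam):
    """Assumed form of C_11: sum_D lam[D] * log2(number of distinct outputs of D)."""
    return sum(float(w) * log2(len(set(D))) for D, w in lam.items())

third = Fraction(1, 3)
useless = {(j, j, j): third for j in range(3)}                   # constant channels
cyclic = {(0, 1, 2): third, (1, 2, 0): third, (2, 0, 1): third}  # cyclic permutations

uniform = [[third] * 3 for _ in range(3)]
print(mix(useless, 3, 3) == uniform, mix(cyclic, 3, 3) == uniform)
print(c11(useless), c11(cyclic))
```

Both distributions decompose the same uniform channel, yet the first yields zero and the second yields $\log 3$ , matching the values stated in the example for $m=n=3$ .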

Example C.2.

[TABLE]

It can be computed using linear programming that $\operatorname{\underline{\operatorname{IC}}}_{11}(W)=0.4$ and $\operatorname{\overline{\operatorname{IC}}}_{11}(W)=0.2+0.8\log 3\approx 1.4680$ . The decompositions of $W$ for $\operatorname{\underline{\operatorname{IC}}}_{11}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{11}(W)$ are

[TABLE]

and

[TABLE]

respectively. Using Theorems 3.3 and 3.4 and Proposition 4.10, we have

[TABLE]

and

[TABLE]

From Proposition 4.4, it follows that the optimal decomposition $\lambda^{\prime}$ for $\operatorname{\underline{\operatorname{IC}}}_{11}(W^{\prime})$ can have at most one perfect channel, so that $\operatorname{\Gamma}_{\lambda^{\prime}}(3)\leq 0.25$ , where

[TABLE]

is computed by the formula in Theorem 3.3. Then we have an improved bound: $\operatorname{\underline{\operatorname{IC}}}_{11}(W)\leq 0.4\operatorname{\underline{\operatorname{IC}}}_{11}(W^{\prime})=0.3+0.1\log 3\approx 0.4585$ .

Appendix D Capacity-Achieving Input Probability Distributions

Let $W$ be a channel in $\mathcal{W}_{\mathcal{X},\mathcal{Y}}$ . According to [11, Theorem 4.5.1], an input probability distribution $\mu$ maximizes the mutual information $I(\mu,W)$ iff

[TABLE]

and

[TABLE]

where $\tau=\mu W$ . Based on this necessary and sufficient condition, we have the following results concerning the support of capacity-achieving input probability distributions. In the sequel, we denote by $\operatorname{conv}(V)$ the convex hull of all vectors in $V$ .
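The optimality condition can be observed numerically. The sketch below uses the standard Blahut-Arimoto iteration, which is a swapped-in numerical method and not something the paper prescribes, to approximate a capacity-achieving $\mu$ , and then reports the divergences $D(W_{x,*}\|\tau)$ so that they can be compared with the capacity. Logarithms are to base 2, and the iteration count is an arbitrary choice.

```python
from math import log2

def kl(p, q):
    """Relative entropy D(p || q) in bits; terms with p[y] = 0 contribute zero."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)

def capacity(W, iters=500):
    """Blahut-Arimoto iteration. Returns (C, mu, d) with d[x] = D(W[x] || mu W)."""
    m, n = len(W), len(W[0])
    mu = [1.0 / m] * m
    for _ in range(iters):
        tau = [sum(mu[x] * W[x][y] for x in range(m)) for y in range(n)]
        weights = [mu[x] * 2 ** kl(W[x], tau) for x in range(m)]
        total = sum(weights)
        mu = [w / total for w in weights]
    tau = [sum(mu[x] * W[x][y] for x in range(m)) for y in range(n)]
    d = [kl(W[x], tau) for x in range(m)]
    return sum(mu[x] * d[x] for x in range(m)), mu, d

# The third row is the midpoint of the first two, so it lies in their convex hull.
C, mu, d = capacity([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(C, mu, d)
```

In this example the divergences of the first two rows agree with the capacity, the third is strictly smaller, and the mass on the third input vanishes, in line with Proposition D.1 applied to $A=\{1,2\}$ .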

Proposition D.1.

Let $A\subseteq\mathcal{X}$ . If all row vectors of $W$ are contained in $\operatorname{conv}(\{W_{x,*}\}_{x\in A})$ , then there exists a capacity-achieving probability distribution $\mu$ such that $\operatorname{supp}(\mu)\subseteq A$ .

Proof.

Let $\nu$ be a capacity-achieving probability distribution of the submatrix $W_{A,*}$ . Extending $\nu$ with zero values, we obtain a probability distribution $\mu$ over $\mathcal{X}$ . It is clear that

[TABLE]

and

[TABLE]

where $\tau=\mu W=\nu W_{A,*}$ . It remains to show that

[TABLE]

which is obvious, because

[TABLE]

for some nonnegative coefficients $(\alpha_{a})_{a\in A}$ with $\sum_{a\in A}\alpha_{a}=1$ . ∎

Proposition D.2.

Let $\mu$ be a capacity-achieving probability distribution of $W$ and let $A=\operatorname{supp}(\mu)$. For any $a\in A$ and any $b\notin A$, $W_{a,*}\notin\operatorname{conv}(\{W_{x,*}\}_{x\in A\cup\{b\}}\setminus\{W_{a,*}\})$.

Proof.

It is clear that $D(W_{x,*}\|\tau)=C$ for all $x\in A$, where $\tau=\mu W$. We first show that $W_{a,*}\notin\operatorname{conv}(\{W_{x,*}\}_{x\in A}\setminus\{W_{a,*}\})$, which corresponds to the case $W_{b,*}=W_{a,*}$. If this is false, then

[TABLE]

where $A^{\prime}={\left\{x\in A\colon W_{x,*}\neq W_{a,*}\right\}}$, $\alpha_{x}\geq 0$, and $\sum_{x\in A^{\prime}}\alpha_{x}=1$. It is clear that $\alpha_{x}<1$ for all $x\in A^{\prime}$, so that

[TABLE]

a contradiction. Now suppose that

[TABLE]

for some $b\notin A$ with $W_{b,*}\neq W_{a,*}$. Let $A^{\prime\prime}=A^{\prime}\cup\{b\}$. Then

[TABLE]

where $\alpha_{x}\geq 0$ and $\sum_{x\in A^{\prime\prime}}\alpha_{x}=1$. It is clear that $0<\alpha_{b}<1$, and therefore

[TABLE]

so that $D(W_{b,*}\|\tau)>C$, which is absurd. ∎
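Both contradictions in this proof can be obtained from the strict convexity of $p\mapsto D(p\|\tau)$. The following is a sketch of that argument under this assumption, using the symbols $A^{\prime}$, $A^{\prime\prime}$, and $\alpha_{x}$ defined above; it is reconstructed from the surrounding text and may differ in form from the paper's own derivation.

```latex
% Case 1: W_{a,*} = \sum_{x \in A'} \alpha_x W_{x,*} with every \alpha_x < 1,
% so at least two distinct rows carry positive weight.
C = D(W_{a,*} \,\|\, \tau)
  < \sum_{x \in A'} \alpha_x \, D(W_{x,*} \,\|\, \tau)
  = \sum_{x \in A'} \alpha_x \, C
  = C .

% Case 2: W_{a,*} = \sum_{x \in A''} \alpha_x W_{x,*} with 0 < \alpha_b < 1.
C = D(W_{a,*} \,\|\, \tau)
  < \sum_{x \in A'} \alpha_x \, D(W_{x,*} \,\|\, \tau)
    + \alpha_b \, D(W_{b,*} \,\|\, \tau)
  = (1 - \alpha_b) \, C + \alpha_b \, D(W_{b,*} \,\|\, \tau) .
```

In the second case, subtracting $(1-\alpha_{b})C$ from both sides and dividing by $\alpha_{b}>0$ gives $D(W_{b,*}\|\tau)>C$, which is incompatible with $\mu$ being capacity-achieving.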

Appendix E Counterexamples for Section 4.2

Example E.1.

$\operatorname{\underline{\operatorname{IC}}}_{10}(W)>\operatorname{C}(W)$ for

[TABLE]

Proof.

Let $S={\left\{D\in\mathcal{D}\colon D(1)\in\{1,2\},D(2)=3\right\}}$. It is then clear that, for every $\lambda\in\operatorname{dec}(W)$,

[TABLE]

If we define the map $u\colon\mathcal{D}\to[\![1,2]\!]$ by

[TABLE]

then the row vector $v=(\sum_{D}\lambda_{D}D_{u(D),y})_{y\in[\![1,3]\!]}$ always lies on the line segment $L$ with endpoints $(0.65,0.35,0)$ and $(0.6,0.4,0)$.

By numerical computation, we know that

[TABLE]

where

[TABLE]

and

[TABLE]

is the capacity-achieving input distribution of $W$. Furthermore, it can be verified that all points $x$ of $L$ satisfy

[TABLE]

This implies that $\mu$, if extended to $[\![1,2]\!]^{\mathcal{D}}$, cannot be a capacity-achieving distribution ([11, Theorem 4.5.1]). In other words, for every $\lambda\in\operatorname{dec}(W)$, the intrinsic capacity satisfies $\operatorname{C}_{10}(\lambda)>\operatorname{C}(W)$, so that $\operatorname{\underline{\operatorname{IC}}}_{10}(W)>\operatorname{C}(W)$. ∎

Example E.2.

Let the state alphabet be $\mathcal{S}=[\![1,2]\!]$ and let

[TABLE]

where

[TABLE]

and

[TABLE]

It is easy to show that $\mu=(0.603123,0.396877)$ is the capacity-achieving input distribution for $W$, so that the output distribution is

[TABLE]

and $D(\delta\|\tau)=D(\gamma\|\tau)\approx 0.0238286$. However, for the channel $V\colon[\![1,2]\!]^{\mathcal{S}}\to[\![1,3]\!]$ given by

[TABLE]

if we choose the map $u(s)=s$, then the corresponding row vector

[TABLE]

and $D(\zeta\|\tau)\approx 0.0246518>D(\gamma\|\tau)$. This implies that $\mu$, if extended to $[\![1,2]\!]^{\mathcal{S}}$, cannot be a capacity-achieving distribution for $V$ ([11, Theorem 4.5.1]). In other words, the capacity of $W$ can be increased by the causal state information $S$ at the encoder.
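Example E.2 passes from $W$ to a channel $V$ whose inputs are maps $u\colon\mathcal{S}\to[\![1,2]\!]$. A minimal sketch of that construction follows, on a hypothetical toy state-dependent channel rather than the matrices of the example: in state 1 the channel is the identity and in state 2 it flips the input, each state having probability $1/2$. The row of $V$ indexed by $u$ is assumed to be $\sum_{s}P(s)\,W^{(s)}_{u(s),*}$, and capacities are estimated with a plain Blahut-Arimoto iteration; all function names are illustrative.

```python
import math
from itertools import product

def kl(p, q):
    """Relative entropy D(p || q) in bits, with 0 log 0 = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def output(mu, W):
    return [sum(m * row[y] for m, row in zip(mu, W)) for y in range(len(W[0]))]

def capacity(W, iters=500):
    """Blahut-Arimoto estimate of max_mu I(mu, W) in bits, from a uniform start."""
    n = len(W)
    mu = [1.0 / n] * n
    for _ in range(iters):
        tau = output(mu, W)
        weights = [m * 2.0 ** kl(row, tau) if m > 0 else 0.0
                   for m, row in zip(mu, W)]
        total = sum(weights)
        mu = [w / total for w in weights]
    tau = output(mu, W)
    return sum(m * kl(row, tau) for m, row in zip(mu, W) if m > 0)

def average_channel(state_probs, channels):
    """Channel seen without state information: sum_s P(s) W^(s)."""
    n_in, n_out = len(channels[0]), len(channels[0][0])
    return [[sum(p * ch[x][y] for p, ch in zip(state_probs, channels))
             for y in range(n_out)] for x in range(n_in)]

def strategy_channel(state_probs, channels):
    """Rows indexed by maps u: S -> X; row u is sum_s P(s) W^(s)[u(s), :]."""
    n_in, n_out = len(channels[0]), len(channels[0][0])
    return [[sum(p * channels[s][u[s]][y] for s, p in enumerate(state_probs))
             for y in range(n_out)]
            for u in product(range(n_in), repeat=len(state_probs))]

# Hypothetical toy example: identity in state 1, bit flip in state 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
flip = [[0.0, 1.0], [1.0, 0.0]]
probs = [0.5, 0.5]

cap_W = capacity(average_channel(probs, [identity, flip]))
cap_V = capacity(strategy_channel(probs, [identity, flip]))
print(cap_W, cap_V)
```

For this toy channel the averaged matrix carries no information, while the strategy channel reaches one bit, an extreme instance of causal state information at the encoder increasing capacity.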

Appendix F $\operatorname{\underline{\operatorname{IC}}}_{01}(W)$ and $\operatorname{\overline{\operatorname{IC}}}_{01}(W)$

Proof of Proposition 3.6.

Because $m=2$, the binary uniform distribution is capacity-achieving for every deterministic channel, rank $1$ or rank $2$. Thus we have $\operatorname{C}_{01}(\lambda)=\operatorname{C}_{11}(\lambda)$ for every $\lambda\in\operatorname{dec}(W)$. The remaining part is an easy consequence of Theorems 3.3 and 3.4. ∎

Proposition F.1.

Let $W$ be a channel $[\![1,3]\!]\to[\![1,2]\!]$. If all probabilities $W_{i,j}$ are distinct and the sum of each column of $W$ is greater than or equal to $1$, then $\operatorname{\overline{\operatorname{IC}}}_{01}(W)<\operatorname{\overline{\operatorname{IC}}}_{11}(W)$.

Proof.

By Proposition 4.10, $\operatorname{\overline{\operatorname{\Gamma}}}_{W}(2)=1$, so that $W$ can be expressed as a convex combination of perfect channels and hence $\operatorname{\overline{\operatorname{IC}}}_{11}(W)=1$.

Let

[TABLE]

If $\operatorname{\overline{\operatorname{IC}}}_{01}(W)=1$, then there exists a $\lambda\in S$ such that the capacity-achieving input distribution, denoted $\mu$, is capacity-achieving for every perfect channel $D\in\operatorname{supp}(\lambda)$. Thus at least one entry of $\mu$ must be $1/2$. With no loss of generality, we assume $\mu_{1}=1/2$.

If $\mu_{2}$ and $\mu_{3}$ are both positive, then $\mu$ is capacity-achieving only for perfect channels

[TABLE]

By Proposition A.5, every $\lambda\in\operatorname{dec}(W)$ satisfies $|\operatorname{supp}(\lambda)|\geq{\left\lceil\log_{2}6\right\rceil}=3$, which implies that $\mu$ is not capacity-achieving for $\lambda\in S$.

If $\mu_{2}=0$ (the case $\mu_{3}=0$ is symmetric), then $\mu$ is capacity-achieving for perfect channels

[TABLE]

However, any convex combination of these four matrices can only yield a channel matrix with at most four distinct probability values, and hence $\mu$ is not capacity-achieving for $\lambda\in S$.

In all cases, we have shown that $\mu$ is not capacity-achieving, which contradicts the assumption $\operatorname{\overline{\operatorname{IC}}}_{01}(W)=1$. Therefore, we have $\operatorname{\overline{\operatorname{IC}}}_{01}(W)<1=\operatorname{\overline{\operatorname{IC}}}_{11}(W)$. ∎
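The case analysis in this proof can be checked by enumeration. The sketch below lists the perfect channels $[\![1,3]\!]\to[\![1,2]\!]$, taken here to mean the surjective deterministic maps, and counts those for which a given $\mu$ with $\mu_{1}=1/2$ induces a uniform output; for such a channel this is equivalent to attaining the capacity of one bit. The particular distributions and function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def perfect_channels(n_inputs=3, n_outputs=2):
    """Surjective deterministic maps, each written as a tuple of output labels."""
    return [d for d in product(range(n_outputs), repeat=n_inputs)
            if len(set(d)) == n_outputs]

def achieves_capacity(mu, d, n_outputs=2):
    """True iff mu induces the uniform output distribution through the map d."""
    out = [sum(m for m, y in zip(mu, d) if y == k) for k in range(n_outputs)]
    return all(o == Fraction(1, n_outputs) for o in out)

channels = perfect_channels()

# Case mu_2, mu_3 > 0 (hypothetical values summing to 1/2).
mu_interior = (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))
good_interior = [d for d in channels if achieves_capacity(mu_interior, d)]

# Case mu_2 = 0.
mu_boundary = (Fraction(1, 2), Fraction(0), Fraction(1, 2))
good_boundary = [d for d in channels if achieves_capacity(mu_boundary, d)]

print(len(channels), len(good_interior), len(good_boundary))
```

With $\mu_{2}=0$, each of the four surviving channels sends inputs 1 and 3 to different outputs, so rows 1 and 3 of any convex combination of them are complementary; this is why at most four distinct entries can appear.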

Bibliography

[1] A. A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge; New York: Cambridge University Press, 2011.
[2] J. Wang, J. Chen, L. Zhao, P. Cuff, and H. Permuter, “On the role of the refinement layer in multiple description coding and scalable coding,” IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1443–1456, Mar. 2011.
[3] H. Nikaidô, “On von Neumann’s minimax theorem,” Pacific Journal of Mathematics, vol. 4, no. 1, pp. 65–72, Mar. 1954.
[4] W. Jurkat and H. Ryser, “Extremal configurations and decomposition theorems. I,” Journal of Algebra, vol. 8, no. 2, pp. 194–222, Feb. 1968.
[5] R. M. Caron, X. Li, P. Mikusiński, H. Sherwood, and M. D. Taylor, “Nonsquare ‘doubly stochastic’ matrices,” in Institute of Mathematical Statistics Lecture Notes - Monograph Series. Hayward, CA: Institute of Mathematical Statistics, 1996, pp. 65–75.
[6] R. Xu, J. Chen, T. Weissman, and J.-K. Zhang, “When is noisy state information at the encoder as useless as no information or as good as noise-free state?” IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 960–974, Feb. 2017.
[7] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization, ser. Athena Scientific optimization and computation. Belmont, Mass: Athena Scientific, 2003, no. 1.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, N.J: Wiley-Interscience, 2006.
