Intrinsic Capacity
Shengtian Yang, Rui Xu, Jun Chen, Jian-Kang Zhang

TL;DR
This paper investigates the capacity limits of channels with intrinsic states when causal state information is available at the encoder and/or decoder, providing new theoretical insights and specific results for binary channels.
Contribution
It introduces a framework for analyzing channel capacities with intrinsic states and causal information, including generalizations of key theorems and conditions for the usefulness of state information.
Findings
Maximum and minimum capacities for binary channels are characterized.
A generalization of the Birkhoff-von Neumann theorem is presented.
Conditions under which causal state information is useless are identified.
Abstract
Every channel can be expressed as a convex combination of deterministic channels with each deterministic channel corresponding to one particular intrinsic state. Such convex combinations are in general not unique, each giving rise to a specific intrinsic-state distribution. In this paper we study the maximum and the minimum capacities of a channel when the realization of its intrinsic state is causally available at the encoder and/or the decoder. Several conclusive results are obtained for binary-input channels and binary-output channels. Byproducts of our investigation include a generalization of the Birkhoff-von Neumann theorem and a condition on the uselessness of causal state information at the encoder.
This work was supported in part by the National Natural Science Foundation of China under Grant 61571398 and in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada under a Discovery Grant. This paper is to be presented in part at the 2017 IEEE International Symposium on Information Theory.
S. Yang is with the School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China, and was also with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada.
R. Xu, J. Chen, and J.-K. Zhang are with the Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4K1, Canada.
Index Terms — Birkhoff-von Neumann theorem, channel capacity, deterministic channel, state information.
1 Introduction
A discrete channel is commonly viewed as a black box with the input-output relation characterized by a stochastic matrix. In practice, it is often possible to obtain some additional information (known as the state information) by probing the channel. The knowledge of the state information might be useful in increasing the channel capacity. Note that, given each state, the channel can again be viewed as a black box and can potentially be further probed. One may continue this process until the black box is fully opened, i.e., the channel becomes deterministic given the acquired state information. This line of thought suggests that every channel has its own intrinsic state, which fully captures the randomness of the channel, and any state information acquired via channel probing is a degenerate version of this intrinsic state. As such, the intrinsic capacity, defined as the capacity of a channel when its intrinsic state is revealed, determines the ultimate capacity gain one can hope for by probing the channel.
It turns out that the intrinsic capacity of a channel is not necessarily uniquely defined. Consider a binary symmetric channel with crossover probability $1/2$, i.e., the channel whose transition matrix has every entry equal to $1/2$, with each entry denoting the conditional probability of an output given an input. The capacity of this channel is clearly zero. For this channel, we consider the following two models:
$$\text{Model 1: } Y = X \oplus Z, \qquad \text{Model 2: } Y = Z,$$
where $\oplus$ denotes modulo-$2$ addition and $Z$ is uniformly distributed over $\{0,1\}$. It is easy to verify that both models have the same conditional probability distribution, namely that of the channel above. If the actual model is the first one, then for every realization of $Z$, the channel becomes a deterministic perfect channel (the identity map or the bit flip), so that its capacity with $Z$ available at the encoder and/or the decoder increases to one. On the other hand, if the actual model is the second one, then for every realization of $Z$, the channel becomes a deterministic useless channel (a constant map), and hence, even with $Z$ known at both sides, its capacity is still zero. In fact, it will be seen that, for every number $c \in [0,1]$, one can find a model for this channel such that the resulting intrinsic capacity is $c$.
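The two models can be checked numerically. The sketch below is illustrative only and rests on assumptions of ours: the first model is read as the input added modulo 2 to a uniform bit, the second as the output being that uniform bit itself, each deterministic binary channel is encoded as a tuple `f` with `f[x]` the output for input `x`, and the capacity with the state known at both ends is taken to be the standard expression $\sum_s P(s)\log_2 |f_s(\mathcal{X})|$. The helper names (`mix`, `capacity_state_at_both`, `blended`) are hypothetical.

```python
from fractions import Fraction
from math import log2

def mix(decomposition, n_out=2):
    """Transition matrix of a mixture {map: weight} of deterministic channels."""
    n_in = len(next(iter(decomposition)))
    W = [[Fraction(0)] * n_out for _ in range(n_in)]
    for f, w in decomposition.items():
        for x, y in enumerate(f):
            W[x][y] += w
    return W

def capacity_state_at_both(decomposition):
    """Sum over states of weight * log2(number of distinct outputs of the map)."""
    return sum(float(w) * log2(len(set(f))) for f, w in decomposition.items())

half = Fraction(1, 2)
model_1 = {(0, 1): half, (1, 0): half}   # assumed reading: identity or bit flip
model_2 = {(0, 0): half, (1, 1): half}   # assumed reading: constant 0 or constant 1

def blended(c):
    """Weight c on model 1 and 1 - c on model 2 (c is a Fraction in [0, 1])."""
    d = {f: c * w for f, w in model_1.items()}
    d.update({f: (1 - c) * w for f, w in model_2.items()})
    return d

print(mix(model_1) == mix(model_2))      # both models give the same matrix
print(capacity_state_at_both(model_1))   # every state is a perfect channel
print(capacity_state_at_both(model_2))   # every state is a useless channel
```

Under the same assumptions, `blended(c)` still reproduces the all-$1/2$ matrix, and its two-sided capacity equals `c`, which is one way to see how intermediate values can arise.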
This example indicates that a channel may admit different decompositions into deterministic channels. All these decompositions are mathematically legitimate, though the actual way the deterministic channels are mixed to produce the given channel depends on the underlying physical mechanism. In this work we study the minimum and the maximum intrinsic capacities of a channel over all admissible decompositions. They will be referred to as the lower intrinsic capacity and the upper intrinsic capacity. For the aforementioned channel, the lower and upper intrinsic capacities are 0 and 1, respectively. Since the causal state information may be available at the encoder, the decoder, or both, there are in total three different notions of lower and upper intrinsic capacities of a channel, each indexed by a two-bit flag ($10$, $01$, or $11$) whose bits indicate whether the state information is available at the encoder and the decoder, respectively.
The main contributions of this work are:
We study the structure of the convex polytope , which consists of all convex combinations of deterministic channels for channel , with a particular focus on its vertices. It is shown that for all and are attained at certain vertices of (Theorem A.1).
We prove a generalization of the Birkhoff-von Neumann theorem for a family of channel matrices with integer-valued column-sum vector constraints and from below and above, respectively (Theorem 4.7). It is shown that is convex and its vertices are exactly all deterministic channels in . Using this fundamental result, we determine the exact values of and when the input or the output is binary. General lower and upper bounds are further provided for the nonbinary cases (Theorems 3.3 and 3.4), and in some cases, the exact value of is also determined.
We obtain the exact values of the lower and the upper intrinsic capacities with the state available at the encoder only when the channel is binary-output (Theorem 3.5), and the exact values of those with the state available at the decoder only (Proposition 3.6) when the channel is binary-input. An interesting phenomenon is that, for a binary-output channel, the lower intrinsic capacity with the state available at the encoder only equals the capacity of the channel itself. In other words, every binary-output channel can be generated through a certain mechanism such that the capacity remains the same if the source of randomness is causally revealed to the encoder. We further prove that the causal state information at the encoder is useless for a broad class of channels (Theorem 4.12). Finally, by providing some counterexamples, we show that such results are specific to binary-input or binary-output channels and do not hold in general (Example E.1 and Proposition F.1).
The rest of this paper is organized as follows. Section 2 lists some common notations used throughout this paper. Section 3 provides the definitions of various notions of (lower/upper) intrinsic capacity and a summary of the main results of this paper. The proofs and some other relevant findings are presented in Section 4 and the appendices.
2 Notations
Although most notations will be defined at their first occurrences, some common ones are listed here for easy reference.
The set of integers in the interval .
The set of all maps , or equivalently, the set of all indexed families (a generalized form of sequences). If , then degenerates to the Cartesian product . In this paper, a vector (for example, in ) will be regarded as a row vector, and an all- vector is usually denoted by .
The minimum of and .
The maximum of and .
The support set of .
The weight of .
The largest integer . If the argument is a sequence , then . The same convention also applies to other functions such as , , , and .
The smallest integer .
3 Definitions and Main Results
Let and be two finite sets. A channel is a stochastic matrix with each entry , or conventionally, denoting the probability of output given input . A deterministic channel is a special channel whose stochastic matrix is a zero-one matrix; as such, it uniquely identifies a map of into . In the sequel, deterministic channels and maps will be regarded as equivalent objects and denoted using the same notation.
It is clear that the set of all channels forms a convex polytope in . We denote this polytope by , or succinctly, . The deterministic channels are exactly the vertices of , and every channel can be expressed as a convex combination of them. This simple observation suggests that, for any channel, one can define a random state variable (referred to as the intrinsic state) given which the channel becomes deterministic. We are interested in characterizing the capacity of a channel when its intrinsic state is available at the encoder and/or the decoder. Such capacity results are of fundamental importance since they delineate the potential gain that can be achieved by probing the channel.
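The observation that every channel is a convex combination of deterministic channels can be made concrete with a simple peeling procedure. The sketch below is an illustration under our own conventions (a map is a tuple `f` with `f[x]` the output for input `x`; the function name `greedy_decomposition` is hypothetical) and is not an algorithm taken from the paper. It returns one valid decomposition; other selection rules generally give other intrinsic-state distributions.

```python
from fractions import Fraction

def greedy_decomposition(W):
    """Peel deterministic channels off a row-stochastic matrix W.

    Returns a list of (weight, f) pairs where f[x] is the output assigned to
    input x, the weights are positive and sum to 1, and the weighted sum of
    the zero-one matrices of the maps f equals W. Entries of W should be
    exact rationals (int or Fraction) so that the arithmetic is exact.
    """
    R = [[Fraction(v) for v in row] for row in W]   # residual matrix
    parts = []
    while any(v > 0 for v in R[0]):                 # all rows keep a common sum
        # in every row, pick a column holding the largest residual mass
        f = tuple(max(range(len(row)), key=row.__getitem__) for row in R)
        w = min(R[x][y] for x, y in enumerate(f))   # largest removable weight
        for x, y in enumerate(f):
            R[x][y] -= w                            # zeroes at least one entry
        parts.append((w, f))
    return parts

W = [[Fraction(1, 2), Fraction(1, 2), 0],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]]
for weight, f in greedy_decomposition(W):
    print(weight, f)
```

Each pass removes the same weight from every row, so the rows keep a common sum and the loop ends once that sum reaches zero; because at least one entry is zeroed per pass, the number of deterministic channels used is finite.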
For a given channel, there are often multiple ways to write it as a convex combination of deterministic channels; as a consequence, the distribution of its intrinsic state is in general not uniquely defined. Let (or simply ) denote the set of all deterministic channels . Then the set of all possible convex decompositions of a channel is given by
[TABLE]
where is the set of all probability distributions over and can be regarded as the set of matrices or vectors. For each intrinsic-state distribution , we define the resulting capacities when the intrinsic state is causally available at the encoder, the decoder, or both, by
[TABLE]
respectively (see [1, Chapter 7]), where
[TABLE]
and the flag indicates the availability of the intrinsic state at the encoder and the decoder. For example, means that the intrinsic state is available at the encoder but not at the decoder. For completeness, we also define the capacity with no encoder and decoder side information:
[TABLE]
Then, given a channel , we can define its intrinsic-capacity set by
[TABLE]
Furthermore, we define the lower intrinsic capacity and the upper intrinsic capacity of for by
[TABLE]
and
[TABLE]
respectively.
Remark 3.1.
Using the functional representation lemma [1, p. 626][2, Lemma 1], it can be easily shown that provides an upper bound on the capacity of with any form of state information whose availability at the encoder and the decoder is specified by . On the other hand, from the minimax theorem [3], Proposition B.1, and [1, Theorems 7.1 and 7.2, Eqs. (7.2) and (7.3), and Remark 7.6], it follows that is exactly the capacity of the compound channel with the availability of at the encoder and the decoder specified by , where is -valued, i.e., a random deterministic channel, and is selected arbitrarily from .
The main results of this paper are given as follows. With no loss of generality, we assume from now on that the channel is from to , where .
Definition 3.2.
Let
[TABLE]
be the rank probability function over induced by . The lower and the upper rank- probabilities of are then defined by
[TABLE]
respectively.
Bounds for and when and are given by Propositions 4.8 and 4.10, respectively. Most of our results will be expressed in terms of these quantities.
Theorem 3.3.
[TABLE]
[TABLE]
where
[TABLE]
[TABLE]
[TABLE]
and is the deterministic useless channel with its -th column being all one.
If or , then .
Theorem 3.4.
If or or , then
[TABLE]
otherwise,
[TABLE]
where
[TABLE]
[TABLE]
[TABLE]
If and , then .
If and , then .
Theorem 3.5.
If , then
[TABLE]
and
[TABLE]
Proposition 3.6.
If , then for every , , so that and .
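One way to see why a binary input makes decoder-side state information as good as two-sided state information: a deterministic channel with two inputs either separates them or merges them, so the uniform input maximizes the output entropy under every state simultaneously. The sketch below checks this numerically under assumptions that are ours rather than the paper's: the decoder-only capacity is taken to be $\max_{P_X}\sum_s P_S(s) H(f_s(X))$, the two-sided capacity to be $\sum_s P_S(s)\log_2|f_s(\mathcal{X})|$, and the state distribution `decomp` is a hypothetical example.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

def output_entropy(f, px, n_out):
    """H(f(X)) when X ~ px and f is a deterministic map given as a tuple."""
    py = [0.0] * n_out
    for x, p in enumerate(px):
        py[f[x]] += p
    return entropy(py)

def c_decoder_only(decomp, n_out, grid=1000):
    """Grid search over binary input distributions (q, 1 - q)."""
    return max(
        sum(w * output_entropy(f, (k / grid, 1 - k / grid), n_out)
            for f, w in decomp.items())
        for k in range(grid + 1)
    )

def c_both(decomp):
    """State known at both ends: average of log2(image size) over states."""
    return sum(w * log2(len(set(f))) for f, w in decomp.items())

# hypothetical state distribution over maps {0,1} -> {0,1,2}
decomp = {(0, 1): 0.4, (2, 0): 0.25, (1, 1): 0.2, (2, 2): 0.15}
print(c_decoder_only(decomp, 3), c_both(decomp))
```

The grid search is only a numerical check; the maximum is attained at the uniform input, which is on the grid.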
The above results enable us to obtain explicit characterizations of all lower and upper intrinsic capacities for binary-input binary-output channels. The relevant expressions are collected in the following example.
Example 3.7.
If and
[TABLE]
then
[TABLE]
[TABLE]
[TABLE]
[TABLE]
where is the binary entropy function. If is a binary symmetric channel with crossover probability (i.e., ), then
[TABLE]
[TABLE]
[TABLE]
If is a Z-channel with crossover probability (i.e., and ), then
[TABLE]
[TABLE]
The case of the Z-channel is special, because in this case the channel admits a unique convex decomposition into deterministic channels:
[TABLE]
The lower and the upper intrinsic capacities of these two special channels are plotted in Figs. 1 and 2.
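For binary input and binary output there are only four deterministic channels, so all decompositions can be described with one free parameter. The sketch below is our own worked illustration, not the paper's derivation: it parametrizes the channel as $[[1-a, a],[b, 1-b]]$ with $a = P(Y{=}1\mid X{=}0)$ and $b = P(Y{=}0\mid X{=}1)$ (an assumed convention), lets $t$ be the weight of the bit-flip map, and evaluates the range of the two-sided capacity under the assumption that it equals the total weight of the two invertible maps.

```python
from fractions import Fraction

def decomposition_family(a, b):
    """Feasible interval (t_min, t_max) for the weight t of the flip map.

    For the channel [[1-a, a], [b, 1-b]] the weights are forced to be
    identity: 1-a-b+t, flip: t, constant-0: b-t, constant-1: a-t,
    and nonnegativity gives max(0, a+b-1) <= t <= min(a, b).
    """
    return max(0, a + b - 1), min(a, b)

def weights(a, b, t):
    """Weights of the four deterministic binary channels for a given t."""
    return {"identity": 1 - a - b + t, "flip": t,
            "const0": b - t, "const1": a - t}

def both_sides_capacity_range(a, b):
    """Range of weight(identity) + weight(flip) = 1 - a - b + 2t over t."""
    t_min, t_max = decomposition_family(a, b)
    return 1 - a - b + 2 * t_min, 1 - a - b + 2 * t_max

p = Fraction(1, 4)
print(both_sides_capacity_range(p, p))   # binary symmetric channel
print(both_sides_capacity_range(0, p))   # Z-channel, assuming input 0 is noiseless
```

With $a = 0$ the interval for $t$ collapses to a single point, which is consistent with the uniqueness of the decomposition for the Z-channel; with $a = b = 1/2$ the range is from 0 to 1, matching the values quoted in the introduction.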
4 Proofs of Main Results
It is clear that is bounded, closed, and convex, so it can be easily shown that is a closed interval and that for all and are attained at certain vertices of (Theorem A.1). As such, it is of great importance to study the structure of . A series of results on the vertices of is provided in Appendix A. Although these results shed useful light on the structure of , the characterizations are still too coarse for our purpose. It will be seen that additional insights can be gained by taking the objective functions into consideration.
4.1 and
We first provide a complete characterization of that achieves or .
Proposition 4.1.
Let
[TABLE]
and
[TABLE]
For , iff there is no such that ; iff there is no such that .
Proof.
It suffices to prove the first part, because the second part can be proved in the same vein.
(Sufficiency) If there exists some such that , then
[TABLE]
and
[TABLE]
so that and , a contradiction.
(Necessity) For , if , then there is a vector such that , , and . Let . For sufficiently small , it can be verified that and , which is absurd. ∎
Definition 4.2.
A subset is said to be -minimized, or succinctly, -minimized (resp., -maximized) if there is a such that and (resp., ), where .
A simple consequence of Proposition 4.1 is:
Proposition 4.3.
If is -minimized (resp., -maximized), then any supported on achieves (resp., ), where . As a consequence, any nonempty subset of is also -minimized (resp., -maximized).
By Proposition 4.3, it is important to identify patterns of sets that are not -minimized or -maximized. Some simple patterns that are not -minimized or -maximized are given as follows and their proofs are relegated to Appendix C.
Proposition 4.4.
If , then any deterministic perfect channels , …, such that at least one column of has a weight greater than one are not -minimized.
Proposition 4.5.
If , then any deterministic perfect channels , …, such that at least one column of has no entry equal to are not -minimized.
Proposition 4.6.
For , if , then is not -maximized.
The next result is a generalization of the Birkhoff-von Neumann theorem, which plays a crucial role in proving Theorems 3.3 and 3.4. Our proof hinges on an extension of the ideas in [4, 5].
Theorem 4.7.
Let and be two -dimensional integer-valued vectors such that , namely, for . Let
[TABLE]
and , where denotes the -dimensional all-one row vector. If is not empty, then is convex and the vertices of are exactly the matrices in .
Proof.
It is clear that , if nonempty, is a convex set. We will show that any matrix with non-integer entries cannot be a vertex of . There are two cases:
Case (a): There is a non-integer entry in a non-boundary column.
Case (b): All non-integer entries are in the boundary columns.
Here, a column is called a boundary column if its sum is either or , where is the index of the column.
In either case, we can pick a non-integer entry, say the entry, which in Case (a) must be a non-integer entry in a non-boundary column. By the following argument, we will find a chain or loop of non-integer entries of the matrix, which will be used to prove that the matrix is not extremal.
Because the entry is not an integer, there exists at least another entry in the same row that is also not an integer, say the entry. If the -th column is not on the boundary, then we are done. If however the -th column is on the boundary, then there exists at least another non-integer entry in the same column, say . In general, after steps, we have visited columns, with the chain
[TABLE]
Except for the -th column, every column has exactly one inbound entry and one outbound entry , where . Now in the -th step, by the same argument, we find the entry in the -th column. If this column has already been visited, then for some and we are done. If this column is new but not on the boundary, we are also done. If, however, this new column is on the boundary, then we can further find an outbound entry in this column, say , and proceed to the -th step. Because there are finitely many columns, we will always end up with a chain
[TABLE]
which only happens in Case (a), or a loop
[TABLE]
for some .
Then we can construct a matrix by setting all outbound entries (in the chain or the loop) , all inbound entries , and all other entries to be zero. It is clear that
[TABLE]
in the former case and
[TABLE]
in the latter case, where .
Let and . It is clear that for sufficiently small . It is also clear that and , that is, is not a vertex of .
Therefore, we have , where denotes the set of all vertices of . It remains to show that . For any , if with and , then for every ,
[TABLE]
which however implies that for every , or . ∎
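When the matrix is square and both bound vectors equal the all-one vector, the family reduces to the doubly stochastic matrices, whose deterministic members are the permutation matrices, so Theorem 4.7 reduces to the classical Birkhoff-von Neumann theorem. The sketch below illustrates only that special case, using the textbook peeling procedure rather than the chain-and-loop argument of the proof; the function name and the brute-force search are our own choices.

```python
from fractions import Fraction
from itertools import permutations

def birkhoff_decomposition(M):
    """Write a doubly stochastic matrix as a mixture of permutation matrices.

    Returns (weight, perm) pairs, where perm[i] is the column of the 1 in
    row i. The search over permutations is brute force, so this is meant
    for tiny matrices with exact rational entries only.
    """
    n = len(M)
    R = [[Fraction(v) for v in row] for row in M]   # residual matrix
    parts = []
    while any(v > 0 for row in R for v in row):
        # a permutation supported on the positive entries of the residual;
        # one exists because the residual has equal row and column sums
        perm = next(s for s in permutations(range(n))
                    if all(R[i][s[i]] > 0 for i in range(n)))
        w = min(R[i][perm[i]] for i in range(n))
        for i in range(n):
            R[i][perm[i]] -= w
        parts.append((w, perm))
    return parts

half = Fraction(1, 2)
M = [[half, half, 0],
     [half, 0, half],
     [0, half, half]]
for weight, perm in birkhoff_decomposition(M):
    print(weight, perm)
```

Every map returned has all column sums equal to one, i.e., it is a deterministic channel satisfying the same column-sum constraints as the matrix being decomposed.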
Equipped with Theorem 4.7, we proceed to derive bounds for the lower and the upper rank probabilities (Definition 3.2). These bounds are useful in estimating the lower and the upper intrinsic capacities.
Proposition 4.8.
[TABLE]
[TABLE]
where
[TABLE]
[TABLE]
Proof.
By Theorem 4.7, can be expressed as a convex combination of deterministic channels of rank if , in which case, . Otherwise, let be the index of the column with the sum . Consider the convex combination
[TABLE]
It is clear that cannot be a convex combination of deterministic channels of rank unless the sum of its -th column is . To this end, we set , which is the minimum value required, and we have
[TABLE]
and
[TABLE]
for , so that .
If has the following convex decomposition
[TABLE]
then is a valid stochastic matrix iff for all . Therefore, . ∎
Proposition 4.9.
If achieves , then . In particular, if , then and , where .
Proof.
If is zero on all deterministic useless channels, then .
If for some , then must be zero on all deterministic channels whose -th column weight is less than (Propositions 4.3 and 4.6). Therefore, we must have (Proposition 4.8) and . ∎
Proposition 4.10.
If , then
[TABLE]
where
[TABLE]
and
[TABLE]
Furthermore, if , then .
If , then
[TABLE]
where
[TABLE]
If , then .
Proof.
If , then the sum of every column of a deterministic channel of rank is at most 1, and for every , admits a convex decomposition into deterministic channels with the -th column sum at most . Thus for every and every ,
[TABLE]
so that
[TABLE]
for and hence . If , which implies that for all , then (Theorem 4.7).
If , then the sum of every column of a deterministic channel of rank is at least , so that, for every and every ,
[TABLE]
and hence . If , which implies for all , then (Theorem 4.7). ∎
We are now ready to prove Theorems 3.3 and 3.4.
Proof of Theorem 3.3.
To find an upper bound of , we need to find a convex decomposition of as “bad” as possible. To this end, we can first extract from a collection of useless channels with the total probability (Proposition 4.8), that is,
[TABLE]
If , then ; otherwise,
[TABLE]
It is clear that , where denotes the all- row vector. The best deterministic channels in are those with the number of nonzero columns maximized. The rank of those matrices is
[TABLE]
so (Theorem 4.7).
Let be a vertex of that attains . Then
[TABLE]
Finally, the special case of or can be easily verified. ∎
Proof of Theorem 3.4.
Let be a vertex of that attains .
If or or , then for all (Proposition 4.9), so that (Proposition 4.8). The remaining case is then .
To find a lower bound of , we need to find a convex decomposition of as “good” as possible. It is clear that , so is bounded below by the capacity of the worst deterministic channels in (Theorem 4.7), which are obviously those with the number of nonzero columns minimized. The capacity of such a channel is , so that .
On the other hand,
[TABLE]
where . The remaining part of the proof is straightforward. ∎
The bounds given by Theorems 3.3 and 3.4 can be improved in various ways. In Theorem 3.3, if , then the upper bound for in Proposition 4.10 can be used to improve the upper bound for ; if , the upper bound for can be improved by Proposition 4.4 (see Example C.2). The lower bound for can also be improved by because . However, all these improvements are somewhat ad hoc. The fundamental problem to be solved is how we can choose in order to approach or achieve the lower or the upper intrinsic capacities. In particular, based on Theorem 3.3, we have the following conjecture:
Conjecture 4.11.
For , if , then .
4.2 and
Although it is difficult to compute and in general, their exact values can be determined in the binary-output case, as is shown by Theorem 3.5.
Proof of Theorem 3.5.
Since , we only need to choose two maps from all the maps of into for constructing the capacity-achieving distributions. We denote these two maps by and . The optimal strategy for choosing is to maximize and minimize , where . There are only two classes of deterministic channels in , rank and rank . For of rank , it does not matter how the values of and are chosen. For of rank , however, we choose such that and choose such that . Then we have
[TABLE]
and
[TABLE]
By Proposition 4.8, the maximum of is with each being the maximum of feasible values of , so that
[TABLE]
Observing that these two rows are exactly those of , we further have . Again by Proposition 4.8, the minimum of is . With no loss of generality, we suppose . Then the minima of feasible values of and are and [math], respectively, so that
[TABLE]
The fact that, for binary-output channels, the lower intrinsic capacity with the state available at the encoder only coincides with the capacity of the channel is quite intriguing, although it does not hold in general when the output is nonbinary (Example E.1). It implies that every binary-output channel can be simulated in such a way that the capacity cannot be increased even when the encoder has causal access to the source of randomness, i.e., the intrinsic state. The following result shows that, in fact, for a fairly broad class of channels, the causal state information at the encoder is useless as far as the capacity is concerned.
Theorem 4.12.
Let , where is a channel with binary output and is a channel with binary input and . Suppose
[TABLE]
where denotes the channel state and is its distribution. The capacity of cannot be increased by the causal state information at the encoder iff all with are -ended for some fixed and , where a binary output channel is said to be -ended if and . In other words, all row vectors of are contained in the line segment from endpoint to endpoint .
Proof.
(Sufficiency) By [1, Theorem 7.2 and Remark 7.6], we consider the channel given by and
[TABLE]
Because every channel is -ended, it is easy to show that is also -ended, where and are regarded as two constant maps from to . Then every row vector of is contained in the line segment between and , which implies that has a capacity-achieving input probability distribution supported on (Proposition D.1), and consequently the capacity of cannot be increased by the causal state information at the encoder.
(Necessity) If the capacity of cannot be increased by its causal state information at the encoder, then a capacity-achieving input probability distribution of must have a support, say , so that for every map , the vector
[TABLE]
is contained in the line segment between and (Proposition D.2), where and are understood as two constant maps from to . With no loss of generality, we assume . For any and any , we can take and for , then we get , so that . Similarly, we have . Therefore, every is -ended. ∎
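The sufficiency argument works with the channel whose inputs are maps from states to channel inputs (Shannon strategies), following [1, Theorem 7.2]. The sketch below builds that strategy channel for a state distribution over deterministic maps and estimates its capacity with the standard Blahut-Arimoto iteration. It is an illustration under our own conventions (`capacity_bits` and `strategy_channel` are hypothetical names, maps are tuples, and the example decompositions are assumptions), not code from the paper.

```python
from itertools import product
from math import exp, log

def capacity_bits(W, iters=5000, tol=1e-12):
    """Blahut-Arimoto estimate (a lower bound) of the capacity of W, in bits."""
    n = len(W)
    p = [1.0 / n] * n                    # start from the uniform input
    lower = 0.0
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(n)) for y in range(len(W[0]))]
        # c[x] = exp(D(W[x] || q)), with the divergence measured in nats
        c = [exp(sum(v * log(v / q[y]) for y, v in enumerate(row) if v > 0))
             for row in W]
        z = sum(px * cx for px, cx in zip(p, c))
        lower, upper = log(z), log(max(c))   # lower <= capacity <= upper
        p = [px * cx / z for px, cx in zip(p, c)]
        if upper - lower < tol:
            break
    return lower / log(2)

def strategy_channel(states, n_in, n_out):
    """Channel from Shannon strategies u: state -> input to the output.

    states is a list of (probability, f) with f a deterministic map (tuple).
    Row u has entries P(Y = y | u) = sum of P(s) over states s with f_s(u(s)) = y.
    """
    rows = []
    for u in product(range(n_in), repeat=len(states)):
        row = [0.0] * n_out
        for (prob, f), x in zip(states, u):
            row[f[x]] += prob
        rows.append(row)
    return rows

# Z-channel with crossover 1/4, assumed decomposition: identity and constant 0
z_states = [(0.75, (0, 1)), (0.25, (0, 0))]
print(capacity_bits(strategy_channel(z_states, 2, 2)))
print(capacity_bits([[1.0, 0.0], [0.25, 0.75]]))
```

In this Z-channel example the two printed values agree up to numerical tolerance, so the encoder gains nothing from the state. By contrast, for the all-$1/2$ binary channel decomposed into the identity and the bit flip, the strategy channel is perfect although the channel itself has zero capacity.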
It can be shown via a perturbation and continuity argument that the uselessness of the causal state information at the encoder is not restricted to the channels covered by Theorem 4.12. However, we have not been able to identify a simple explicit condition under which the sufficiency part of Theorem 4.12 can be extended. For example, consider a seemingly natural condition postulated by the following conjecture.
Conjecture 4.13.
Let be a channel from to . Suppose
[TABLE]
where denotes the state of the channel. If for every , and have an order (either or ) independent of , then the capacity of cannot be increased by the causal state information available at the encoder.
This conjecture is obviously true for . Numerical results indicate that it also holds in many cases when . However, it turns out to be false in general, as shown by Example E.2.
Theorem 4.12 imposes no restriction on the distribution of the channel state. This universal property motivates us to introduce the following definition.
Definition 4.14.
The state information of a channel is said to be universally useless at the encoder if for any , the capacity of with causally available at the encoder is equal to the capacity of .
This definition is not void in view of Theorem 4.12 (in fact, according to our numerical results, many channels not covered by Theorem 4.12 also satisfy this definition). Now consider the channel model shown in Fig. 3, where the channel state is distributed according to , and (noisy) state observations and generated by through are causally available at the encoder and the decoder, respectively. Let denote the capacity of this channel model.
It is instructive to study the following example (see also Fig. 4) where
[TABLE]
For this example, we assume that is a binary symmetric channel with crossover probability , and is a binary symmetric channel with crossover probability ; furthermore, we assume that is physically degraded with respect to when , and the other way around when . To gain a better understanding, we plot against for in Fig. 5. It turns out that, somewhat counterintuitively, is maximized when the encoder side information coincides with the decoder side information (i.e., ) rather than when the encoder has access to the perfect state information (i.e., ). As shown by the following theorem, this is in fact a general phenomenon for any channel whose state information is universally useless at the encoder.
Theorem 4.15.
If the state information of is universally useless at the encoder, then is maximized when almost surely (assuming is fixed but can be arbitrary).
Proof.
It is clear that among all possible forms of encoder side information , is maximized when (since any other form of can be viewed as its degenerate version), i.e.,
[TABLE]
Note that
[TABLE]
where (a) follows from the universal-uselessness property of the state information of , and the constant means no information. This completes the proof. ∎
Roughly speaking, Theorem 4.15 implies that, for the class of channels satisfying Definition 4.14, what the encoder really needs to know is not the state information, but the decoder’s knowledge of the state information; in other words, for such channels, it is important to maintain consensus between the encoder and the decoder. It is also worth noting that Theorem 4.15 reduces to Definition 4.14 when there is no decoder side information.
Another surprising phenomenon revealed by Fig. 5 is that, as moves away from , the capacity not only decreases but actually drops to the value corresponding to the case of no encoder side information once passes certain thresholds. Again, such a phenomenon is not confined to that specific example. An investigation of this phenomenon in the context where the encoder side information is a degenerate version of the decoder side information can be found in [6].
Similar to Theorem 3.5, we can also determine the exact values of and when the input is binary. In this case, we have for all , so that and (see Proposition 3.6 and Appendix F). The general case of and is however quite difficult. Currently, we only know that does not hold in general (Proposition F.1).
5 Conclusion
We have studied the lower and the upper intrinsic capacities of a channel , denoted by and , for three different scenarios () in terms of the availability of the causal state information at the encoder and/or the decoder. Their values are determined in almost all cases when the input or the output is binary, with only two exceptions (which are the binary-input nonbinary-output channels for and the nonbinary-input binary-output channels for ). A deeper understanding of the relevant optimization problems (especially the structure of ) is needed for further progress.
The lower and the upper intrinsic capacities are inherent properties of a channel with clear operational meanings. In particular, they characterize the potential capacity gains that can be achieved when the encoder and/or the decoder has direct access to the generator of channel randomness. More generally, the notion of intrinsic capacity provides a useful perspective for studying the values of encoder and decoder side information. For example, our analysis of reveals that for a broad class of channels, the capacity is not necessarily maximized when the encoder has access to the perfect state information. We believe that this surprising finding is just the tip of the iceberg, and this line of research can be fruitfully pursued to uncover many previously unknown phenomena.
Appendix A The Structure of
Theorem A.1.
The set is a bounded, closed convex polytope. For each , is a closed interval and can be attained at some vertex of . Furthermore, can also be attained at some vertex of .
Proof.
By definition, it is clear that is a bounded, closed convex polytope, so that is a closed interval (Proposition B.2). It is also easy to see that attains its maximum at some vertex of and that attains its minimum at some vertex of (Proposition B.2 and [7, Proposition 3.4.1]). ∎
In light of Theorem A.1, we proceed to study the structure of with a focus on its vertices. Our approach is analogous to [4].
Proposition A.2.
Let
[TABLE]
or
[TABLE]
where
[TABLE]
is called the incidence matrix. A probability distribution is a vertex iff for , implies , or in other words, iff .
Proof.
Note that for every ,
[TABLE]
(Sufficiency) If for some and some , then and , so that and , hence , and therefore is a vertex.
(Necessity) For every nonempty , there is a vector such that and . Let and with , so that with . Since is a vertex, and must not be elements of for all , or equivalently, . ∎
Below are several easy consequences of Proposition A.2.
Proposition A.3.
Let
[TABLE]
A probability distribution is a vertex iff is minimal in , where a minimal pattern in is a set such that for some and for every , implies .
Proposition A.4.
If is a vertex, then
[TABLE]
Sketch of Proof.
Because of (11), the equations have at most linearly independent equations. This number can be further reduced to by utilizing the information of , because all the variables with must be zero if the equation . The remaining part of the proof is then straightforward. ∎
Proposition A.4 provides an upper bound for the support size of a vertex in . On the other hand, the following result provides a lower bound for the support size of points in , including all the vertices of .
Proposition A.5.
For any ,
[TABLE]
where .
Proof.
By the definition of , we have
[TABLE]
Since is either [math] or , the right-hand side can yield at most different values, so that
[TABLE]
or .
On the other hand, every equation
[TABLE]
must have at least one positive for some
[TABLE]
Since for every , the sets , , …, are mutually disjoint, we conclude that . ∎
Algorithm A.6.
Let be an arbitrary one-to-one map of onto . The following algorithm with and as arguments can yield a vertex of .
function vertex()
, ,
while and do
end while
return
end function
Sketch of Proof.
Let be the vertex output by the algorithm. Let . Then by checking Algorithm A.6, it is easy to verify that for every , there exists an such that and for all with , so that . ∎
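As a concrete illustration of the kind of procedure Algorithm A.6 describes, the following Python sketch greedily decomposes a channel matrix into deterministic channels, visiting them in a prescribed order. It is a minimal sketch under stated assumptions and may differ from the authors' pseudocode: the names `greedy_decomposition` and `reconstruct`, the encoding of a deterministic channel as a tuple `f` with `f[x]` the output for input `x`, and the rule of giving each channel the largest weight the residual matrix allows are all choices made for this example. The code only confirms that the result is a valid convex decomposition; it does not test the vertex property.

```python
from fractions import Fraction
from itertools import product


def greedy_decomposition(W, order=None):
    """Greedily write the row-stochastic matrix W as a convex combination
    of deterministic channels (hypothetical helper, not from the paper).

    W      : list of rows, W[x][y] = probability of output y given input x.
    order  : iterable of deterministic channels (tuples f with f[x] in
             range(number of outputs)); defaults to lexicographic order.
    Returns a dict mapping each deterministic channel with positive
    weight to that weight.
    """
    n_in, n_out = len(W), len(W[0])
    residual = [list(row) for row in W]
    if order is None:
        order = product(range(n_out), repeat=n_in)
    weights = {}
    for f in order:
        # Largest weight that keeps every residual entry nonnegative.
        w = min(residual[x][f[x]] for x in range(n_in))
        if w > 0:
            weights[f] = w
            for x in range(n_in):
                residual[x][f[x]] -= w
    return weights


def reconstruct(weights, n_in, n_out):
    """Rebuild the channel matrix from a weighted set of deterministic channels."""
    W = [[Fraction(0)] * n_out for _ in range(n_in)]
    for f, w in weights.items():
        for x in range(n_in):
            W[x][f[x]] += w
    return W


# Binary symmetric channel with crossover probability 1/4, in exact arithmetic.
W = [[Fraction(3, 4), Fraction(1, 4)],
     [Fraction(1, 4), Fraction(3, 4)]]
weights = greedy_decomposition(W)
```

Exact rational arithmetic (`Fraction`) is used so that the check `reconstruct(weights, 2, 2) == W` holds without floating-point tolerances. Passing a different `order` plays the role of choosing a different one-to-one map in the algorithm.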
Remark A.7.
We can replace the map in Algorithm A.6 with some one-to-one map , where . Then we have a modified algorithm returning a pair such that
[TABLE]
Suppose the nontrivial case , so that . Let . If we have another algorithm to find a vertex of , say , then it is easy to show that is a vertex of .
Appendix B Properties of and
This section provides some basic results on the analytic properties of and defined in Section 3. For any ,
[TABLE]
is called the statistical distance on . Given the product space , we define its product metric by
[TABLE]
which induces the usual product topology. Thus for any channels , we have the channel distance
[TABLE]
Proposition B.1.
(a) is uniformly continuous, and it is convex in for fixed and is concave in for fixed .
(b) is uniformly continuous, and it is linear in for fixed and is concave in for fixed .
Proof.
(a) The function can be rewritten as where
[TABLE]
with . By Proposition B.4, for ,
[TABLE]
so that is uniformly continuous, and hence is uniformly continuous (Proposition B.6). It is also clear that is linear, so that is convex for fixed and is concave for fixed ([8, Theorem 2.7.4]).
(b) The function can be written as where . By Propositions B.3 and B.4, is uniformly continuous on and is bounded by . Then for and , we have
[TABLE]
which implies that is uniformly continuous. The remaining part is straightforward ([8, Theorem 2.7.4]). ∎
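The convexity and concavity claims in part (a) can be spot-checked numerically. The following Python sketch assumes that the function in part (a) is the mutual information viewed as a function of the input distribution and the channel matrix, as the citation of [8, Theorem 2.7.4] suggests. The helper names `mutual_information` and `mix` and the test matrices are hypothetical and chosen only for illustration.

```python
import math


def mutual_information(p, W):
    """I(X;Y) in bits for input distribution p and channel matrix W (rows = inputs)."""
    n_in, n_out = len(W), len(W[0])
    q = [sum(p[x] * W[x][y] for x in range(n_in)) for y in range(n_out)]
    return sum(
        p[x] * W[x][y] * math.log2(W[x][y] / q[y])
        for x in range(n_in)
        for y in range(n_out)
        if p[x] > 0 and W[x][y] > 0
    )


def mix(a, b, t):
    """Entrywise convex combination t*a + (1-t)*b of two vectors or matrices."""
    if isinstance(a[0], list):
        return [mix(ra, rb, t) for ra, rb in zip(a, b)]
    return [t * u + (1 - t) * v for u, v in zip(a, b)]


p1, p2 = [0.9, 0.1], [0.2, 0.8]
W1 = [[0.9, 0.1], [0.1, 0.9]]
W2 = [[0.6, 0.4], [0.3, 0.7]]
t = 0.5

# Concavity in the input distribution for a fixed channel:
# I(mixture of inputs) >= mixture of I, so this gap should be nonnegative.
concave_gap = mutual_information(mix(p1, p2, t), W1) - (
    t * mutual_information(p1, W1) + (1 - t) * mutual_information(p2, W1)
)

# Convexity in the channel for a fixed input distribution:
# mixture of I >= I(mixture of channels), so this gap should be nonnegative.
convex_gap = (
    t * mutual_information(p1, W1) + (1 - t) * mutual_information(p1, W2)
) - mutual_information(p1, mix(W1, W2, t))
```

A single pair of points does not prove convexity, of course; the check is only a sanity test of the direction of each inequality.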
Proposition B.2.
For , is uniformly continuous and convex (and in fact linear for ).
Sketch of Proof.
Use Propositions B.1 and B.7 for or . The case of is trivial because is a linear function of . ∎
Proposition B.3 ([9, Theorem 2]).
For and ,
[TABLE]
where .
Proposition B.4 (cf. [10, Lemma 3]).
For and ,
[TABLE]
and
[TABLE]
Proposition B.5 (cf. [10, Lemma 3]).
For and ,
[TABLE]
so that is uniformly continuous on .
Sketch of Proof.
Use the triangle inequality and Propositions B.3, B.4 and B.8. ∎
Proposition B.6.
Let be a map from to . If is uniformly continuous, then is uniformly continuous on , where and .
Sketch of Proof.
Use Propositions B.3 and B.5 and the observation that is a composition of uniformly continuous maps. ∎
Proposition B.7.
If is uniformly continuous on , then is uniformly continuous.
Proof.
Since is uniformly continuous, for any , there is a such that for any and any , implies . In other words, for any , implies . Then
[TABLE]
and similarly, , so that is uniformly continuous. ∎
Proposition B.8.
For and ,
[TABLE]
Proof.
[TABLE]
∎
Appendix C Proofs and Examples of Section 4.1
Proof of Proposition 4.4.
Let and be the column such that . It is clear that for some and some such that , so that , and hence , …, are not -minimized. ∎
Proof of Proposition 4.5.
Let and be the column of which all entries are less than . It is clear that for some and some such that , so that , and hence , …, are not -minimized. ∎
Proof of Proposition 4.6.
With no loss of generality, we assume that . It is then clear that
[TABLE]
where
[TABLE]
and
[TABLE]
It is clear that and , so that
[TABLE]
and therefore is not -maximized. ∎
Example C.1.
If
[TABLE]
which is the probability transition matrix seen in the well-known random binning scheme, then and (Theorems 3.3 and 3.4).
Example C.2.
[TABLE]
It can be computed using linear programming that and . The decompositions of for and are
[TABLE]
and
[TABLE]
respectively. Using Theorems 3.3 and 3.4 and Proposition 4.10, we have
[TABLE]
and
[TABLE]
From Proposition 4.4, it follows that the optimal decomposition for can have at most one perfect channel, so that , where
[TABLE]
is computed by the formula in Theorem 3.3. Then we have an improved bound: .
Appendix D Capacity-Achieving Input Probability Distributions
Let be a channel in . According to [11, Theorem 4.5.1], an input probability distribution maximizes the mutual information iff
[TABLE]
and
[TABLE]
where . Based on this necessary and sufficient condition, we have the following results concerning the support of capacity-achieving input probability distributions. In the sequel, we denote by the convex hull of all vectors in .
Proposition D.1.
Let . If all row vectors of are contained in , then there exists a capacity-achieving probability distribution such that .
Proof.
Let be a capacity-achieving probability distribution of the submatrix . Extending with zero values, we obtain a probability distribution over . It is clear that
[TABLE]
and
[TABLE]
where . It remains to show that
[TABLE]
which is obvious, because
[TABLE]
for some nonnegative coefficients with . ∎
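The optimality condition of [11, Theorem 4.5.1] and the statement of Proposition D.1 can be checked numerically on a small example. The Python sketch below uses the standard Blahut-Arimoto iteration, a technique not used in the paper and swapped in here purely for illustration, to approximate a capacity-achieving input distribution. It then tests that the divergence between each row of the channel matrix and the output distribution equals the capacity on the support and does not exceed it elsewhere. All function names, the iteration count, and the example matrix are assumptions of this sketch. The third row of the example is the average of the first two, so, as we read Proposition D.1, a capacity-achieving distribution supported on the first two inputs should exist.

```python
import math


def output_dist(p, W):
    """Output distribution q(y) = sum_x p(x) W(y|x)."""
    return [sum(p[x] * W[x][y] for x in range(len(W))) for y in range(len(W[0]))]


def row_divergences(W, q):
    """D(W(.|x) || q) in bits for every input x."""
    return [
        sum(w * math.log2(w / q[y]) for y, w in enumerate(row) if w > 0)
        for row in W
    ]


def blahut_arimoto(W, iters=2000):
    """Approximate a capacity-achieving input distribution of W.

    Returns (p, capacity, d), where d[x] = D(W(.|x) || q) at the final p.
    """
    n = len(W)
    p = [1.0 / n] * n
    for _ in range(iters):
        d = row_divergences(W, output_dist(p, W))
        unnorm = [p[x] * 2.0 ** d[x] for x in range(n)]
        total = sum(unnorm)
        p = [u / total for u in unnorm]
    d = row_divergences(W, output_dist(p, W))
    capacity = sum(p[x] * d[x] for x in range(n))
    return p, capacity, d


def satisfies_optimality_condition(p, d, capacity, tol=1e-9, support_tol=1e-6):
    """Check d[x] == capacity on the support of p and d[x] <= capacity elsewhere."""
    on_support = all(abs(dx - capacity) <= tol for px, dx in zip(p, d) if px > support_tol)
    off_support = all(dx <= capacity + tol for dx in d)
    return on_support and off_support


# Rows 0 and 1 form a binary symmetric channel; row 2 lies in their convex hull.
W = [[0.9, 0.1],
     [0.1, 0.9],
     [0.5, 0.5]]
p, capacity, d = blahut_arimoto(W)
condition_holds = satisfies_optimality_condition(p, d, capacity)
```

In this example the mass on the third input decays geometrically under the iteration, which is consistent with a capacity-achieving distribution supported on the first two inputs.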
Proposition D.2.
Let be a capacity-achieving probability distribution of and let . For any and any , .
Proof.
It is clear that for all , where . We first show that , which corresponds to the case . If it is false, then
[TABLE]
where , , and . It is clear that for all , so that
[TABLE]
a contradiction. Now suppose that
[TABLE]
for some with . Let . Then
[TABLE]
where and . It is clear that , and therefore
[TABLE]
so that , which is absurd. ∎
Appendix E Counterexamples for Section 4.2
Example E.1.
* for*
[TABLE]
Proof.
Let . It is then clear that, for every ,
[TABLE]
If we define the map by
[TABLE]
then the row vector is always on the line segment with endpoints and .
By numerical computation, we know that
[TABLE]
where
[TABLE]
and
[TABLE]
is the capacity-achieving input distribution of . Furthermore, it can be verified that all points of satisfy
[TABLE]
This implies that , if extended to , cannot be a capacity-achieving distribution ([11, Theorem 4.5.1]). In other words, for every , the intrinsic capacity , so that . ∎
Example E.2.
Let state alphabet and let
[TABLE]
where
[TABLE]
and
[TABLE]
It is easy to show that is the capacity-achieving input distribution for , so that the output distribution is
[TABLE]
and . However, for the channel given by
[TABLE]
if we choose the map , then the corresponding row vector
[TABLE]
and . This implies that , if extended to , cannot be a capacity-achieving distribution for ([11, Theorem 4.5.1]). In other words, the capacity of can be increased by the causal state information at the encoder.
Appendix F and
Proof of Proposition 3.6.
Because , the binary uniform distribution is capacity-achieving for every deterministic channel, rank or rank . Thus we have for every . The remaining part is an easy consequence of Theorems 3.3 and 3.4. ∎
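The first step of this proof can be illustrated by brute force. The Python sketch below assumes a binary input alphabet (an assumption of this example) and enumerates every deterministic channel into a small output alphabet. For a deterministic channel the mutual information equals the output entropy, and the capacity equals the logarithm of the number of distinct outputs, so it suffices to compare the output entropy under the uniform input with that logarithm. The names `output_entropy_uniform_input` and `deterministic_capacity` and the choice of three outputs are hypothetical.

```python
import math
from itertools import product


def output_entropy_uniform_input(f):
    """H(f(X)) in bits for X uniform over the inputs.

    For a deterministic channel this equals I(X;Y).
    """
    n = len(f)
    counts = {}
    for y in f:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def deterministic_capacity(f):
    """Capacity in bits of the deterministic channel x -> f[x]."""
    return math.log2(len(set(f)))


num_outputs = 3  # illustrative output alphabet size
checks = [
    (f, output_entropy_uniform_input(f), deterministic_capacity(f))
    for f in product(range(num_outputs), repeat=2)  # two inputs: binary input alphabet
]
uniform_is_optimal = all(abs(h - c) < 1e-12 for _, h, c in checks)
```

Constant maps give capacity zero, which any input distribution attains, and one-to-one maps give capacity one bit, which the uniform input attains.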
Proposition F.1.
Let be a channel . If all probabilities are distinct and the sum of each column of is greater than or equal to , then .
Proof.
By Proposition 4.10, , so that can be expressed as a convex combination of perfect channels and hence .
Let
[TABLE]
If , then there exists a such that the capacity-achieving input distribution, denoted , is capacity-achieving for every perfect channel . Thus at least one entry of must be . With no loss of generality, we assume .
If and are both positive, then is capacity-achieving only for perfect channels
[TABLE]
By Proposition A.5, every satisfies , which implies that is not capacity-achieving for .
If , then is capacity-achieving for perfect channels
[TABLE]
However, any convex combination of these four matrices can only yield a channel matrix with at most four distinct probability values, and hence is not capacity-achieving for .
In all cases, we have shown that is not capacity-achieving, which contradicts the assumption . Therefore, we have . ∎
- [1] A. A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge; New York: Cambridge University Press, 2011.
- [2] J. Wang, J. Chen, L. Zhao, P. Cuff, and H. Permuter, “On the role of the refinement layer in multiple description coding and scalable coding,” IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1443–1456, Mar. 2011.
- [3] H. Nikaidô, “On von Neumann’s minimax theorem,” Pacific Journal of Mathematics, vol. 4, no. 1, pp. 65–72, Mar. 1954.
- [4] W. Jurkat and H. Ryser, “Extremal configurations and decomposition theorems. I,” Journal of Algebra, vol. 8, no. 2, pp. 194–222, Feb. 1968.
- [5] R. M. Caron, X. Li, P. Mikusiński, H. Sherwood, and M. D. Taylor, “Nonsquare ‘doubly stochastic’ matrices,” in Institute of Mathematical Statistics Lecture Notes - Monograph Series. Hayward, CA: Institute of Mathematical Statistics, 1996, pp. 65–75.
- [6] R. Xu, J. Chen, T. Weissman, and J.-K. Zhang, “When is noisy state information at the encoder as useless as no information or as good as noise-free state?” IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 960–974, Feb. 2017.
- [7] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization, ser. Athena Scientific optimization and computation. Belmont, Mass: Athena Scientific, 2003, no. 1.
- [8] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, N.J: Wiley-Interscience, 2006.
